[QE-users] Out of Memory on large available RAM for QE-GPU version
Romero Molina, Sandra
sandra.romero-molina at uni-due.de
Tue Jan 26 12:58:00 CET 2021
Dear Community,
I have compiled Quantum ESPRESSO (program PWSCF v.6.7MaX) with GPU acceleration (hybrid MPI/OpenMP) using the following options:
module load compiler/intel/2020.1
module load hpc_sdk/20.9
./configure F90=pgf90 CC=pgcc MPIF90=mpif90 --with-cuda=yes --enable-cuda-env-check=no --with-cuda-runtime=11.0 --with-cuda-cc=70 --enable-openmp BLAS_LIBS='-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core'
make -j8 pw
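As a sanity check that the resulting binary really was built with CUDA support, something along these lines can be used (assuming the standard QE source-tree layout, with make.inc at the top level and the binary under bin/):

grep -i cuda make.inc
ldd bin/pw.x | grep -iE 'cuda|cublas|cufft'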
The compilation apparently finishes successfully, and I then execute the program:
module load compiler/intel/2020.1
module load hpc_sdk/20.9
export OMP_NUM_THREADS=1
mpirun -n 2 /home/my_user/q-e-gpu-qe-gpu-6.7/bin/pw.x < silverslab32.in > silver4.out
The program then starts and prints:
Parallel version (MPI & OpenMP), running on 8 processor cores
Number of MPI processes: 2
Threads/MPI process: 4
...
GPU acceleration is ACTIVE
...
Estimated max dynamical RAM per process > 13.87 GB
Estimated total dynamical RAM > 27.75 GB
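For comparison, the memory available on each GPU of the node can be queried with (assuming nvidia-smi is in the path on gpu001):

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv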
But after about 2 minutes of execution the job ends with the error:
0: ALLOCATE: 4345479360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 4345482096 bytes requested; status = 2(out of memory)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[47946,1],1]
Exit code: 127
--------------------------------------------------------------------------
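For reference, each of the failed allocation requests above is 4345479360 bytes, i.e. 4345479360 / 1024^3 ≈ 4.05 GiB.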
This node has more than 180 GB of RAM available. According to top, the memory consumption of the two processes is:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
89681 my_user 20 0 30.1g 3.6g 2.1g R 100.0 1.9 1:39.45 pw.x
89682 my_user 20 0 29.8g 3.2g 2.0g R 100.0 1.7 1:39.30 pw.x
When the RES memory of each process reaches about 4 GB, the processes stop and the error above is displayed.
These are the characteristics of the node:
(base) [my_user at gpu001]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 95313 MB
node 0 free: 41972 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 96746 MB
node 1 free: 70751 MB
node distances:
node 0 1
0: 10 21
1: 21 10
(base) [my_user at gpu001]$ free -lm
total used free shared buff/cache available
Mem: 192059 2561 112716 260 76781 188505
Low: 192059 79342 112716
High: 0 0 0
Swap: 8191 0 8191
(base) [my_user at gpu001]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 768049
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The MPI version is Open MPI 3.1.5.
This node is a compute node in a cluster; whether I submit the job through SLURM (a sketch of the batch script is below) or run it directly on the node, the error is the same.
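When going through SLURM, the batch script is essentially the interactive run above wrapped in sbatch directives; a minimal sketch (the GPU resource request is a placeholder, the actual partition/gres names on our cluster may differ):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1        # placeholder GPU request; actual resource name may differ
module load compiler/intel/2020.1
module load hpc_sdk/20.9
export OMP_NUM_THREADS=1
mpirun -n 2 /home/my_user/q-e-gpu-qe-gpu-6.7/bin/pw.x < silverslab32.in > silver4.out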
Note that I compile on the login node and run on this GPU node; the difference is that the login node has no GPU attached.
I would really appreciate it if you could help me figure out what could be going on.
Thank you.
Ms.C. Sandra Romero Molina
Ph.D. student
Computational Biochemistry
T03 R01 D48
Faculty of Biology
University of Duisburg-Essen
Universitätsstr. 2, 45117 Essen
email: sandra.romero-molina at uni-due.de
Phone: +49 176 2341 8772
ORCID: https://orcid.org/0000-0002-4990-1649