[QE-users] Out of Memory on large available RAM for QE-GPU version

Pietro Bonfa' pietro.bonfa at unipr.it
Tue Jan 26 14:14:22 CET 2021


Dear Sandra,

unfortunately PW still does not predict the amount of GPU memory that will 
be used during the simulation, but the RAM estimate is also a reasonable 
guess for the GPU memory.

The error message that you see is actually a failed allocation on the 
GPU side, not in host RAM.
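
If you want to confirm this while the job is running, you can watch the 
GPU memory fill up from a second shell on the compute node (assuming 
nvidia-smi is available on gpu001), for example:

     nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5

The reported GPU usage should grow until the failed allocation, while the 
host RAM shown by top stays comparatively small.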

Even if you had NVIDIA cards with 16 GB of memory, the prediction, which 
in your case reads

Estimated max dynamical RAM per process >      13.87 GB

is, as you can see, generally an underestimate. I wouldn't be surprised 
by a 15-20% inaccuracy.
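
As a rough sanity check, assuming 16 GB cards and adding that margin to 
the estimate above:

     13.87 GB x 1.15 ~ 16.0 GB
     13.87 GB x 1.20 ~ 16.6 GB

so even a 16 GB card would be at or beyond its capacity, before counting 
the memory reserved by the CUDA context itself.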

Hope this helps,
best,
Pietro

-- 
Pietro Bonfà
Department of Mathematical, Physical and Computer Sciences
University of Parma


On 1/26/21 12:58 PM, Romero Molina, Sandra wrote:
> Dear Community,
> 
> I have compiled Quantum ESPRESSO (Program PWSCF v.6.7MaX) for GPU 
> acceleration (hybrid MPI/OpenMP) with the following options:
> 
>                module load compiler/intel/2020.1
> 
>                module load hpc_sdk/20.9
> 
>                ./configure F90=pgf90 CC=pgcc MPIF90=mpif90 
> --with-cuda=yes --enable-cuda-env-check=no --with-cuda-runtime=11.0 
> --with-cuda-cc=70 --enable-openmp BLAS_LIBS='-lmkl_intel_lp64 
> -lmkl_intel_thread -lmkl_core'
> 
>                make -j8 pw
> 
> The compilation apparently finishes successfully, and then I execute the 
> program:
> 
>                module load compiler/intel/2020.1
> 
>                module load hpc_sdk/20.9
> 
>                export OMP_NUM_THREADS=1
> 
>                mpirun -n 2 /home/my_user/q-e-gpu-qe-gpu-6.7/bin/pw.x < 
> silverslab32.in > silver4.out
> 
> Then the program starts and outputs:
> 
>      Parallel version (MPI & OpenMP), running on       8 processor cores
> 
>      Number of MPI processes:                 2
> 
>      Threads/MPI process:                     4
> 
>                ...
> 
>                GPU acceleration is ACTIVE
> 
>                ...
> 
>      Estimated max dynamical RAM per process >      13.87 GB
> 
>      Estimated total dynamical RAM >      27.75 GB
> 
> But after 2 minutes of execution, the job ends with the following error:
> 
> 0: ALLOCATE: 4345479360 bytes requested; status = 2(out of memory)
> 
> 0: ALLOCATE: 4345482096 bytes requested; status = 2(out of memory)
> 
> --------------------------------------------------------------------------
> 
> Primary job  terminated normally, but 1 process returned
> 
> a non-zero exit code. Per user-direction, the job has been aborted.
> 
> --------------------------------------------------------------------------
> 
> --------------------------------------------------------------------------
> 
> mpirun detected that one or more processes exited with non-zero status, 
> thus causing
> 
> the job to be terminated. The first process to do so was:
> 
>    Process name: [[47946,1],1]
> 
>    Exit code:    127
> 
> --------------------------------------------------------------------------
> 
> This node has > 180 GB of available RAM. With the top command, this is 
> the memory consumption:
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 89681 my_user   20   0   30.1g   3.6g   2.1g R 100.0  1.9   1:39.45 pw.x
> 89682 my_user   20   0   29.8g   3.2g   2.0g R 100.0  1.7   1:39.30 pw.x
> 
> When the RES memory reaches about 4 GB, the processes stop and the error 
> is displayed.
> 
> These are the characteristics of the node:
> 
> (base) [my_user at gpu001]$ numactl --hardware
> 
> available: 2 nodes (0-1)
> 
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 
> 37 38 39 40 41
> 
> node 0 size: 95313 MB
> 
> node 0 free: 41972 MB
> 
> node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 
> 48 49 50 51 52 53 54 55
> 
> node 1 size: 96746 MB
> 
> node 1 free: 70751 MB
> 
> node distances:
> 
> node   0   1
> 
>    0:  10  21
> 
>    1:  21  10
> 
> (base) [my_user at gpu001]$ free -lm
> 
>                total        used        free      shared  buff/cache   available
> Mem:         192059        2561      112716         260       76781      188505
> Low:         192059       79342      112716
> High:             0           0           0
> Swap:          8191           0        8191
> 
> (base) [my_user at gpu001]$ ulimit -a
> 
> core file size             (blocks, -c) 0
> 
> data seg size              (kbytes, -d) unlimited
> 
> scheduling priority             (-e) 0
> 
> file size               (blocks, -f) unlimited
> 
> pending signals                 (-i) 768049
> 
> max locked memory       (kbytes, -l) unlimited
> 
> max memory size         (kbytes, -m) unlimited
> 
> open files                      (-n) 100000
> 
> pipe size            (512 bytes, -p) 8
> 
> POSIX message queues     (bytes, -q) 819200
> 
> real-time priority              (-r) 0
> 
> stack size              (kbytes, -s) unlimited
> 
> cpu time               (seconds, -t) unlimited
> 
> max user processes              (-u) 4096
> 
> virtual memory          (kbytes, -v) unlimited
> 
> file locks                      (-x) unlimited
> 
> The MPI version is Open MPI 3.1.5.
> 
> This node is a compute node in a cluster, but whether I submit the job 
> with SLURM or run it directly on the node, the error is the same.
> 
> Note that I compile on the login node and run on this GPU node; the 
> difference is that the login node has no GPU attached.
> 
> I would really appreciate it if you could help me figure out what could 
> be going on.
> 
> Thank you.
> 
> Ms.C. Sandra Romero Molina
> Ph.D. student
> Computational Biochemistry
> 
> T03 R01 D48
> Faculty of Biology
> University of Duisburg-Essen
> Universitätsstr. 2, 45117 Essen
> email: sandra.romero-molina at uni-due.de
> 
> Phone: +49 176 2341 8772
> ORCID: https://orcid.org/0000-0002-4990-1649
> 
> 
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users
> 

