<div dir="ltr">Indipendently by the presence of a GPU, it is good practice NOT oversubscribe physical cores. <div><br></div><div>So, made up example, if your socket has 128 cores and ypou want to use 16 MPI then The number of OpenMP thread is 8 (128/16). If you specify more, you oversubscribe and as result performance may suck. It is also good practice have a MPI:GPU ration of 1:1 or maybe 2:1. But start with 1:1. <br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div><br></div><div>Regarding the discrepancy in the atomic force I let the developers comment. If you really believe it is a bug, open a bug report on the GitLab <a href="https://gitlab.com/QEF/q-e/-/issues">https://gitlab.com/QEF/q-e/-/issues</a>  and provide everything needed to reproduce the error.</div><div><br></div><div>HTH</div><div><br></div><div>--<br><span style="font-size:12.8px">Filippo SPIGA ~ <a href="http://fspiga.github.io" target="_blank">http://fspiga.github.io</a> ~ skype: filippo.spiga</span></div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 14 Mar 2022 at 17:51, Manish Kumar <<a href="mailto:manish.kumar@acads.iiserpune.ac.in">manish.kumar@acads.iiserpune.ac.in</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>  Dear Filippo, <div><br></div><div>Thank you very much for your reply.</div><div><br></div><div>The "# of threads" is the value of OMP_NUM_TRHREADS. I used nGPU=4 and OMP_NUM_TRHREADS=48. I think the combination is not appropriate. The OMP_NUM_TRHREADS value should not be higher than 12. Am I correct?</div><div><br></div><div>On one node, I am able to run the calculation. For a bigger system (388 atoms, 3604 electrons) I used multiple nodes (2 to 4 nodes each with 4 GPUs). The calculation got killed during the force calculation with the following error messages: </div><div><br></div><div>%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%<br>     Error in routine addusforce_gpu (1):<br>     cannot allocate buffers<br>%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%<br></div><div><br></div><div>The slurm script (for 2 nodes) for the about calculation is the following: <br></div><div>#-------------------------------------------------</div><div>#SBATCH --nodes=2<br></div><div>#SBATCH --gres=gpu:4<br>#SBATCH --ntasks=8<br>#SBATCH --ntasks-per-node=4<br>#SBATCH --cpus-per-task=12</div><div><br>export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK</div><div><br></div><div>mpirun -np 8 pw.x -inp <a href="http://input.in" target="_blank">input.in</a></div><div>or </div><div>mpirun -np 8 --map-by ppr:4:node:PE=12 pw.x -inp <a href="http://input.in" target="_blank">input.in</a><br></div><div>#---------------------------------------------</div><div><br></div><div>I cannot solve or understand the root cause of this error. Do you have any suggestions to resolve it? <br></div><div>Also, I would appreciate your comments on the discrepancy between CPU and GPU, which I mentioned in my previous email. 
On Mon, 14 Mar 2022 at 17:51, Manish Kumar <manish.kumar@acads.iiserpune.ac.in> wrote:

Dear Filippo,

Thank you very much for your reply.

The "# of Threads" column is the value of OMP_NUM_THREADS. I used nGPU=4 and OMP_NUM_THREADS=48. I think this combination is not appropriate; the OMP_NUM_THREADS value should not be higher than 12. Am I correct?

On one node, I am able to run the calculation. For a bigger system (388 atoms, 3604 electrons) I used multiple nodes (2 to 4 nodes, each with 4 GPUs). The calculation was killed during the force calculation with the following error message:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     Error in routine addusforce_gpu (1):
     cannot allocate buffers
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The Slurm script (for 2 nodes) for the above calculation is the following:

#-------------------------------------------------
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun -np 8 pw.x -inp input.in
or
mpirun -np 8 --map-by ppr:4:node:PE=12 pw.x -inp input.in
#---------------------------------------------

I cannot solve or understand the root cause of this error. Do you have any suggestions to resolve it?
Also, I would appreciate your comments on the discrepancy between CPU and GPU, which I mentioned in my previous email.

Thank you in advance!

Best regards
Manish Kumar
IISER Pune, India

On Fri, Mar 4, 2022 at 3:05 PM Filippo Spiga <spiga.filippo@gmail.com> wrote:

Oops, a typo while typing from the phone...

"are you using OMP_NUM_THREADS=48 or OMP_NUM_THREADS=12?"

(everything else is correct)

--
Filippo SPIGA ~ http://fspiga.github.io ~ skype: filippo.spiga

On Fri, 4 Mar 2022 at 09:33, Filippo Spiga <spiga.filippo@gmail.com> wrote:

Dear Manish,

When you use nGPU=4, does the "# of Threads" column specify the aggregate number of threads? Meaning, are you using OMP_NUM_THREADS=48 or OMP_NUM_THREADS=48? From your email it is not clear, and if you oversubscribe physical cores with threads or processes, performance is not going to be great.

Also, you must manage bindings properly; otherwise, MPI processes bound to a GPU on another socket need to cross the slow CPU-to-CPU link. Have a look at the '--map-by' option of mpirun. For 4 GPUs, using 4 MPI processes and 12 OpenMP threads each, your mpirun will look like this:

export OMP_NUM_THREADS=12
mpirun -np 4 --map-by ppr:4:node:PE=12 ./pw.x

If you are running on an HPC system managed by someone else, try reaching out to the user support team for guidance on correct binding and environment. What you are observing is very likely not related to QE-GPU but to how you are running your calculations.

HTH

--
Filippo SPIGA ~ http://fspiga.github.io ~ skype: filippo.spiga
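One common way to check the resulting layout, and to make the rank-to-GPU mapping explicit, is sketched below. It assumes Open MPI (which provides the --report-bindings option and exports OMPI_COMM_WORLD_LOCAL_RANK to each process), one GPU per MPI rank, and placeholder file names (gpu_wrapper.sh, input.in); recent pw.x GPU builds typically assign devices to ranks on their own, so the wrapper only makes that mapping explicit.

#-------------------------------------------------
#!/bin/bash
# gpu_wrapper.sh (example name): give each rank its own GPU, then run the
# real command. OMPI_COMM_WORLD_LOCAL_RANK is the node-local rank index
# (0..3 with 4 ranks per node), so rank k sees only GPU k.
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"
#-------------------------------------------------

#-------------------------------------------------
# Launch: 4 ranks per node, 12 cores per rank. --report-bindings prints the
# core set each rank was bound to, so oversubscription or ranks landing on
# the wrong socket show up immediately in the job output.
export OMP_NUM_THREADS=12
mpirun -np 4 --map-by ppr:4:node:PE=12 --report-bindings \
    ./gpu_wrapper.sh pw.x -inp input.in
#-------------------------------------------------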
On Wed, 2 Mar 2022 at 08:40, Manish Kumar <manish.kumar@acads.iiserpune.ac.in> wrote:

Dear all,

I am using QE-GPU compiled on a 48-core Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz and four NVIDIA V100 GPU cards. To use all the CPUs, I am using the OMP_NUM_THREADS variable in the Slurm script. The jobs are run with "mpirun -np [nGPU] pw.x", where nGPU refers to the number of GPUs used. Our system size (130 electrons and 64 k-points; the input file is given below) is comparable to some of the systems in J. Chem. Phys. 152, 154105 (2020); https://doi.org/10.1063/5.0005082.

I have two issues/questions with QE-GPU:
1. The largest discrepancy in the atomic forces between CPU and GPU is 1.34x10^-4 Ry/Bohr. What is an acceptable value for this discrepancy?
2. I am experiencing a significant increase in CPU time when I use multiple OpenMP threads for SCF calculations, as you can see below. Could you please suggest a solution and let me know if I am doing anything incorrectly? Any help would be much appreciated.

The details are as follows:

nGPU=1
--------------------------------
# of Threads    CPU Time (s)    WALL Time (s)
01                   254.23          384.27
02                   295.45          466.33
03                   328.89          538.62
04                   348.81          602.85
08                   501.31          943.32
12                   698.45         1226.86
16                   836.71         1505.39
20                   905.77         1645.66
24                  1094.81         1973.97
28                  1208.93         2278.81
32                  1403.27         2570.51
36                  1688.97         3068.91
40                  1820.06         3306.49
44                  1905.88         3603.96
48                  2163.18         4088.75
--------------------------------

nGPU=2
--------------------------------
# of Threads    CPU Time (s)    WALL Time (s)
01                   226.69          329.51
02                   271.29          336.65
03                   312.36          335.24
04                   341.50          333.20
06                   400.42          328.66
12                   632.82          332.90
24                   992.02          335.28
48                  1877.65          438.40
--------------------------------
nGPU=4
--------------------------------
# of Threads    CPU Time (s)    WALL Time (s)
01                   237.48          373.21
02                   268.85          382.92
03                   311.39          391.29
04                   341.14          391.71
06                   422.42          391.13
12                   632.94          396.75
24                   961.57          474.70
48                  2509.10          894.79
--------------------------------

The input file is:
--------------------------------------------
&control
    calculation = 'scf',
    prefix = "cofe2o4"
    outdir = "./t"
    pseudo_dir = "./"
    tstress=.true.
    tprnfor=.true.
/
&system
    ibrav = 2,
    nat = 14,
    ntyp = 4,
    celldm(1) = 15.9647d0
    ecutwfc = 45
    ecutrho = 450
    nspin = 2
    starting_magnetization(1)= 1.0,
    starting_magnetization(3)=1.0,
    starting_magnetization(2)=-1.0,
    occupations = 'smearing',
    degauss = 0.005,
    smearing = 'mv'
    lda_plus_u = .true.,
    lda_plus_u_kind = 0,
    U_projection_type = 'atomic',
    Hubbard_U(1) = 3.5D0
    Hubbard_U(2) = 3.5D0
    Hubbard_U(3) = 3.0D0
/
&electrons
    mixing_mode = 'local-TF'
    mixing_beta = 0.2
    conv_thr = 1.D-7
    electron_maxstep = 250
    diagonalization ='david'
/
&IONS
/
ATOMIC_SPECIES
   Fe1   55.8450000000  Fe.pbe-sp-van_mit.UPF
   Fe2   55.8450000000  Fe.pbe-sp-van_mit.UPF
   Co    58.9332000000  Co.pbe-nd-rrkjus.UPF
   O     15.9994000000  O.pbe-rrkjus.UPF
ATOMIC_POSITIONS crystal
Fe1           0.0000000000        0.5000000000        0.5000000000
Fe1           0.5000000000        0.0000000000        0.5000000000
Co            0.5000000000        0.5000000000        0.0000000000
Co            0.5000000000        0.5000000000        0.5000000000
Fe2           0.1206093444        0.1206093444        0.1293906556
Fe2           0.8793906556        0.8793906556        0.8706093444
O             0.2489473315        0.2489473315        0.2660301248
O             0.2489473315        0.2489473315        0.7360752123
O            -0.2447080455        0.2661185400        0.7392947527
O             0.2447080455        0.7338814600        0.2607052473
O             0.2661185400        0.7552919545       -0.2607052473
O             0.7338814600        0.2447080455        0.2607052473
O             0.7510526685       -0.2489473315        0.2639247877
O             0.7510526685        0.7510526685        0.7339698752
K_POINTS (automatic)
7 7 7 0 0 0
-----------------------------------------------------------

Best regards
Manish Kumar
IISER Pune, India
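Applying the advice from this thread to the node described above (48 physical cores, four V100s), one MPI rank per GPU with 48/4 = 12 OpenMP threads per rank keeps the cores fully used but not oversubscribed; for comparison, the nGPU=4, 48-thread rows in the tables correspond to 4 x 48 = 192 threads on 48 cores. A single-node sketch of such a launch, assuming Slurm and Open MPI as in the scripts above (input.in is a placeholder):

#-------------------------------------------------
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4      # one MPI rank per GPU (1:1 ratio)
#SBATCH --cpus-per-task=12       # 48 cores / 4 ranks = 12 threads per rank

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

mpirun -np 4 --map-by ppr:4:node:PE=12 pw.x -inp input.in
#-------------------------------------------------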
hspace="streak-pt-mark" style="max-height:1px"><img alt="" style="width: 0px; max-height: 0px; overflow: hidden;" src="https://mailfoogae.appspot.com/t?sender=abWFuaXNoLmt1bWFyQGFjYWRzLmlpc2VycHVuZS5hYy5pbg%3D%3D&type=zerocontent&guid=ad58b12e-0d30-4961-b195-ef64b1ad9946"><font color="#ffffff" size="1">ᐧ</font></div>
_______________________________________________
The Quantum ESPRESSO community stands by the Ukrainian people and expresses its concerns about the devastating effects that the Russian military offensive has on their country and on the free and peaceful scientific, cultural, and economic cooperation amongst peoples.
_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
users mailing list: users@lists.quantum-espresso.org
https://lists.quantum-espresso.org/mailman/listinfo/users