[QE-users] QE-GPU: Discrepancy in forces and problem in using OMP threading

Filippo Spiga spiga.filippo at gmail.com
Tue Mar 15 22:19:00 CET 2022


Independently of the presence of a GPU, it is good practice NOT to
oversubscribe physical cores.

So, as a made-up example, if your socket has 128 cores and you want to use 16
MPI processes, then the number of OpenMP threads is 8 (128/16). If you specify
more, you oversubscribe and, as a result, performance may suffer. It is also
good practice to have an MPI:GPU ratio of 1:1 or maybe 2:1. But start with 1:1.
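
As a concrete sketch of that arithmetic (using the made-up numbers above; the
variable names and the input file name are placeholders, and the exact launch
options depend on your MPI library and site setup), the job script could look
roughly like this:

CORES_PER_NODE=128    # physical cores on the node (made-up example)
RANKS_PER_NODE=16     # MPI processes you want per node
# one OpenMP thread per remaining physical core, so nothing is oversubscribed
export OMP_NUM_THREADS=$(( CORES_PER_NODE / RANKS_PER_NODE ))   # = 8
mpirun -np $RANKS_PER_NODE pw.x -inp input.in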

Regarding the discrepancy in the atomic forces, I will let the developers
comment. If you really believe it is a bug, open a bug report on GitLab
(https://gitlab.com/QEF/q-e/-/issues) and provide everything needed to
reproduce the error.

HTH

--
Filippo SPIGA ~ http://fspiga.github.io ~ skype: filippo.spiga


On Mon, 14 Mar 2022 at 17:51, Manish Kumar <
manish.kumar at acads.iiserpune.ac.in> wrote:

>   Dear Filippo,
>
> Thank you very much for your reply.
>
> The "# of threads" is the value of OMP_NUM_TRHREADS. I used nGPU=4
> and OMP_NUM_TRHREADS=48. I think the combination is not appropriate.
> The OMP_NUM_TRHREADS value should not be higher than 12. Am I correct?
>
> On one node, I am able to run the calculation. For a bigger system (388
> atoms, 3604 electrons) I used multiple nodes (2 to 4 nodes each with 4
> GPUs). The calculation got killed during the force calculation with the
> following error messages:
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>      Error in routine addusforce_gpu (1):
>      cannot allocate buffers
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> The slurm script (for 2 nodes) for the above calculation is the following:
> #-------------------------------------------------
> #SBATCH --nodes=2
> #SBATCH --gres=gpu:4
> #SBATCH --ntasks=8
> #SBATCH --ntasks-per-node=4
> #SBATCH --cpus-per-task=12
>
> export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
>
> mpirun -np 8 pw.x -inp input.in
> or
> mpirun -np 8 --map-by ppr:4:node:PE=12 pw.x -inp input.in
> #---------------------------------------------
>
> I cannot solve or understand the root cause of this error. Do you have any
> suggestions to resolve it?
> Also, I would appreciate your comments on the discrepancy between CPU and
> GPU, which I mentioned in my previous email.
>
> Thank you in advance!
>
> Best regards
> Manish Kumar
> IISER Pune, India
>
> On Fri, Mar 4, 2022 at 3:05 PM Filippo Spiga <spiga.filippo at gmail.com>
> wrote:
>
>> Oops, a typo while typing from the phone...
>>
>> "are you using OMP_NUM_THREADS=48 or OMP_NUM_THREADS=12?"
>>
>> (everything else is correct)
>>
>> --
>> Filippo SPIGA ~ http://fspiga.github.io ~ skype: filippo.spiga
>>
>>
>> On Fri, 4 Mar 2022 at 09:33, Filippo Spiga <spiga.filippo at gmail.com>
>> wrote:
>>
>>> Dear Manish,
>>>
>>> when you use nGPU=4, does the "# of Threads" column specify the aggregate
>>> number of threads? Meaning, are you using OMP_NUM_THREADS=48 or
>>> OMP_NUM_THREADS=48? From your email it is not clear, and if you
>>> oversubscribe physical cores with threads or processes, then performance is
>>> not going to be great.
>>>
>>> Also, you must manage bindings properly, otherwise MPI processes bound to a
>>> GPU on another socket need to cross the awful CPU-to-CPU link. Have a look
>>> at the '--map-by' option of mpirun. For 4 GPUs, using 4 MPI processes and 12
>>> OpenMP threads, your mpirun will look like this:
>>>
>>> export OMP_NUM_THREADS=12
>>> mpirun -np 4 --map-by ppr:4:node:PE=12 ./pw.x
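>>>
>>> For reference, on your 48-core, 4-GPU node the same layout in a Slurm batch
>>> script would look roughly like the sketch below (partition/account options
>>> omitted, the input file name is just a placeholder, and your site may
>>> recommend different binding flags):
>>>
>>> #!/bin/bash
>>> #SBATCH --nodes=1
>>> #SBATCH --ntasks-per-node=4      # one MPI rank per GPU
>>> #SBATCH --cpus-per-task=12       # 48 physical cores / 4 ranks
>>> #SBATCH --gres=gpu:4
>>>
>>> export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # = 12, no oversubscription
>>> # pin 4 ranks per node, 12 cores each, so every rank stays close to its GPU
>>> mpirun -np 4 --map-by ppr:4:node:PE=12 pw.x -inp input.in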
>>>
>>> If you are running on an HPC system managed by someone else, try reaching
>>> out to the User Support team for guidance on correct binding and environment.
>>> What you are observing is very likely not related to QE-GPU but to how you
>>> are running your calculations.
>>>
>>> HTH
>>>
>>> --
>>> Filippo SPIGA ~ http://fspiga.github.io ~ skype: filippo.spiga
>>>
>>>
>>> On Wed, 2 Mar 2022 at 08:40, Manish Kumar <
>>> manish.kumar at acads.iiserpune.ac.in> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I am using QE-GPU compiled on a 48-core Intel(R) Xeon(R) Platinum 8268
>>>> CPU @ 2.90GHz and four NVIDIA V100 GPU cards. To use all the CPUs, I am
>>>> using the OMP_NUM_THREADS variable in the slurm script. The jobs are run
>>>> with "mpirun -np [nGPU] pw.x", where nGPU refers to the number of GPUs
>>>> used. Our system size (130 electrons and 64 k-points, the input file is
>>>> given below) is comparable to some systems in J. Chem. Phys. 152, 154105
>>>> (2020); https://doi.org/10.1063/5.0005082.
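>>>>
>>>> For completeness, the timing runs reported below were launched along these
>>>> lines (a rough sketch only; the output file names and the -inp flag are
>>>> placeholders for what I actually use):
>>>>
>>>> NGPU=1                                # repeated with 2 and 4 GPUs
>>>> for nt in 1 2 3 4 8 12 16 24 48; do   # thread counts varied per case, see tables
>>>>     export OMP_NUM_THREADS=$nt
>>>>     mpirun -np $NGPU pw.x -inp input.in > scf_${NGPU}gpu_${nt}thr.out
>>>> done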
>>>>
>>>> I have two issues/questions with QE-GPU:
>>>> 1. The largest discrepancy in the atomic forces between the CPU and GPU runs
>>>> is 1.34x10^-4 Ry/Bohr. What is an acceptable value for this discrepancy?
>>>> 2. I am experiencing a significant increase in CPU time when I use
>>>> multiple OMP threads for SCF calculations, as you can see below. Could you
>>>> please suggest any solution to this and let me know if I am doing anything
>>>> incorrectly? Any help would be much appreciated.
>>>> The details are as follows:
>>>>
>>>> nGPU=1
>>>> ---------------------------------------------
>>>> # of Threads    CPU Time (s)    WALL Time (s)
>>>> 01                    254.23           384.27
>>>> 02                    295.45           466.33
>>>> 03                    328.89           538.62
>>>> 04                    348.81           602.85
>>>> 08                    501.31           943.32
>>>> 12                    698.45          1226.86
>>>> 16                    836.71          1505.39
>>>> 20                    905.77          1645.66
>>>> 24                   1094.81          1973.97
>>>> 28                   1208.93          2278.81
>>>> 32                   1403.27          2570.51
>>>> 36                   1688.97          3068.91
>>>> 40                   1820.06          3306.49
>>>> 44                   1905.88          3603.96
>>>> 48                   2163.18          4088.75
>>>> ---------------------------------------------
>>>>
>>>> nGPU=2
>>>> ---------------------------------------------
>>>> # of Threads    CPU Time (s)    WALL Time (s)
>>>> 01                    226.69           329.51
>>>> 02                    271.29           336.65
>>>> 03                    312.36           335.24
>>>> 04                    341.50           333.20
>>>> 06                    400.42           328.66
>>>> 12                    632.82           332.90
>>>> 24                    992.02           335.28
>>>> 48                   1877.65           438.40
>>>> ---------------------------------------------
>>>>
>>>> nGPU=4
>>>> ---------------------------------------------
>>>> # of Threads    CPU Time (s)    WALL Time (s)
>>>> 01                    237.48           373.21
>>>> 02                    268.85           382.92
>>>> 03                    311.39           391.29
>>>> 04                    341.14           391.71
>>>> 06                    422.42           391.13
>>>> 12                    632.94           396.75
>>>> 24                    961.57           474.70
>>>> 48                   2509.10           894.79
>>>> ---------------------------------------------
>>>>
>>>> The input file is:
>>>> --------------------------------------------
>>>> &control
>>>>     calculation = 'scf',
>>>>     prefix = "cofe2o4"
>>>>     outdir = "./t"
>>>>     pseudo_dir = "./"
>>>>     tstress=.true.
>>>>     tprnfor=.true.
>>>> /
>>>> &system
>>>>     ibrav = 2,
>>>>      nat = 14,
>>>>      ntyp = 4,
>>>>     celldm(1) = 15.9647d0
>>>>     ecutwfc = 45
>>>>     ecutrho = 450
>>>>     nspin = 2
>>>>     starting_magnetization(1)= 1.0,
>>>>     starting_magnetization(3)=1.0,
>>>>     starting_magnetization(2)=-1.0,
>>>>     occupations = 'smearing',
>>>>     degauss = 0.005,
>>>>     smearing = 'mv'
>>>>     lda_plus_u = .true.,
>>>>     lda_plus_u_kind = 0,
>>>>     U_projection_type = 'atomic',
>>>>     Hubbard_U(1) = 3.5D0
>>>>     Hubbard_U(2) = 3.5D0
>>>>     Hubbard_U(3) = 3.0D0
>>>> /
>>>> &electrons
>>>>     mixing_mode = 'local-TF'
>>>>     mixing_beta = 0.2
>>>>     conv_thr = 1.D-7
>>>>     electron_maxstep = 250
>>>>     diagonalization ='david'
>>>> /
>>>> &IONS
>>>> /
>>>> ATOMIC_SPECIES
>>>>    Fe1   55.8450000000  Fe.pbe-sp-van_mit.UPF
>>>>    Fe2   55.8450000000  Fe.pbe-sp-van_mit.UPF
>>>>    Co   58.9332000000  Co.pbe-nd-rrkjus.UPF
>>>>     O   15.9994000000  O.pbe-rrkjus.UPF
>>>> ATOMIC_POSITIONS crystal
>>>> Fe1           0.0000000000        0.5000000000        0.5000000000
>>>> Fe1           0.5000000000        0.0000000000        0.5000000000
>>>> Co            0.5000000000        0.5000000000        0.0000000000
>>>> Co            0.5000000000        0.5000000000        0.5000000000
>>>> Fe2           0.1206093444        0.1206093444        0.1293906556
>>>> Fe2           0.8793906556        0.8793906556        0.8706093444
>>>> O             0.2489473315        0.2489473315        0.2660301248
>>>> O             0.2489473315        0.2489473315        0.7360752123
>>>> O            -0.2447080455        0.2661185400        0.7392947527
>>>> O             0.2447080455        0.7338814600        0.2607052473
>>>> O             0.2661185400        0.7552919545       -0.2607052473
>>>> O             0.7338814600        0.2447080455        0.2607052473
>>>> O             0.7510526685       -0.2489473315        0.2639247877
>>>> O             0.7510526685        0.7510526685        0.7339698752
>>>> K_POINTS (automatic)
>>>> 7 7 7 0 0 0
>>>> -----------------------------------------------------------
>>>>
>>>> Best regards
>>>> Manish Kumar
>>>> IISER Pune, India
>
> _______________________________________________
> The Quantum ESPRESSO community stands by the Ukrainian
> people and expresses its concerns about the devastating
> effects that the Russian military offensive has on their
> country and on the free and peaceful scientific, cultural,
> and economic cooperation amongst peoples
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users