[QE-users] QE-GPU: Discrepancy in forces and problem in using OMP threading
Filippo Spiga
spiga.filippo at gmail.com
Fri Mar 4 10:34:43 CET 2022
Oops, typo while typing from the phone...
"are you using OMP_NUM_THREADS=48 or OMP_NUM_THREADS=12?"
(everything else is correct)
--
Filippo SPIGA ~ http://fspiga.github.io ~ skype: filippo.spiga
On Fri, 4 Mar 2022 at 09:33, Filippo Spiga <spiga.filippo at gmail.com> wrote:
> Dear Manish,
>
> when you use nGPU=4, does the "# of Threads" column specify the aggregate
> number of threads? Meaning, are you using OMP_NUM_THREADS=48 or
> OMP_NUM_THREADS=12? From your email it is not clear and, if you
> oversubscribe the physical cores with threads or processes, then
> performance is not going to be great.
>
> Also, you must manage bindings properly, otherwise MPI processes bound to
> a GPU on the other socket need to cross the awful CPU-to-CPU link. Have a
> look at the '--map-by' option of mpirun. For 4 GPUs, using 4 MPI processes
> and 12 OpenMP threads each, your mpirun will look like this:
>
> export OMP_NUM_THREADS=12
> mpirun -np 4 --map-by ppr:4:node:PE=12 ./pw.x
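>
> A fuller Slurm sketch along the same lines (just an illustration: it
> assumes Open MPI on a single node with 4 GPUs, and the GRES and input
> file names are placeholders to adapt to your site) would be:
>
> #!/bin/bash
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=4      # one MPI rank per GPU
> #SBATCH --cpus-per-task=12       # 4 ranks x 12 cores = 48 physical cores
> #SBATCH --gres=gpu:4             # GRES name depends on the cluster
>
> export OMP_NUM_THREADS=12
> # one rank per GPU, 12 cores per rank, ranks pinned to their cores
> mpirun -np 4 --map-by ppr:4:node:PE=12 --bind-to core ./pw.x -input scf.in > scf.out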
>
> If you are running on an HPC system managed by someone else, try reaching
> out to the User Support team and get guidance on correct binding and
> environment settings. What you are observing is very likely not related to
> QE-GPU but to how you are running your calculations.
>
> HTH
>
> --
> Filippo SPIGA ~ http://fspiga.github.io ~ skype: filippo.spiga
>
>
> On Wed, 2 Mar 2022 at 08:40, Manish Kumar <
> manish.kumar at acads.iiserpune.ac.in> wrote:
>
>> Dear all,
>>
>> I am using QE-GPU compiled on a 48-core Intel(R) Xeon(R) Platinum 8268
>> CPU @ 2.90GHz machine with four NVIDIA V100 GPU cards. To use all the CPU
>> cores, I set the OMP_NUM_THREADS variable in the Slurm script. The jobs
>> are run with "mpirun -np [nGPU] pw.x", where nGPU refers to the number of
>> GPUs used. Our system size (130 electrons and 64 k-points; the input file
>> is given below) is comparable to some of the systems in J. Chem. Phys.
>> 152, 154105 (2020); https://doi.org/10.1063/5.0005082.
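>>
>> (A minimal sketch of what this amounts to, with a placeholder input file
>> name and OMP_NUM_THREADS varied as in the tables below:)
>>
>> export OMP_NUM_THREADS=12
>> mpirun -np 4 pw.x -input scf.in > scf.out    # -np equal to the number of GPUs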
>>
>> I have two issues/questions with QE-GPU:
>> 1. The largest discrepancy in the atomic forces between the CPU and GPU
>> runs is 1.34x10^-4 Ry/Bohr. What is an acceptable value for this
>> discrepancy?
>> 2. I am seeing a significant increase in CPU time when I use multiple
>> OMP threads for the SCF calculation, as shown below. Could you please
>> suggest a solution, and let me know if I am doing anything incorrectly?
>> Any help would be much appreciated.
>> The details are as follows:
>>
>> nGPU=1
>> ---------------------------------------------
>> # of Threads   CPU Time (s)   WALL Time (s)
>>      01           254.23          384.27
>>      02           295.45          466.33
>>      03           328.89          538.62
>>      04           348.81          602.85
>>      08           501.31          943.32
>>      12           698.45         1226.86
>>      16           836.71         1505.39
>>      20           905.77         1645.66
>>      24          1094.81         1973.97
>>      28          1208.93         2278.81
>>      32          1403.27         2570.51
>>      36          1688.97         3068.91
>>      40          1820.06         3306.49
>>      44          1905.88         3603.96
>>      48          2163.18         4088.75
>> ---------------------------------------------
>>
>> nGPU=2
>> ---------------------------------------------
>> # of Threads   CPU Time (s)   WALL Time (s)
>>      01           226.69          329.51
>>      02           271.29          336.65
>>      03           312.36          335.24
>>      04           341.50          333.20
>>      06           400.42          328.66
>>      12           632.82          332.90
>>      24           992.02          335.28
>>      48          1877.65          438.40
>> ---------------------------------------------
>>
>> nGPU=4
>> ---------------------------------------------
>> # of Threads   CPU Time (s)   WALL Time (s)
>>      01           237.48          373.21
>>      02           268.85          382.92
>>      03           311.39          391.29
>>      04           341.14          391.71
>>      06           422.42          391.13
>>      12           632.94          396.75
>>      24           961.57          474.70
>>      48          2509.10          894.79
>> ---------------------------------------------
>>
>> The input file is:
>> --------------------------------------------
>> &control
>> calculation = 'scf',
>> prefix = "cofe2o4"
>> outdir = "./t"
>> pseudo_dir = "./"
>> tstress=.true.
>> tprnfor=.true.
>> /
>> &system
>> ibrav = 2,
>> nat = 14,
>> ntyp = 4,
>> celldm(1) = 15.9647d0
>> ecutwfc = 45
>> ecutrho = 450
>> nspin = 2
>> starting_magnetization(1)= 1.0,
>> starting_magnetization(3)=1.0,
>> starting_magnetization(2)=-1.0,
>> occupations = 'smearing',
>> degauss = 0.005,
>> smearing = 'mv'
>> lda_plus_u = .true.,
>> lda_plus_u_kind = 0,
>> U_projection_type = 'atomic',
>> Hubbard_U(1) = 3.5D0
>> Hubbard_U(2) = 3.5D0
>> Hubbard_U(3) = 3.0D0
>> /
>> &electrons
>> mixing_mode = 'local-TF'
>> mixing_beta = 0.2
>> conv_thr = 1.D-7
>> electron_maxstep = 250
>> diagonalization ='david'
>> /
>> &IONS
>> /
>> ATOMIC_SPECIES
>> Fe1 55.8450000000 Fe.pbe-sp-van_mit.UPF
>> Fe2 55.8450000000 Fe.pbe-sp-van_mit.UPF
>> Co 58.9332000000 Co.pbe-nd-rrkjus.UPF
>> O 15.9994000000 O.pbe-rrkjus.UPF
>> ATOMIC_POSITIONS crystal
>> Fe1 0.0000000000 0.5000000000 0.5000000000
>> Fe1 0.5000000000 0.0000000000 0.5000000000
>> Co 0.5000000000 0.5000000000 0.0000000000
>> Co 0.5000000000 0.5000000000 0.5000000000
>> Fe2 0.1206093444 0.1206093444 0.1293906556
>> Fe2 0.8793906556 0.8793906556 0.8706093444
>> O 0.2489473315 0.2489473315 0.2660301248
>> O 0.2489473315 0.2489473315 0.7360752123
>> O -0.2447080455 0.2661185400 0.7392947527
>> O 0.2447080455 0.7338814600 0.2607052473
>> O 0.2661185400 0.7552919545 -0.2607052473
>> O 0.7338814600 0.2447080455 0.2607052473
>> O 0.7510526685 -0.2489473315 0.2639247877
>> O 0.7510526685 0.7510526685 0.7339698752
>> K_POINTS (automatic)
>> 7 7 7 0 0 0
>> -----------------------------------------------------------
>>
>> Best regards
>> Manish Kumar
>> IISER Pune, India
>> _______________________________________________
>> Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
>> users mailing list users at lists.quantum-espresso.org
>> https://lists.quantum-espresso.org/mailman/listinfo/users
>
>