[QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.

Paolo Giannozzi paolo.giannozzi at uniud.it
Tue Oct 29 21:56:44 CET 2024


On 29/10/2024 18:52, Dyer, Brock wrote:

> My current submission script uses 4 tasks per node, and my input only 
> has 1 k-point. I feel it is pertinent to mention that I am running 
> molecular systems, not a crystal or any sort of repeating structure. 
> There are only 31 Kohn-Sham states in the system, and the FFT grid is 
> (192,192,192). 

Your system is not one of the best-performing ones on GPUs: it has few
Kohn-Sham states, a single k point, and a large FFT grid for the charge
density and potentials. Moreover, your code version (7.0) has limited
GPU porting. In fact, the vast majority of the time is spent in
v_of_rho, which computes the potential from the charge density; this
part has been ported to GPU only in later versions.
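
To put numbers on it, from the time reports below: the GPU run spends
about 1540 s of its ~1849 s total wall time (over 80%) in v_of_rho,
versus about 95 s out of ~175 s in the CPU run, which is consistent
with that routine not being GPU-accelerated in this version.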

Paolo

> ------------------------------------------------------------------------
> *From:* Paolo Giannozzi <paolo.giannozzi at uniud.it>
> *Sent:* Monday, October 28, 2024 12:04 PM
> *To:* Quantum ESPRESSO users Forum <users at lists.quantum-espresso.org>
> *Cc:* Dyer, Brock <brdyer at ursinus.edu>
> *Subject:* Re: [QE-users] [QE-GPU] GPU runs significantly slower than 
> CPU runs.
> GPU performance depends on many factors, e.g., the size of the system
> and how the code is run. One should run one MPI process per GPU and
> use low-communication parallelization (e.g., over k points) whenever
> possible.
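> 
> For instance, on a node with four GPUs this means a launch along these
> lines (the scheduler flags and thread counts below are only
> illustrative and must be adapted to your system):
> 
>    export OMP_NUM_THREADS=16
>    srun -n 4 --ntasks-per-node=4 --cpus-per-task=32 --gpus-per-task=1 \
>         pw.x -i pw.in > pw.out
> 
> With a single k point there is nothing to distribute over pools, so
> the default -nk 1 is fine.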
> 
> Paolo
> 
> On 10/24/24 17:38, Dyer, Brock wrote:
>  >
>  > Hi folks, I have been trying to use the QE 7.0 GPU-accelerated version
>  > of pw.x lately, and have noticed that it is significantly (10x) slower
>  > than the CPU version. The GPU nodes I use have an AMD EPYC 7763
>  > processor (64 cores, 128 threads) and 4 NVIDIA A100 (40 GB each) GPUs,
>  > and the CPU nodes have 2 AMD EPYC 7763 processors. The time reports from
>  > runs on identical input files are below (GPU first, then CPU):
>  >
>  > GPU Version:
>  >
>  >       init_run     :     14.17s CPU     19.29s WALL (       1 calls)
>  >       electrons    :   1352.63s CPU   1498.17s WALL (      19 calls)
>  >       update_pot   :    144.15s CPU    158.77s WALL (      18 calls)
>  >       forces       :    144.74s CPU    158.92s WALL (      19 calls)
>  >
>  >       Called by init_run:
>  >       wfcinit      :      0.14s CPU      2.10s WALL (       1 calls)
>  >                                          2.10s GPU  (       1 calls)
>  >       potinit      :     12.83s CPU     13.78s WALL (       1 calls)
>  >       hinit0       :      0.29s CPU      0.35s WALL (       1 calls)
>  >
>  >       Called by electrons:
>  >       c_bands      :     30.64s CPU     38.04s WALL (     173 calls)
>  >       sum_band     :     36.93s CPU     40.47s WALL (     173 calls)
>  >       v_of_rho     :   1396.71s CPU   1540.48s WALL (     185 calls)
>  >       newd         :     13.67s CPU     20.30s WALL (     185 calls)
>  >                                          9.04s GPU  (     167 calls)
>  >       mix_rho      :     26.02s CPU     27.31s WALL (     173 calls)
>  >       vdW_kernel   :      4.99s CPU      5.01s WALL (       1 calls)
>  >
>  >       Called by c_bands:
>  >       init_us_2    :      0.24s CPU      0.39s WALL (     347 calls)
>  >       init_us_2:gp :      0.23s CPU      0.38s WALL (     347 calls)
>  >       regterg      :     29.53s CPU     36.07s WALL (     173 calls)
>  >
>  >       Called by *egterg:
>  >       rdiaghg      :      0.61s CPU      1.74s WALL (     585 calls)
>  >                                          1.72s GPU  (     585 calls)
>  >       h_psi        :     26.71s CPU     33.73s WALL (     611 calls)
>  >                                         33.69s GPU  (     611 calls)
>  >       s_psi        :      0.08s CPU      0.16s WALL (     611 calls)
>  >                                          0.14s GPU  (     611 calls)
>  >       g_psi        :      0.00s CPU      0.04s WALL (     437 calls)
>  >                                          0.04s GPU  (     437 calls)
>  >
>  >       Called by h_psi:
>  >       h_psi:calbec :      0.27s CPU      0.32s WALL (     611 calls)
>  >                                          0.32s GPU  (     611 calls)
>  >       vloc_psi     :     26.11s CPU     33.04s WALL (     611 calls)
>  >                                         33.02s GPU  (     611 calls)
>  >       add_vuspsi   :      0.06s CPU      0.14s WALL (     611 calls)
>  >                                          0.13s GPU  (     611 calls)
>  >
>  >       General routines
>  >       calbec       :      0.32s CPU      0.37s WALL (     860 calls)
>  >       fft          :    778.93s CPU    892.58s WALL (   12061 calls)
>  >                                         13.39s GPU  (    1263 calls)
>  >       ffts         :     12.40s CPU     12.96s WALL (     173 calls)
>  >       fftw         :     30.44s CPU     39.53s WALL (    3992 calls)
>  >                                         38.80s GPU  (    3992 calls)
>  >       Parallel routines
>  >       PWSCF        :  27m46.53s CPU  30m49.28s WALL
>  >
>  > CPU Version:
>  >
>  >
>  >       init_run     :      2.35s CPU      2.79s WALL (       1 calls)
>  >
>  >       electrons    :     99.04s CPU    142.56s WALL (      19 calls)
>  >       update_pot   :      9.01s CPU     13.47s WALL (      18 calls)
>  >       forces       :      9.89s CPU     14.35s WALL (      19 calls)
>  >
>  >       Called by init_run:
>  >       wfcinit      :      0.08s CPU      0.17s WALL (       1 calls)
>  >       potinit      :      1.27s CPU      1.50s WALL (       1 calls)
>  >       hinit0       :      0.27s CPU      0.33s WALL (       1 calls)
>  >
>  >       Called by electrons:
>  >       c_bands      :     28.09s CPU     33.01s WALL (     173 calls)
>  >       sum_band     :     13.69s CPU     14.89s WALL (     173 calls)
>  >       v_of_rho     :     56.29s CPU     95.06s WALL (     185 calls)
>  >       newd         :      5.60s CPU      6.38s WALL (     185 calls)
>  >       mix_rho      :      1.37s CPU      1.65s WALL (     173 calls)
>  >       vdW_kernel   :      0.84s CPU      0.88s WALL (       1 calls)
>  >
>  >       Called by c_bands:
>  >       init_us_2    :      0.54s CPU      0.62s WALL (     347 calls)
>  >       init_us_2:cp :      0.54s CPU      0.62s WALL (     347 calls)
>  >       regterg      :     27.54s CPU     32.31s WALL (     173 calls)
>  >
>  >       Called by *egterg:
>  >       rdiaghg      :      0.45s CPU      0.49s WALL (     584 calls)
>  >       h_psi        :     23.00s CPU     27.54s WALL (     610 calls)
>  >       s_psi        :      0.64s CPU      0.66s WALL (     610 calls)
>  >       g_psi        :      0.04s CPU      0.04s WALL (     436 calls)
>  >
>  >       Called by h_psi:
>  >       h_psi:calbec :      1.53s CPU      1.75s WALL (     610 calls)
>  >       vloc_psi     :     20.46s CPU     24.73s WALL (     610 calls)
>  >       vloc_psi:tg_ :      1.62s CPU      1.71s WALL (     610 calls)
>  >       add_vuspsi   :      0.82s CPU      0.86s WALL (     610 calls)
>  >
>  >       General routines
>  >       calbec       :      2.20s CPU      2.52s WALL (     859 calls)
>  >       fft          :     40.10s CPU     76.07s WALL (   12061 calls)
>  >       ffts         :      0.66s CPU      0.73s WALL (     173 calls)
>  >       fftw         :     18.72s CPU     22.92s WALL (    8916 calls)
>  >
>  >       Parallel routines
>  >       fft_scatt_xy :     15.80s CPU     20.80s WALL (   21150 calls)
>  >       fft_scatt_yz :     27.55s CPU     58.79s WALL (   21150 calls)
>  >       fft_scatt_tg :      3.60s CPU      4.31s WALL (    8916 calls)
>  >
>  >       PWSCF        :   2m 1.29s CPU   2m54.94s WALL
>  >
>  >
>  > This version of QE was compiled on the Perlmutter supercomputer at
>  > NERSC. Here are the compile specifications:
>  >
>  > # Modules
>  >
>  >
>  > Currently Loaded Modules:
>  >    1) craype-x86-milan
>  >    2) libfabric/1.11.0.4.114
>  >    3) craype-network-ofi
>  >    4) perftools-base/22.04.0
>  >    5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
>  >    6) xalt/2.10.2
>  >    7) nvidia/21.11                         (g,c)
>  >    8) craype/2.7.15                        (c)
>  >    9) cray-dsmml/0.2.2
>  >   10) cray-mpich/8.1.15                    (mpi)
>  >   11) PrgEnv-nvidia/8.3.3                  (cpe)
>  >   12) Nsight-Compute/2022.1.1
>  >   13) Nsight-Systems/2022.2.1
>  >   14) cudatoolkit/11.5                     (g)
>  >   15) cray-fftw/3.3.8.13                   (math)
>  >   16) cray-hdf5-parallel/1.12.1.1          (io)
>  >
>  > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
>  >
>  > ./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME
>  > --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel
>  > --enable-openmp --disable-shared --with-scalapack=yes
>  > FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc
>  > --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu --with-hdf5=${HDF5_DIR}
>  >
>  > make veryclean
>  > make all
>  >
>  > # go to EPW directory: make; then go to main binary directory and link
>  > to epw.x executable
>  >
>  >
>  > If there is any more information required, please let me know and I will
>  > try to get it promptly!
>  >
>  >
>  >
>  > _______________________________________________
>  > The Quantum ESPRESSO community stands by the Ukrainian
>  > people and expresses its concerns about the devastating
>  > effects that the Russian military offensive has on their
>  > country and on the free and peaceful scientific, cultural,
>  > and economic cooperation amongst peoples
>  > _______________________________________________
>  > Quantum ESPRESSO is supported by MaX (https://www.max-centre.eu)
>  > users mailing list users at lists.quantum-espresso.org
>  > https://lists.quantum-espresso.org/mailman/listinfo/users
> 
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216

-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216


