[QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.
Dyer, Brock
brdyer at ursinus.edu
Tue Oct 29 18:52:08 CET 2024
My current submission script uses 4 tasks per node, and my input has only 1 k-point. I feel it is pertinent to mention that I am running molecular systems, not a crystal or any sort of repeating structure. There are only 31 Kohn-Sham states in the system, and the FFT grid is (192,192,192). I had just sort of assumed that the GPU code would always be faster than the CPU code, maybe not by much, but certainly not 8-10x slower. Is that an unrealistic expectation?
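
For reference, a minimal sketch of a one-rank-per-GPU submission on a
Perlmutter-style GPU node (assuming SLURM; the account, QOS, walltime, and
file names below are placeholders rather than values from the actual run):

#!/bin/bash
#SBATCH -C gpu                  # Perlmutter GPU node: 1x EPYC 7763 + 4x A100 (40 GB)
#SBATCH -N 1
#SBATCH --ntasks-per-node=4     # one MPI rank per GPU, as recommended below
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=32      # 4 ranks x 32 hardware threads = 128 threads on the EPYC
#SBATCH -q regular              # placeholder QOS
#SBATCH -A <account>            # placeholder account
#SBATCH -t 01:00:00             # placeholder walltime

export OMP_NUM_THREADS=1        # keep host-side threading modest for the GPU run
srun -n 4 --gpus-per-task=1 pw.x -input scf.in > scf.out   # placeholder file names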
________________________________
From: Paolo Giannozzi <paolo.giannozzi at uniud.it>
Sent: Monday, October 28, 2024 12:04 PM
To: Quantum ESPRESSO users Forum <users at lists.quantum-espresso.org>
Cc: Dyer, Brock <brdyer at ursinus.edu>
Subject: Re: [QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.
Performance on GPUs depends on many factors, e.g., the size of the
system and how the code is run. One should run one MPI process per GPU and
use low-communication parallelization (e.g., over k points) whenever possible.
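
For instance, a generic sketch (file names and counts below are placeholders,
and pool parallelism only helps when there is more than one k point):

# 4 MPI ranks, one per GPU; k points distributed over 4 pools,
# with serial diagonalization (-ndiag 1), which is usually adequate
# for a small Hamiltonian
mpirun -np 4 pw.x -nk 4 -ndiag 1 -input pw.in > pw.out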
Paolo
On 10/24/24 17:38, Dyer, Brock wrote:
>
> Hi folks, I have been trying to use the QE 7.0 GPU-accelerated version
> of pw.x lately, and have noticed that it is significantly (10x) slower
> than the CPU version. The GPU nodes I use have an AMD EPYC 7763
> processor (64 cores, 128 threads) and 4 NVIDIA A100 (40 GB each) GPUs,
> and the CPU nodes have 2 AMD EPYC 7763 processors. The time reports from
> runs on identical input files are below (GPU first, then CPU):
>
> GPU Version:
>
> init_run : 14.17s CPU 19.29s WALL ( 1 calls)
> electrons : 1352.63s CPU 1498.17s WALL ( 19 calls)
> update_pot : 144.15s CPU 158.77s WALL ( 18 calls)
> forces : 144.74s CPU 158.92s WALL ( 19 calls)
>
> Called by init_run:
> wfcinit : 0.14s CPU 2.10s WALL ( 1 calls)
> 2.10s GPU ( 1 calls)
> potinit : 12.83s CPU 13.78s WALL ( 1 calls)
> hinit0 : 0.29s CPU 0.35s WALL ( 1 calls)
>
> Called by electrons:
> c_bands : 30.64s CPU 38.04s WALL ( 173 calls)
> sum_band : 36.93s CPU 40.47s WALL ( 173 calls)
> v_of_rho : 1396.71s CPU 1540.48s WALL ( 185 calls)
> newd : 13.67s CPU 20.30s WALL ( 185 calls)
> 9.04s GPU ( 167 calls)
> mix_rho : 26.02s CPU 27.31s WALL ( 173 calls)
> vdW_kernel : 4.99s CPU 5.01s WALL ( 1 calls)
>
> Called by c_bands:
> init_us_2 : 0.24s CPU 0.39s WALL ( 347 calls)
> init_us_2:gp : 0.23s CPU 0.38s WALL ( 347 calls)
> regterg : 29.53s CPU 36.07s WALL ( 173 calls)
>
> Called by *egterg:
> rdiaghg : 0.61s CPU 1.74s WALL ( 585 calls)
> 1.72s GPU ( 585 calls)
> h_psi : 26.71s CPU 33.73s WALL ( 611 calls)
> 33.69s GPU ( 611 calls)
> s_psi : 0.08s CPU 0.16s WALL ( 611 calls)
> 0.14s GPU ( 611 calls)
> g_psi : 0.00s CPU 0.04s WALL ( 437 calls)
> 0.04s GPU ( 437 calls)
>
> Called by h_psi:
> h_psi:calbec : 0.27s CPU 0.32s WALL ( 611 calls)
> 0.32s GPU ( 611 calls)
> vloc_psi : 26.11s CPU 33.04s WALL ( 611 calls)
> 33.02s GPU ( 611 calls)
> add_vuspsi : 0.06s CPU 0.14s WALL ( 611 calls)
> 0.13s GPU ( 611 calls)
>
> General routines
> calbec : 0.32s CPU 0.37s WALL ( 860 calls)
> fft : 778.93s CPU 892.58s WALL ( 12061 calls)
> 13.39s GPU ( 1263 calls)
> ffts : 12.40s CPU 12.96s WALL ( 173 calls)
> fftw : 30.44s CPU 39.53s WALL ( 3992 calls)
> 38.80s GPU ( 3992 calls)
> Parallel routines
> PWSCF : 27m46.53s CPU 30m49.28s WALL
>
> CPU Version:
>
>
> init_run : 2.35s CPU 2.79s WALL ( 1 calls)
>
> electrons : 99.04s CPU 142.56s WALL ( 19 calls)
> update_pot : 9.01s CPU 13.47s WALL ( 18 calls)
> forces : 9.89s CPU 14.35s WALL ( 19 calls)
>
> Called by init_run:
> wfcinit : 0.08s CPU 0.17s WALL ( 1 calls)
> potinit : 1.27s CPU 1.50s WALL ( 1 calls)
> hinit0 : 0.27s CPU 0.33s WALL ( 1 calls)
>
> Called by electrons:
> c_bands : 28.09s CPU 33.01s WALL ( 173 calls)
> sum_band : 13.69s CPU 14.89s WALL ( 173 calls)
> v_of_rho : 56.29s CPU 95.06s WALL ( 185 calls)
> newd : 5.60s CPU 6.38s WALL ( 185 calls)
> mix_rho : 1.37s CPU 1.65s WALL ( 173 calls)
> vdW_kernel : 0.84s CPU 0.88s WALL ( 1 calls)
>
> Called by c_bands:
> init_us_2 : 0.54s CPU 0.62s WALL ( 347 calls)
> init_us_2:cp : 0.54s CPU 0.62s WALL ( 347 calls)
> regterg : 27.54s CPU 32.31s WALL ( 173 calls)
>
> Called by *egterg:
> rdiaghg : 0.45s CPU 0.49s WALL ( 584 calls)
> h_psi : 23.00s CPU 27.54s WALL ( 610 calls)
> s_psi : 0.64s CPU 0.66s WALL ( 610 calls)
> g_psi : 0.04s CPU 0.04s WALL ( 436 calls)
>
> Called by h_psi:
> h_psi:calbec : 1.53s CPU 1.75s WALL ( 610 calls)
> vloc_psi : 20.46s CPU 24.73s WALL ( 610 calls)
> vloc_psi:tg_ : 1.62s CPU 1.71s WALL ( 610 calls)
> add_vuspsi : 0.82s CPU 0.86s WALL ( 610 calls)
>
> General routines
> calbec : 2.20s CPU 2.52s WALL ( 859 calls)
> fft : 40.10s CPU 76.07s WALL ( 12061 calls)
> ffts : 0.66s CPU 0.73s WALL ( 173 calls)
> fftw : 18.72s CPU 22.92s WALL ( 8916 calls)
>
> Parallel routines
> fft_scatt_xy : 15.80s CPU 20.80s WALL ( 21150 calls)
> fft_scatt_yz : 27.55s CPU 58.79s WALL ( 21150 calls)
> fft_scatt_tg : 3.60s CPU 4.31s WALL ( 8916 calls)
>
> PWSCF : 2m 1.29s CPU 2m54.94s WALL
>
>
> This version of QE was compiled on the Perlmutter supercomputer at
> NERSC. Here are the compile specifications:
>
> # Modules
>
>
> Currently Loaded Modules:
> 1) craype-x86-milan
> 2) libfabric/1.11.0.4.114
> 3) craype-network-ofi
> 4) perftools-base/22.04.0
> 5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
> 6) xalt/2.10.2
> 7) nvidia/21.11 (g,c)
> 8) craype/2.7.15 (c)
> 9) cray-dsmml/0.2.2
> 10) cray-mpich/8.1.15 (mpi)
> 11) PrgEnv-nvidia/8.3.3 (cpe)
> 12) Nsight-Compute/2022.1.1
> 13) Nsight-Systems/2022.2.1
> 14) cudatoolkit/11.5 (g)
> 15) cray-fftw/3.3.8.13 (math)
> 16) cray-hdf5-parallel/1.12.1.1 (io)
>
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
>
> ./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME \
>   --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel \
>   --enable-openmp --disable-shared --with-scalapack=yes \
>   FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc \
>   --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu --with-hdf5=${HDF5_DIR}
>
> make veryclean
> make all
>
> # go to the EPW directory and run make; then go to the main binary directory
> and link to the epw.x executable
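>
> A quick sanity check that the build actually ended up GPU-enabled (a sketch,
> assuming the standard QE build layout; the exact message text can vary
> between versions):
>
> grep __CUDA make.inc                # --with-cuda should add -D__CUDA to DFLAGS
> grep -i "GPU acceleration" pw.out   # recent GPU builds typically announce it in the output header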
>
>
> If there is any more information required, please let me know and I will
> try to get it promptly!
>
>
>
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216