[QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.

Dyer, Brock brdyer at ursinus.edu
Thu Oct 24 17:38:53 CEST 2024


Hi folks, I have been trying to use the QE 7.0 GPU-accelerated version of pw.x lately, and have noticed that it is significantly (roughly 10x) slower than the CPU version. The GPU nodes I use have one AMD EPYC 7763 processor (64 cores, 128 threads) and four NVIDIA A100 GPUs (40 GB each), and the CPU nodes have two AMD EPYC 7763 processors. An illustrative launch script is sketched below, followed by the time reports from runs on identical input files (GPU first, then CPU):
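An illustrative single-node GPU launch on Perlmutter looks roughly like the following. This is a minimal sketch rather than a copy of my actual job script: the queue, walltime, and file names are placeholders, and the one-MPI-rank-per-GPU mapping is an assumption, not something I have verified as optimal.

#!/bin/bash
#SBATCH -C gpu                     # Perlmutter GPU node: 1x EPYC 7763 + 4x A100 (40 GB)
#SBATCH -N 1
#SBATCH --ntasks-per-node=4        # one MPI rank per GPU (assumption)
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-node=4
#SBATCH -q regular                 # placeholder queue
#SBATCH -t 01:00:00                # placeholder walltime

export OMP_NUM_THREADS=1           # illustrative; I have not tuned the OpenMP thread count
srun -n 4 -c 32 --gpus-per-task=1 ./pw.x -in scf.in > scf.out   # scf.in / scf.out are placeholder names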

GPU Version:

     init_run     :     14.17s CPU     19.29s WALL (       1 calls)
     electrons    :   1352.63s CPU   1498.17s WALL (      19 calls)
     update_pot   :    144.15s CPU    158.77s WALL (      18 calls)
     forces       :    144.74s CPU    158.92s WALL (      19 calls)

     Called by init_run:
     wfcinit      :      0.14s CPU      2.10s WALL (       1 calls)
                                        2.10s GPU  (       1 calls)
     potinit      :     12.83s CPU     13.78s WALL (       1 calls)
     hinit0       :      0.29s CPU      0.35s WALL (       1 calls)

     Called by electrons:
     c_bands      :     30.64s CPU     38.04s WALL (     173 calls)
     sum_band     :     36.93s CPU     40.47s WALL (     173 calls)
     v_of_rho     :   1396.71s CPU   1540.48s WALL (     185 calls)
     newd         :     13.67s CPU     20.30s WALL (     185 calls)
                                        9.04s GPU  (     167 calls)
     mix_rho      :     26.02s CPU     27.31s WALL (     173 calls)
     vdW_kernel   :      4.99s CPU      5.01s WALL (       1 calls)

     Called by c_bands:
     init_us_2    :      0.24s CPU      0.39s WALL (     347 calls)
     init_us_2:gp :      0.23s CPU      0.38s WALL (     347 calls)
     regterg      :     29.53s CPU     36.07s WALL (     173 calls)

     Called by *egterg:
     rdiaghg      :      0.61s CPU      1.74s WALL (     585 calls)
                                        1.72s GPU  (     585 calls)
     h_psi        :     26.71s CPU     33.73s WALL (     611 calls)
                                       33.69s GPU  (     611 calls)
     s_psi        :      0.08s CPU      0.16s WALL (     611 calls)
                                        0.14s GPU  (     611 calls)
     g_psi        :      0.00s CPU      0.04s WALL (     437 calls)
                                        0.04s GPU  (     437 calls)

     Called by h_psi:
     h_psi:calbec :      0.27s CPU      0.32s WALL (     611 calls)
                                        0.32s GPU  (     611 calls)
     vloc_psi     :     26.11s CPU     33.04s WALL (     611 calls)
                                       33.02s GPU  (     611 calls)
     add_vuspsi   :      0.06s CPU      0.14s WALL (     611 calls)
                                        0.13s GPU  (     611 calls)

     General routines
     calbec       :      0.32s CPU      0.37s WALL (     860 calls)
     fft          :    778.93s CPU    892.58s WALL (   12061 calls)
                                       13.39s GPU  (    1263 calls)
     ffts         :     12.40s CPU     12.96s WALL (     173 calls)
     fftw         :     30.44s CPU     39.53s WALL (    3992 calls)
                                       38.80s GPU  (    3992 calls)

     Parallel routines

     PWSCF        :  27m46.53s CPU  30m49.28s WALL


CPU Version:


     init_run     :      2.35s CPU      2.79s WALL (       1 calls)

     electrons    :     99.04s CPU    142.56s WALL (      19 calls)
     update_pot   :      9.01s CPU     13.47s WALL (      18 calls)
     forces       :      9.89s CPU     14.35s WALL (      19 calls)

     Called by init_run:
     wfcinit      :      0.08s CPU      0.17s WALL (       1 calls)
     potinit      :      1.27s CPU      1.50s WALL (       1 calls)
     hinit0       :      0.27s CPU      0.33s WALL (       1 calls)

     Called by electrons:
     c_bands      :     28.09s CPU     33.01s WALL (     173 calls)
     sum_band     :     13.69s CPU     14.89s WALL (     173 calls)
     v_of_rho     :     56.29s CPU     95.06s WALL (     185 calls)
     newd         :      5.60s CPU      6.38s WALL (     185 calls)
     mix_rho      :      1.37s CPU      1.65s WALL (     173 calls)
     vdW_kernel   :      0.84s CPU      0.88s WALL (       1 calls)

     Called by c_bands:
     init_us_2    :      0.54s CPU      0.62s WALL (     347 calls)
     init_us_2:cp :      0.54s CPU      0.62s WALL (     347 calls)
     regterg      :     27.54s CPU     32.31s WALL (     173 calls)

     Called by *egterg:
     rdiaghg      :      0.45s CPU      0.49s WALL (     584 calls)
     h_psi        :     23.00s CPU     27.54s WALL (     610 calls)
     s_psi        :      0.64s CPU      0.66s WALL (     610 calls)
     g_psi        :      0.04s CPU      0.04s WALL (     436 calls)

     Called by h_psi:
     h_psi:calbec :      1.53s CPU      1.75s WALL (     610 calls)
     vloc_psi     :     20.46s CPU     24.73s WALL (     610 calls)
     vloc_psi:tg_ :      1.62s CPU      1.71s WALL (     610 calls)
     add_vuspsi   :      0.82s CPU      0.86s WALL (     610 calls)

     General routines
     calbec       :      2.20s CPU      2.52s WALL (     859 calls)
     fft          :     40.10s CPU     76.07s WALL (   12061 calls)
     ffts         :      0.66s CPU      0.73s WALL (     173 calls)
     fftw         :     18.72s CPU     22.92s WALL (    8916 calls)

     Parallel routines
     fft_scatt_xy :     15.80s CPU     20.80s WALL (   21150 calls)
     fft_scatt_yz :     27.55s CPU     58.79s WALL (   21150 calls)
     fft_scatt_tg :      3.60s CPU      4.31s WALL (    8916 calls)

     PWSCF        :   2m 1.29s CPU   2m54.94s WALL


This version of QE was compiled on the Perlmutter supercomputer at NERSC. Here are the compile specifications:

# Modules

Currently Loaded Modules:
  1) craype-x86-milan
  2) libfabric/1.11.0.4.114
  3) craype-network-ofi
  4) perftools-base/22.04.0
  5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
  6) xalt/2.10.2
  7) nvidia/21.11                         (g,c)
  8) craype/2.7.15                        (c)
  9) cray-dsmml/0.2.2
 10) cray-mpich/8.1.15                    (mpi)
 11) PrgEnv-nvidia/8.3.3                  (cpe)
 12) Nsight-Compute/2022.1.1
 13) Nsight-Systems/2022.2.1
 14) cudatoolkit/11.5                     (g)
 15) cray-fftw/3.3.8.13                   (math)
 16) cray-hdf5-parallel/1.12.1.1          (io)

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib

./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel --enable-openmp --disable-shared --with-scalapack=yes FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu --with-hdf5=${HDF5_DIR}

make veryclean
make all

# go to EPW directory: make; then go to main binary directory and link to epw.x executable
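In case it is useful, here are a couple of sanity checks of the kind one can run to confirm the CUDA path was actually compiled in and is active at run time. The exact strings being grepped are assumptions from memory and may differ between QE versions, and scf.out is a placeholder output file name:

grep __CUDA make.inc        # after configure --with-cuda, the CUDA preprocessor flag should appear in DFLAGS
grep -i gpu scf.out | head  # a GPU-enabled pw.x should mention GPU acceleration near the top of its output

Running nvidia-smi on the compute node while the job is active should also show the A100s with nonzero utilization.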


If any more information is needed, please let me know and I will try to provide it promptly!
