[QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.
Dyer, Brock
brdyer at ursinus.edu
Thu Oct 24 17:38:53 CEST 2024
Hi folks, I have been trying to use the GPU-accelerated version of pw.x from QE 7.0 lately, and have noticed that it is significantly (roughly 10x) slower than the CPU version. The GPU nodes I use have one AMD EPYC 7763 processor (64 cores, 128 threads) and four NVIDIA A100 GPUs (40 GB each), and the CPU nodes have two AMD EPYC 7763 processors. The time reports from runs on identical input files are below, GPU first and then CPU; a sketch of a typical launch setup follows the timing reports.
GPU Version:
init_run : 14.17s CPU 19.29s WALL ( 1 calls)
electrons : 1352.63s CPU 1498.17s WALL ( 19 calls)
update_pot : 144.15s CPU 158.77s WALL ( 18 calls)
forces : 144.74s CPU 158.92s WALL ( 19 calls)
Called by init_run:
wfcinit : 0.14s CPU 2.10s WALL ( 1 calls)
2.10s GPU ( 1 calls)
potinit : 12.83s CPU 13.78s WALL ( 1 calls)
hinit0 : 0.29s CPU 0.35s WALL ( 1 calls)
Called by electrons:
c_bands : 30.64s CPU 38.04s WALL ( 173 calls)
sum_band : 36.93s CPU 40.47s WALL ( 173 calls)
v_of_rho : 1396.71s CPU 1540.48s WALL ( 185 calls)
newd : 13.67s CPU 20.30s WALL ( 185 calls)
9.04s GPU ( 167 calls)
mix_rho : 26.02s CPU 27.31s WALL ( 173 calls)
vdW_kernel : 4.99s CPU 5.01s WALL ( 1 calls)
Called by c_bands:
init_us_2 : 0.24s CPU 0.39s WALL ( 347 calls)
init_us_2:gp : 0.23s CPU 0.38s WALL ( 347 calls)
regterg : 29.53s CPU 36.07s WALL ( 173 calls)
Called by *egterg:
rdiaghg : 0.61s CPU 1.74s WALL ( 585 calls)
1.72s GPU ( 585 calls)
h_psi : 26.71s CPU 33.73s WALL ( 611 calls)
33.69s GPU ( 611 calls)
s_psi : 0.08s CPU 0.16s WALL ( 611 calls)
0.14s GPU ( 611 calls)
g_psi : 0.00s CPU 0.04s WALL ( 437 calls)
0.04s GPU ( 437 calls)
Called by h_psi:
h_psi:calbec : 0.27s CPU 0.32s WALL ( 611 calls)
0.32s GPU ( 611 calls)
vloc_psi : 26.11s CPU 33.04s WALL ( 611 calls)
33.02s GPU ( 611 calls)
add_vuspsi : 0.06s CPU 0.14s WALL ( 611 calls)
0.13s GPU ( 611 calls)
General routines
calbec : 0.32s CPU 0.37s WALL ( 860 calls)
fft : 778.93s CPU 892.58s WALL ( 12061 calls)
13.39s GPU ( 1263 calls)
ffts : 12.40s CPU 12.96s WALL ( 173 calls)
fftw : 30.44s CPU 39.53s WALL ( 3992 calls)
38.80s GPU ( 3992 calls)
Parallel routines
PWSCF : 27m46.53s CPU 30m49.28s WALL
CPU Version:
init_run : 2.35s CPU 2.79s WALL ( 1 calls)
electrons : 99.04s CPU 142.56s WALL ( 19 calls)
update_pot : 9.01s CPU 13.47s WALL ( 18 calls)
forces : 9.89s CPU 14.35s WALL ( 19 calls)
Called by init_run:
wfcinit : 0.08s CPU 0.17s WALL ( 1 calls)
potinit : 1.27s CPU 1.50s WALL ( 1 calls)
hinit0 : 0.27s CPU 0.33s WALL ( 1 calls)
Called by electrons:
c_bands : 28.09s CPU 33.01s WALL ( 173 calls)
sum_band : 13.69s CPU 14.89s WALL ( 173 calls)
v_of_rho : 56.29s CPU 95.06s WALL ( 185 calls)
newd : 5.60s CPU 6.38s WALL ( 185 calls)
mix_rho : 1.37s CPU 1.65s WALL ( 173 calls)
vdW_kernel : 0.84s CPU 0.88s WALL ( 1 calls)
Called by c_bands:
init_us_2 : 0.54s CPU 0.62s WALL ( 347 calls)
init_us_2:cp : 0.54s CPU 0.62s WALL ( 347 calls)
regterg : 27.54s CPU 32.31s WALL ( 173 calls)
Called by *egterg:
rdiaghg : 0.45s CPU 0.49s WALL ( 584 calls)
h_psi : 23.00s CPU 27.54s WALL ( 610 calls)
s_psi : 0.64s CPU 0.66s WALL ( 610 calls)
g_psi : 0.04s CPU 0.04s WALL ( 436 calls)
Called by h_psi:
h_psi:calbec : 1.53s CPU 1.75s WALL ( 610 calls)
vloc_psi : 20.46s CPU 24.73s WALL ( 610 calls)
vloc_psi:tg_ : 1.62s CPU 1.71s WALL ( 610 calls)
add_vuspsi : 0.82s CPU 0.86s WALL ( 610 calls)
General routines
calbec : 2.20s CPU 2.52s WALL ( 859 calls)
fft : 40.10s CPU 76.07s WALL ( 12061 calls)
ffts : 0.66s CPU 0.73s WALL ( 173 calls)
fftw : 18.72s CPU 22.92s WALL ( 8916 calls)
Parallel routines
fft_scatt_xy : 15.80s CPU 20.80s WALL ( 21150 calls)
fft_scatt_yz : 27.55s CPU 58.79s WALL ( 21150 calls)
fft_scatt_tg : 3.60s CPU 4.31s WALL ( 8916 calls)
PWSCF : 2m 1.29s CPU 2m54.94s WALL
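For reference, a typical way to launch the GPU build on a single Perlmutter GPU node looks roughly like the following. This is an illustrative sketch, not my exact job script; the account, queue, time limit, and the input/output file names (scf.in, scf.out) are placeholders:
#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -N 1
#SBATCH --ntasks-per-node=4       # one MPI rank per A100
#SBATCH --gpus-per-node=4
#SBATCH -c 32                     # CPU cores available to each rank
#SBATCH -t 01:00:00
export OMP_NUM_THREADS=1          # keep host-side threading modest
srun -n 4 ./pw.x -npool 1 -input scf.in > scf.out
With the GPU build the usual recommendation is one MPI rank per GPU rather than one rank per core, which is why only 4 tasks are requested here.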
This version of QE was compiled on the Perlmutter supercomputer at NERSC. Here are the compile specifications:
# Modules
Currently Loaded Modules:
1) craype-x86-milan
2) libfabric/1.11.0.4.114
3) craype-network-ofi
4) perftools-base/22.04.0
5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
6) xalt/2.10.2
7) nvidia/21.11 (g,c)
8) craype/2.7.15 (c)
9) cray-dsmml/0.2.2
10) cray-mpich/8.1.15 (mpi)
11) PrgEnv-nvidia/8.3.3 (cpe)
12) Nsight-Compute/2022.1.1
13) Nsight-Systems/2022.2.1
14) cudatoolkit/11.5 (g)
15) cray-fftw/3.3.8.13 (math)
16) cray-hdf5-parallel/1.12.1.1 (io)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel --enable-openmp --disable-shared --with-scalapack=yes FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu --with-hdf5=${HDF5_DIR}
make veryclean
make all
# go to EPW directory: make; then go to main binary directory and link to epw.x executable
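To double-check that the binary and the run are actually using the GPUs, I use checks along these lines (the exact banner text printed by pw.x may differ between QE versions, so treat the grep pattern as approximate; scf.out is the placeholder output file from the launch sketch above):
# confirm CUDA settings were picked up at configure time (make.inc is generated by configure)
grep -i cuda make.inc
# confirm the run output header reports GPU acceleration
grep -i "GPU" scf.out
# watch GPU utilization while the job is running
nvidia-smi --loop=5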
If there is any more information required, please let me know and I will try to get it promptly!