[QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.
Dyer, Brock
brdyer at ursinus.edu
Tue Oct 29 18:52:08 CET 2024
My current submission script uses 4 tasks per node, and my input has only 1 k-point. I feel it is pertinent to mention that I am running molecular systems, not a crystal or any sort of repeating structure. There are only 31 Kohn-Sham states in the system, and the FFT grid is (192,192,192). I had just sort of assumed that the GPU code would always be faster than the CPU code, maybe not by much, but certainly not 8-10x slower. Is that an unrealistic expectation?
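
For reference, a minimal sketch of a one-rank-per-GPU submission on a
Perlmutter-style GPU node (assuming SLURM; the account, QOS, walltime, and
file names below are placeholders rather than values from the actual run):

#!/bin/bash
#SBATCH -C gpu                  # Perlmutter GPU node: 1x EPYC 7763 + 4x A100 (40 GB)
#SBATCH -N 1
#SBATCH --ntasks-per-node=4     # one MPI rank per GPU, as recommended below
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=32      # 4 ranks x 32 hardware threads = 128 threads on the EPYC
#SBATCH -q regular              # placeholder QOS
#SBATCH -A <account>            # placeholder account
#SBATCH -t 01:00:00             # placeholder walltime

export OMP_NUM_THREADS=1        # keep host-side threading modest for the GPU run
srun -n 4 --gpus-per-task=1 pw.x -input scf.in > scf.out   # placeholder file names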
________________________________
From: Paolo Giannozzi <paolo.giannozzi at uniud.it>
Sent: Monday, October 28, 2024 12:04 PM
To: Quantum ESPRESSO users Forum <users at lists.quantum-espresso.org>
Cc: Dyer, Brock <brdyer at ursinus.edu>
Subject: Re: [QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.
Performance on GPUs depends on many factors, e.g., the size of the
system and how the code is run. One should run one MPI process per GPU and
use low-communication parallelization (e.g., over k points) whenever possible.
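
For instance, a generic sketch (file names and counts below are placeholders,
and pool parallelism only helps when there is more than one k point):

# 4 MPI ranks, one per GPU; k points distributed over 4 pools,
# with serial diagonalization (-ndiag 1), which is usually adequate
# for a small Hamiltonian
mpirun -np 4 pw.x -nk 4 -ndiag 1 -input pw.in > pw.out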
Paolo
On 10/24/24 17:38, Dyer, Brock wrote:
>
> Hi folks, I have been trying to use the QE 7.0 GPU-accelerated version
> of pw.x lately, and have noticed that it is significantly (10x) slower
> than the CPU version. The GPU nodes I use have an AMD EPYC 7763
> processor (64 cores, 128 threads) and 4 NVIDIA A100 (40 GB each) GPUs,
> and the CPU nodes have 2 AMD EPYC 7763 processors. The time reports from
> runs on identical input files are below (GPU first, then CPU):
>
> GPU Version:
>
> init_run : 14.17s CPU 19.29s WALL ( 1 calls)
> electrons : 1352.63s CPU 1498.17s WALL ( 19 calls)
> update_pot : 144.15s CPU 158.77s WALL ( 18 calls)
> forces : 144.74s CPU 158.92s WALL ( 19 calls)
>
> Called by init_run:
> wfcinit : 0.14s CPU 2.10s WALL ( 1 calls)
> 2.10s GPU ( 1 calls)
> potinit : 12.83s CPU 13.78s WALL ( 1 calls)
> hinit0 : 0.29s CPU 0.35s WALL ( 1 calls)
>
> Called by electrons:
> c_bands : 30.64s CPU 38.04s WALL ( 173 calls)
> sum_band : 36.93s CPU 40.47s WALL ( 173 calls)
> v_of_rho : 1396.71s CPU 1540.48s WALL ( 185 calls)
> newd : 13.67s CPU 20.30s WALL ( 185 calls)
> 9.04s GPU ( 167 calls)
> mix_rho : 26.02s CPU 27.31s WALL ( 173 calls)
> vdW_kernel : 4.99s CPU 5.01s WALL ( 1 calls)
>
> Called by c_bands:
> init_us_2 : 0.24s CPU 0.39s WALL ( 347 calls)
> init_us_2:gp : 0.23s CPU 0.38s WALL ( 347 calls)
> regterg : 29.53s CPU 36.07s WALL ( 173 calls)
>
> Called by *egterg:
> rdiaghg : 0.61s CPU 1.74s WALL ( 585 calls)
> 1.72s GPU ( 585 calls)
> h_psi : 26.71s CPU 33.73s WALL ( 611 calls)
> 33.69s GPU ( 611 calls)
> s_psi : 0.08s CPU 0.16s WALL ( 611 calls)
> 0.14s GPU ( 611 calls)
> g_psi : 0.00s CPU 0.04s WALL ( 437 calls)
> 0.04s GPU ( 437 calls)
>
> Called by h_psi:
> h_psi:calbec : 0.27s CPU 0.32s WALL ( 611 calls)
> 0.32s GPU ( 611 calls)
> vloc_psi : 26.11s CPU 33.04s WALL ( 611 calls)
> 33.02s GPU ( 611 calls)
> add_vuspsi : 0.06s CPU 0.14s WALL ( 611 calls)
> 0.13s GPU ( 611 calls)
>
> General routines
> calbec : 0.32s CPU 0.37s WALL ( 860 calls)
> fft : 778.93s CPU 892.58s WALL ( 12061 calls)
> 13.39s GPU ( 1263 calls)
> ffts : 12.40s CPU 12.96s WALL ( 173 calls)
> fftw : 30.44s CPU 39.53s WALL ( 3992 calls)
> 38.80s GPU ( 3992 calls)
> Parallel routines
> PWSCF : 27m46.53s CPU 30m49.28s WALL
>
> CPU Version:
>
>
> init_run : 2.35s CPU 2.79s WALL ( 1 calls)
>
> electrons : 99.04s CPU 142.56s WALL ( 19 calls)
> update_pot : 9.01s CPU 13.47s WALL ( 18 calls)
> forces : 9.89s CPU 14.35s WALL ( 19 calls)
>
> Called by init_run:
> wfcinit : 0.08s CPU 0.17s WALL ( 1 calls)
> potinit : 1.27s CPU 1.50s WALL ( 1 calls)
> hinit0 : 0.27s CPU 0.33s WALL ( 1 calls)
>
> Called by electrons:
> c_bands : 28.09s CPU 33.01s WALL ( 173 calls)
> sum_band : 13.69s CPU 14.89s WALL ( 173 calls)
> v_of_rho : 56.29s CPU 95.06s WALL ( 185 calls)
> newd : 5.60s CPU 6.38s WALL ( 185 calls)
> mix_rho : 1.37s CPU 1.65s WALL ( 173 calls)
> vdW_kernel : 0.84s CPU 0.88s WALL ( 1 calls)
>
> Called by c_bands:
> init_us_2 : 0.54s CPU 0.62s WALL ( 347 calls)
> init_us_2:cp : 0.54s CPU 0.62s WALL ( 347 calls)
> regterg : 27.54s CPU 32.31s WALL ( 173 calls)
>
> Called by *egterg:
> rdiaghg : 0.45s CPU 0.49s WALL ( 584 calls)
> h_psi : 23.00s CPU 27.54s WALL ( 610 calls)
> s_psi : 0.64s CPU 0.66s WALL ( 610 calls)
> g_psi : 0.04s CPU 0.04s WALL ( 436 calls)
>
> Called by h_psi:
> h_psi:calbec : 1.53s CPU 1.75s WALL ( 610 calls)
> vloc_psi : 20.46s CPU 24.73s WALL ( 610 calls)
> vloc_psi:tg_ : 1.62s CPU 1.71s WALL ( 610 calls)
> add_vuspsi : 0.82s CPU 0.86s WALL ( 610 calls)
>
> General routines
> calbec : 2.20s CPU 2.52s WALL ( 859 calls)
> fft : 40.10s CPU 76.07s WALL ( 12061 calls)
> ffts : 0.66s CPU 0.73s WALL ( 173 calls)
> fftw : 18.72s CPU 22.92s WALL ( 8916 calls)
>
> Parallel routines
> fft_scatt_xy : 15.80s CPU 20.80s WALL ( 21150 calls)
> fft_scatt_yz : 27.55s CPU 58.79s WALL ( 21150 calls)
> fft_scatt_tg : 3.60s CPU 4.31s WALL ( 8916 calls)
>
> PWSCF : 2m 1.29s CPU 2m54.94s WALL
>
>
> This version of QE was compiled on the Perlmutter supercomputer at
> NERSC. Here are the compile specifications:
>
> # Modules
>
>
> Currently Loaded Modules:
> 1) craype-x86-milan
> 2) libfabric/1.11.0.4.114
> 3) craype-network-ofi
> 4) perftools-base/22.04.0
> 5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
> 6) xalt/2.10.2
> 7) nvidia/21.11 (g,c)
> 8) craype/2.7.15 (c)
> 9) cray-dsmml/0.2.2
> 10) cray-mpich/8.1.15 (mpi)
> 11) PrgEnv-nvidia/8.3.3 (cpe)
> 12) Nsight-Compute/2022.1.1
> 13) Nsight-Systems/2022.2.1
> 14) cudatoolkit/11.5 (g)
> 15) cray-fftw/3.3.8.13 (math)
> 16) cray-hdf5-parallel/1.12.1.1 (io)
>
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
>
> ./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME \
>   --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel \
>   --enable-openmp --disable-shared --with-scalapack=yes \
>   FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc \
>   --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu --with-hdf5=${HDF5_DIR}
>
> make veryclean
> make all
>
> # go to the EPW directory and run make; then go to the main binary directory
> and link to the epw.x executable
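>
> A quick sanity check that the build actually ended up GPU-enabled (a sketch,
> assuming the standard QE build layout; the exact message text can vary
> between versions):
>
> grep __CUDA make.inc                # --with-cuda should add -D__CUDA to DFLAGS
> grep -i "GPU acceleration" pw.out   # recent GPU builds typically announce it in the output header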
>
>
> If there is any more information required, please let me know and I will
> try to get it promptly!
>
>
>
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216