[QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.
Paolo Giannozzi
paolo.giannozzi at uniud.it
Tue Oct 29 21:56:44 CET 2024
On 29/10/2024 18:52, Dyer, Brock wrote:
> My current submission script uses 4 tasks per node, and my input only
> has 1 k-point. I feel it is pertinent to mention that I am running
> molecular systems, not a crystal or any sort of repeating structure.
> There are only 31 Kohn-Sham states in the system, and the FFT grid is
> (192,192,192).
Your system is not one of the best-performing ones on GPUs: it has few
Kohn-Sham states, a single k-point, and a large FFT grid for the charge
density and potentials. Moreover, your code version (7.0) has limited
GPU porting. In fact, the vast majority of the time is spent in
v_of_rho, which computes the potential from the charge density. This part
was ported to GPU only in later versions.
Paolo
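
For reference, a minimal sketch of how a newer release (7.2 or later, in
which v_of_rho runs on the GPU) could be rebuilt on Perlmutter, reusing the
configure options quoted further down in this thread. The release tag,
module names, and flags are assumptions and should be checked against the
current NERSC documentation:

  # sketch only: tag, modules and flags are assumptions, not a tested recipe
  module load PrgEnv-nvidia cudatoolkit cray-fftw cray-hdf5-parallel
  git clone -b qe-7.3 https://gitlab.com/QEF/q-e.git && cd q-e
  ./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME \
      --with-cuda-cc=80 --with-cuda-runtime=11.5 \
      --enable-parallel --enable-openmp
  # build pw.x only
  make pw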
> ------------------------------------------------------------------------
> *From:* Paolo Giannozzi <paolo.giannozzi at uniud.it>
> *Sent:* Monday, October 28, 2024 12:04 PM
> *To:* Quantum ESPRESSO users Forum <users at lists.quantum-espresso.org>
> *Cc:* Dyer, Brock <brdyer at ursinus.edu>
> *Subject:* Re: [QE-users] [QE-GPU] GPU runs significantly slower than
> CPU runs.
> The performance on GPU depends on many factors, e.g., the size of
> the system and how the code is run. One should run one MPI process per GPU
> and use low-communication parallelization (e.g. over k points) whenever possible.
>
> Paolo
>
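
Following the advice quoted above, a minimal Perlmutter-style batch script
sketch with one MPI rank per GPU; the account, walltime, thread count and
binding options are placeholders and assumptions to be checked against
NERSC's current recommendations:

  #!/bin/bash
  #SBATCH -A <account>            # placeholder
  #SBATCH -C gpu
  #SBATCH -q regular
  #SBATCH -N 1
  #SBATCH --ntasks-per-node=4     # one MPI rank per GPU
  #SBATCH --gpus-per-node=4
  #SBATCH --cpus-per-task=32
  #SBATCH -t 02:00:00

  export OMP_NUM_THREADS=8        # assumed thread count, tune as needed
  # with a single k point there is nothing to distribute over pools (-nk 1);
  # with more k points, -nk can go up to the number of GPUs
  srun --cpu-bind=cores pw.x -nk 1 -input pw.in > pw.out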
> On 10/24/24 17:38, Dyer, Brock wrote:
> >
> > Hi folks, I have been trying to use the QE 7.0 GPU-accelerated version
> > of pw.x lately, and have noticed that it is significantly (10x) slower
> > than the CPU version. The GPU nodes I use have an AMD EPYC 7763
> > processor (64 cores, 128 threads) and four NVIDIA A100 GPUs (40 GB each),
> > and the CPU nodes have 2 AMD EPYC 7763 processors. The time reports from
> > runs on identical input files are below (GPU first, then CPU):
> >
> > GPU Version:
> >
> > init_run : 14.17s CPU 19.29s WALL ( 1 calls)
> > electrons : 1352.63s CPU 1498.17s WALL ( 19 calls)
> > update_pot : 144.15s CPU 158.77s WALL ( 18 calls)
> > forces : 144.74s CPU 158.92s WALL ( 19 calls)
> >
> > Called by init_run:
> > wfcinit : 0.14s CPU 2.10s WALL ( 1 calls)
> > 2.10s GPU ( 1 calls)
> > potinit : 12.83s CPU 13.78s WALL ( 1 calls)
> > hinit0 : 0.29s CPU 0.35s WALL ( 1 calls)
> >
> > Called by electrons:
> > c_bands : 30.64s CPU 38.04s WALL ( 173 calls)
> > sum_band : 36.93s CPU 40.47s WALL ( 173 calls)
> > v_of_rho : 1396.71s CPU 1540.48s WALL ( 185 calls)
> > newd : 13.67s CPU 20.30s WALL ( 185 calls)
> > 9.04s GPU ( 167 calls)
> > mix_rho : 26.02s CPU 27.31s WALL ( 173 calls)
> > vdW_kernel : 4.99s CPU 5.01s WALL ( 1 calls)
> >
> > Called by c_bands:
> > init_us_2 : 0.24s CPU 0.39s WALL ( 347 calls)
> > init_us_2:gp : 0.23s CPU 0.38s WALL ( 347 calls)
> > regterg : 29.53s CPU 36.07s WALL ( 173 calls)
> >
> > Called by *egterg:
> > rdiaghg : 0.61s CPU 1.74s WALL ( 585 calls)
> > 1.72s GPU ( 585 calls)
> > h_psi : 26.71s CPU 33.73s WALL ( 611 calls)
> > 33.69s GPU ( 611 calls)
> > s_psi : 0.08s CPU 0.16s WALL ( 611 calls)
> > 0.14s GPU ( 611 calls)
> > g_psi : 0.00s CPU 0.04s WALL ( 437 calls)
> > 0.04s GPU ( 437 calls)
> >
> > Called by h_psi:
> > h_psi:calbec : 0.27s CPU 0.32s WALL ( 611 calls)
> > 0.32s GPU ( 611 calls)
> > vloc_psi : 26.11s CPU 33.04s WALL ( 611 calls)
> > 33.02s GPU ( 611 calls)
> > add_vuspsi : 0.06s CPU 0.14s WALL ( 611 calls)
> > 0.13s GPU ( 611 calls)
> >
> > General routines
> > calbec : 0.32s CPU 0.37s WALL ( 860 calls)
> > fft : 778.93s CPU 892.58s WALL ( 12061 calls)
> > 13.39s GPU ( 1263 calls)
> > ffts : 12.40s CPU 12.96s WALL ( 173 calls)
> > fftw : 30.44s CPU 39.53s WALL ( 3992 calls)
> > 38.80s GPU ( 3992 calls)
> > Parallel routines
> > PWSCF : 27m46.53s CPU 30m49.28s WALL
> >
> > CPU Version:
> >
> >
> > init_run : 2.35s CPU 2.79s WALL ( 1 calls)
> >
> > electrons : 99.04s CPU 142.56s WALL ( 19 calls)
> > update_pot : 9.01s CPU 13.47s WALL ( 18 calls)
> > forces : 9.89s CPU 14.35s WALL ( 19 calls)
> >
> > Called by init_run:
> > wfcinit : 0.08s CPU 0.17s WALL ( 1 calls)
> > potinit : 1.27s CPU 1.50s WALL ( 1 calls)
> > hinit0 : 0.27s CPU 0.33s WALL ( 1 calls)
> >
> > Called by electrons:
> > c_bands : 28.09s CPU 33.01s WALL ( 173 calls)
> > sum_band : 13.69s CPU 14.89s WALL ( 173 calls)
> > v_of_rho : 56.29s CPU 95.06s WALL ( 185 calls)
> > newd : 5.60s CPU 6.38s WALL ( 185 calls)
> > mix_rho : 1.37s CPU 1.65s WALL ( 173 calls)
> > vdW_kernel : 0.84s CPU 0.88s WALL ( 1 calls)
> >
> > Called by c_bands:
> > init_us_2 : 0.54s CPU 0.62s WALL ( 347 calls)
> > init_us_2:cp : 0.54s CPU 0.62s WALL ( 347 calls)
> > regterg : 27.54s CPU 32.31s WALL ( 173 calls)
> >
> > Called by *egterg:
> > rdiaghg : 0.45s CPU 0.49s WALL ( 584 calls)
> > h_psi : 23.00s CPU 27.54s WALL ( 610 calls)
> > s_psi : 0.64s CPU 0.66s WALL ( 610 calls)
> > g_psi : 0.04s CPU 0.04s WALL ( 436 calls)
> >
> > Called by h_psi:
> > h_psi:calbec : 1.53s CPU 1.75s WALL ( 610 calls)
> > vloc_psi : 20.46s CPU 24.73s WALL ( 610 calls)
> > vloc_psi:tg_ : 1.62s CPU 1.71s WALL ( 610 calls)
> > add_vuspsi : 0.82s CPU 0.86s WALL ( 610 calls)
> >
> > General routines
> > calbec : 2.20s CPU 2.52s WALL ( 859 calls)
> > fft : 40.10s CPU 76.07s WALL ( 12061 calls)
> > ffts : 0.66s CPU 0.73s WALL ( 173 calls)
> > fftw : 18.72s CPU 22.92s WALL ( 8916 calls)
> >
> > Parallel routines
> > fft_scatt_xy : 15.80s CPU 20.80s WALL ( 21150 calls)
> > fft_scatt_yz : 27.55s CPU 58.79s WALL ( 21150 calls)
> > fft_scatt_tg : 3.60s CPU 4.31s WALL ( 8916 calls)
> >
> > PWSCF : 2m 1.29s CPU 2m54.94s WALL
> >
> >
> > This version of QE was compiled on the Perlmutter supercomputer at
> > NERSC. Here are the compile specifications:
> >
> > # Modules
> >
> >
> > Currently Loaded Modules:
> > 1) craype-x86-milan
> > 2) libfabric/1.11.0.4.114
> > 3) craype-network-ofi
> > 4) perftools-base/22.04.0
> > 5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
> > 6) xalt/2.10.2
> > 7) nvidia/21.11 (g,c)
> > 8) craype/2.7.15 (c)
> > 9) cray-dsmml/0.2.2
> > 10) cray-mpich/8.1.15 (mpi)
> > 11) PrgEnv-nvidia/8.3.3 (cpe)
> > 12) Nsight-Compute/2022.1.1
> > 13) Nsight-Systems/2022.2.1
> > 14) cudatoolkit/11.5 (g)
> > 15) cray-fftw/3.3.8.13 (math)
> > 16) cray-hdf5-parallel/1.12.1.1 (io)
> >
> > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
> >
> > ./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME \
> >   --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel \
> >   --enable-openmp --disable-shared --with-scalapack=yes \
> >   FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc \
> >   --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu \
> >   --with-hdf5=${HDF5_DIR}
> >
> > make veryclean
> > make all
> >
> > # go to EPW directory: make; then go to main binary directory and link
> > to epw.x executable
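
As a quick check on where the time goes, a small sketch that pulls the
dominant timers out of the two runs above (the output file names are
placeholders for the GPU and CPU output files):

  # compare the heaviest routines reported by pw.x in the two runs
  for f in pw_gpu.out pw_cpu.out; do
      echo "== $f =="
      grep -E 'electrons|v_of_rho|fft |PWSCF' "$f"
  done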
> >
> >
> > If there is any more information required, please let me know and I will
> > try to get it promptly!
> >
> >
> >
> > _______________________________________________
> > The Quantum ESPRESSO community stands by the Ukrainian
> > people and expresses its concerns about the devastating
> > effects that the Russian military offensive has on their
> > country and on the free and peaceful scientific, cultural,
> > and economic cooperation amongst peoples
> > _______________________________________________
> > Quantum ESPRESSO is supported by MaX (https://
> linkprotect.cudasvc.com/url?a=https%3a%2f%2fwww.max-
> centre.eu&c=E,1,cWbnpY9zUPefGXxiZSF12XOautWJAuDDFiJbFYnq9WBzpkUV230lnuKhUlKJtRU9lbXSojSGIQLi8LepX_Phu8BlxnVOkg8GOAQC40NKg2g3WE-1tGdHyA,,&typo=1 <https://linkprotect.cudasvc.com/url?a=https%3a%2f%2fwww.max-centre.eu&c=E,1,cWbnpY9zUPefGXxiZSF12XOautWJAuDDFiJbFYnq9WBzpkUV230lnuKhUlKJtRU9lbXSojSGIQLi8LepX_Phu8BlxnVOkg8GOAQC40NKg2g3WE-1tGdHyA,,&typo=1>)
> > users mailing list users at lists.quantum-espresso.org
> > https://linkprotect.cudasvc.com/url?a=https%3a%2f%2flists.quantum-
> espresso.org%2fmailman%2flistinfo%2fusers&c=E,1,TNLdBqTaC8ekN7v0b1PfY8fJCd1FWca8_UUXHVf-5CQl5b-ay9pPk-X2E9lKriQUUcM3dalM445rfZw0W2XftHvT0QQaHLVstRxSUG7muw,,&typo=1 <https://linkprotect.cudasvc.com/url?a=https%3a%2f%2flists.quantum-espresso.org%2fmailman%2flistinfo%2fusers&c=E,1,TNLdBqTaC8ekN7v0b1PfY8fJCd1FWca8_UUXHVf-5CQl5b-ay9pPk-X2E9lKriQUUcM3dalM445rfZw0W2XftHvT0QQaHLVstRxSUG7muw,,&typo=1>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216