<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
My current submission script uses 4 MPI tasks per node, and my input has only 1 k-point. I feel it is pertinent to mention that I am running molecular systems, not a crystal or any other sort of repeating structure. There are only 31 Kohn-Sham states in the system, and
the FFT grid is (192,192,192). I had assumed that the GPU code would always be faster than the CPU code, maybe not by much, but certainly not 8-10x slower. Is that an unrealistic expectation?</div>
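<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
<br>
For reference, below is a minimal sketch of the submission script I plan to try next, following the one-MPI-rank-per-GPU suggestion below. The account name, queue, walltime, and thread count are placeholders rather than values from an actual run, and since the input has only 1 k-point there is nothing to distribute over pools, so the default of one pool is kept.<br>
<br>
#!/bin/bash<br>
#SBATCH -A myaccount_g &nbsp;&nbsp;# placeholder project/account<br>
#SBATCH -C gpu<br>
#SBATCH -q regular<br>
#SBATCH -N 1<br>
#SBATCH --ntasks-per-node=4 &nbsp;&nbsp;# one MPI rank per GPU<br>
#SBATCH --gpus-per-node=4<br>
#SBATCH --cpus-per-task=32<br>
#SBATCH -t 01:00:00<br>
<br>
export OMP_NUM_THREADS=8 &nbsp;&nbsp;# placeholder thread count, to be tuned<br>
<br>
# single k-point, so no pool parallelization (-npool 1 is the default)<br>
srun -n 4 --gpus-per-task=1 ./pw.x -npool 1 -in pw.in &gt; pw.out<br>
</div>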
<div id="appendonsend" style="color: inherit;"></div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
<br>
</div>
<hr style="display: inline-block; width: 98%;">
<div id="divRplyFwdMsg" dir="ltr" class="elementToProof" style="color: inherit;">
<span style="font-family: Calibri, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);"><b>From:</b> Paolo Giannozzi &lt;paolo.giannozzi@uniud.it&gt;<br>
</span></div>
<div style="direction: ltr; font-family: Calibri, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
<b>Sent:</b> Monday, October 28, 2024 12:04 PM<br>
<b>To:</b> Quantum ESPRESSO users Forum &lt;users@lists.quantum-espresso.org&gt;<br>
<b>Cc:</b> Dyer, Brock &lt;brdyer@ursinus.edu&gt;<br>
<b>Subject:</b> Re: [QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.</div>
<div style="direction: ltr;"> </div>
<div style="font-size: 11pt;">The performances on GPU depend upon a lot of factors, e.g., the size of<br>
the system and how the code is run. One should run one MPI per GPU and<br>
use low-communication parallelization (e.g. on k points) whenever possible.<br>
<br>
Paolo<br>
<br>
On 10/24/24 17:38, Dyer, Brock wrote:<br>
><br>
> Hi folks, I have been trying to use the QE 7.0 GPU-accelerated version<br>
> of pw.x lately, and have noticed that it is significantly (10x) slower<br>
> than the CPU version. The GPU nodes I use have an AMD EPYC 7763<br>
> processor (64 cores, 128 threads) and 4 NVIDIA A100 (40gb each) GPUs,<br>
> and the CPU nodes have 2 AMD EPYC 7763 processors. The time reports from<br>
> runs on identical input files are below (GPU first, then CPU):<br>
><br>
> GPU Version:<br>
><br>
> init_run : 14.17s CPU 19.29s WALL ( 1 calls)<br>
> electrons : 1352.63s CPU 1498.17s WALL ( 19 calls)<br>
> update_pot : 144.15s CPU 158.77s WALL ( 18 calls)<br>
> forces : 144.74s CPU 158.92s WALL ( 19 calls)<br>
><br>
> Called by init_run:<br>
> wfcinit : 0.14s CPU 2.10s WALL ( 1 calls)<br>
> 2.10s GPU ( 1 calls)<br>
> potinit : 12.83s CPU 13.78s WALL ( 1 calls)<br>
> hinit0 : 0.29s CPU 0.35s WALL ( 1 calls)<br>
><br>
> Called by electrons:<br>
> c_bands : 30.64s CPU 38.04s WALL ( 173 calls)<br>
> sum_band : 36.93s CPU 40.47s WALL ( 173 calls)<br>
> v_of_rho : 1396.71s CPU 1540.48s WALL ( 185 calls)<br>
> newd : 13.67s CPU 20.30s WALL ( 185 calls)<br>
> 9.04s GPU ( 167 calls)<br>
> mix_rho : 26.02s CPU 27.31s WALL ( 173 calls)<br>
> vdW_kernel : 4.99s CPU 5.01s WALL ( 1 calls)<br>
><br>
> Called by c_bands:<br>
> init_us_2 : 0.24s CPU 0.39s WALL ( 347 calls)<br>
> init_us_2:gp : 0.23s CPU 0.38s WALL ( 347 calls)<br>
> regterg : 29.53s CPU 36.07s WALL ( 173 calls)<br>
><br>
> Called by *egterg:<br>
> rdiaghg : 0.61s CPU 1.74s WALL ( 585 calls)<br>
> 1.72s GPU ( 585 calls)<br>
> h_psi : 26.71s CPU 33.73s WALL ( 611 calls)<br>
>                                            33.69s GPU  ( 611 calls)<br>
> s_psi : 0.08s CPU 0.16s WALL ( 611 calls)<br>
> 0.14s GPU ( 611 calls)<br>
> g_psi : 0.00s CPU 0.04s WALL ( 437 calls)<br>
> 0.04s GPU ( 437 calls)<br>
><br>
> Called by h_psi:<br>
> h_psi:calbec : 0.27s CPU 0.32s WALL ( 611 calls)<br>
>                                            0.32s GPU  ( 611 calls)<br>
> vloc_psi : 26.11s CPU 33.04s WALL ( 611 calls)<br>
>                                            33.02s GPU  ( 611 calls)<br>
> add_vuspsi : 0.06s CPU 0.14s WALL ( 611 calls)<br>
>                                            0.13s GPU  ( 611 calls)<br>
><br>
> General routines<br>
> calbec : 0.32s CPU 0.37s WALL ( 860 calls)<br>
> fft : 778.93s CPU 892.58s WALL ( 12061 calls)<br>
>                                            13.39s GPU  ( 1263 calls)<br>
> ffts : 12.40s CPU 12.96s WALL ( 173 calls)<br>
> fftw : 30.44s CPU 39.53s WALL ( 3992 calls)<br>
>                                            38.80s GPU  ( 3992 calls)<br>
> Parallel routines<br>
> PWSCF : 27m46.53s CPU 30m49.28s WALL<br>
><br>
> CPU Version:<br>
><br>
><br>
> init_run : 2.35s CPU 2.79s WALL ( 1 calls)<br>
><br>
> electrons : 99.04s CPU 142.56s WALL ( 19 calls)<br>
> update_pot : 9.01s CPU 13.47s WALL ( 18 calls)<br>
> forces : 9.89s CPU 14.35s WALL ( 19 calls)<br>
><br>
> Called by init_run:<br>
> wfcinit : 0.08s CPU 0.17s WALL ( 1 calls)<br>
> potinit : 1.27s CPU 1.50s WALL ( 1 calls)<br>
> hinit0 : 0.27s CPU 0.33s WALL ( 1 calls)<br>
><br>
> Called by electrons:<br>
> c_bands : 28.09s CPU 33.01s WALL ( 173 calls)<br>
> sum_band : 13.69s CPU 14.89s WALL ( 173 calls)<br>
> v_of_rho : 56.29s CPU 95.06s WALL ( 185 calls)<br>
> newd : 5.60s CPU 6.38s WALL ( 185 calls)<br>
> mix_rho : 1.37s CPU 1.65s WALL ( 173 calls)<br>
> vdW_kernel : 0.84s CPU 0.88s WALL ( 1 calls)<br>
><br>
> Called by c_bands:<br>
> init_us_2 : 0.54s CPU 0.62s WALL ( 347 calls)<br>
> init_us_2:cp : 0.54s CPU 0.62s WALL ( 347 calls)<br>
> regterg : 27.54s CPU 32.31s WALL ( 173 calls)<br>
><br>
> Called by *egterg:<br>
> rdiaghg : 0.45s CPU 0.49s WALL ( 584 calls)<br>
> h_psi : 23.00s CPU 27.54s WALL ( 610 calls)<br>
> s_psi : 0.64s CPU 0.66s WALL ( 610 calls)<br>
> g_psi : 0.04s CPU 0.04s WALL ( 436 calls)<br>
><br>
> Called by h_psi:<br>
> h_psi:calbec : 1.53s CPU 1.75s WALL ( 610 calls)<br>
> vloc_psi : 20.46s CPU 24.73s WALL ( 610 calls)<br>
> vloc_psi:tg_ : 1.62s CPU 1.71s WALL ( 610 calls)<br>
> add_vuspsi : 0.82s CPU 0.86s WALL ( 610 calls)<br>
><br>
> General routines<br>
> calbec : 2.20s CPU 2.52s WALL ( 859 calls)<br>
> fft : 40.10s CPU 76.07s WALL ( 12061 calls)<br>
> ffts : 0.66s CPU 0.73s WALL ( 173 calls)<br>
> fftw : 18.72s CPU 22.92s WALL ( 8916 calls)<br>
><br>
> Parallel routines<br>
> fft_scatt_xy : 15.80s CPU 20.80s WALL ( 21150 calls)<br>
> fft_scatt_yz : 27.55s CPU 58.79s WALL ( 21150 calls)<br>
> fft_scatt_tg : 3.60s CPU 4.31s WALL ( 8916 calls)<br>
><br>
> PWSCF : 2m 1.29s CPU 2m54.94s WALL<br>
><br>
><br>
> This version of QE was compiled on the Perlmutter supercomputer at<br>
> NERSC. Here are the compile specifications:<br>
><br>
> # Modules<br>
><br>
><br>
> Currently Loaded Modules:<br>
> 1) craype-x86-milan<br>
> 2) libfabric/1.11.0.4.114<br>
> 3) craype-network-ofi<br>
> 4) perftools-base/22.04.0<br>
> 5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta<br>
> 6) xalt/2.10.2<br>
> 7) nvidia/21.11 (g,c)<br>
> 8) craype/2.7.15 (c)<br>
> 9) cray-dsmml/0.2.2<br>
> 10) cray-mpich/8.1.15 (mpi)<br>
> 11) PrgEnv-nvidia/8.3.3 (cpe)<br>
> 12) Nsight-Compute/2022.1.1<br>
> 13) Nsight-Systems/2022.2.1<br>
> 14) cudatoolkit/11.5 (g)<br>
> 15) cray-fftw/3.3.8.13 (math)<br>
> 16) cray-hdf5-parallel/1.12.1.1 (io)<br>
><br>
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib<br>
><br>
> ./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME<br>
> --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel<br>
> --enable-openmp --disable-shared --with-scalapack=yes<br>
> FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc<br>
> --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu --with-hdf5=${HDF5_DIR}<br>
><br>
> make veryclean<br>
> make all<br>
><br>
> # go to the EPW directory and run make; then go to the main binary directory<br>
> and link to the epw.x executable<br>
><br>
><br>
> If there is any more information required, please let me know and I will<br>
> try to get it promptly!<br>
><br>
><br>
><br>
> _______________________________________________<br>
> The Quantum ESPRESSO community stands by the Ukrainian<br>
> people and expresses its concerns about the devastating<br>
> effects that the Russian military offensive has on their<br>
> country and on the free and peaceful scientific, cultural,<br>
> and economic cooperation amongst peoples<br>
> _______________________________________________<br>
> Quantum ESPRESSO is supported by MaX (<a href="https://www.max-centre.eu">https://www.max-centre.eu</a>)<br>
> users mailing list users@lists.quantum-espresso.org<br>
> <a href="https://lists.quantum-espresso.org/mailman/listinfo/users">https://lists.quantum-espresso.org/mailman/listinfo/users</a><br>
<br>
--<br>
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,<br>
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216</div>
</body>
</html>