[QE-users] Poor GPU scaling for Gamma-point-only calculation on multiple GPUs
Wang Xing
xing.wang at psi.ch
Tue May 13 21:05:10 CEST 2025
Dear Paolo,
Thank you very much for your kind reply.
> 72 OpenMP threads are too many, and in any case: is your compute node a 4*72=288 CPU one?
Yes, my compute node has 288 CPUs.
> 1. how to convince your machine to run the code the way it should (and not the way it shouldn't)
I tested the same Slurm script on Si with 125 atoms and a 5x5x5 k-point mesh: with "-npool 4" on 4 GPUs it is ~15x faster than an MPI-only run on 128 CPUs. Very nice! I still saw the same warning, "High GPU oversubscription detected. Are you sure this is what you want?", but the significant speedup suggests the GPU setup is correct.
However, no such speedup is observed for the Gamma-point-only calculation; there, using 2 GPUs can even be slower than 1 GPU.
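For reference, the launch line for that k-point test could look like the sketch below (the file names are placeholders and the SBATCH header is the one quoted further down in this thread):

  # 4 MPI ranks = 4 GPUs, one k-point pool per rank; input/output names are illustrative
  srun pw.x -npool 4 -in si125.in > si125.out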
> Once you are sure that the code runs the way it should, have a look at the time reports for 1, 2, 4 GPUs, with no OpenMP threads.
I ran the tests without OpenMP threads, and they show a similar trend to the runs with 72 OpenMP threads; indeed, the OpenMP threads do not help much (see the sketch after the table below).
GPUs | 72 OpenMP threads | no OpenMP threads
1    | 11.9 s            | 13.8 s
2    |  7.0 s            | 18.0 s
4    |  9.7 s            | 10.9 s
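A minimal sketch of such a no-OpenMP launch, assuming the same SLURM header and using the standard OMP_NUM_THREADS variable to force a single thread per rank:

  export OMP_NUM_THREADS=1          # disable OpenMP threading for these runs
  srun pw.x -npool 1 -in aiida.in > aiida.out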
> Note that at Gamma point you have no "easy" parallelization levels to exploit.
Yes. I use multiple GPUs mainly to avoid the "out of memory" issue, but it turns out that this can hurt performance in the Gamma-point-only case.
Best,
Xing
Scientist, Paul Scherrer Institute (PSI)
________________________________
From: Paolo Giannozzi <paolo.giannozzi at uniud.it>
Sent: Tuesday, May 13, 2025 7:58 PM
To: Quantum ESPRESSO users Forum <users at lists.quantum-espresso.org>; Wang Xing <xing.wang at psi.ch>
Subject: Re: [QE-users] Poor GPU scaling for Gamma-point-only calculation on multiple GPUs
On 13/05/2025 08:47, Wang Xing wrote:
> #SBATCH --ntasks-per-node=4
> #SBATCH --cpus-per-task=72
72 OpenMP threads are too many, and in any case: is your compute node a
4*72=288 CPU one?
> srun pw.x -pd .true. -npool 1 -in aiida.in > aiida.out
-pd has no effect for GPUs, I think
> GPU acceleration is ACTIVE. 1 visible GPUs per MPI rank
this doesn't look right to me: it should say 4 (but I don't know how
reliable this message is)
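A simple check of what each rank actually sees is sketched below (whether CUDA_VISIBLE_DEVICES is set at all depends on the scheduler/wrapper configuration, so an empty value is also informative):

  srun bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'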
> GPU-aware MPI enabled
> Message from routine print_cuda_info:
> High GPU oversubscription detected. Are you sure this is what you want?
this also doesn't look right: it seems to indicate that all four MPI
processes access a single GPU (but I don't know how reliable this
message is)
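One way to rule out such oversubscription is to ask SLURM to bind one GPU per MPI task; a sketch follows, but whether these options are available and behave this way depends on the cluster's SLURM/GPU setup, so treat it as an assumption to check with the system administrators:

  #SBATCH --ntasks-per-node=4
  #SBATCH --cpus-per-task=8            # a more moderate thread count than 72
  #SBATCH --gpus-per-task=1            # one GPU per MPI rank, if supported
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  srun pw.x -npool 1 -in aiida.in > aiida.out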
> Does anyone have experience optimizing Gamma-point-only calculations on
> multiple GPUs? Is there a known bottleneck or best practice for using
> multiple GPUs efficiently in such a case?
there are two distinct aspects here:
1. how to convince your machine to run the code the way it should (and
not the way it shouldn't) and
2. how to optimize the parallelization over GPUs.
I can't say anything about the former point: it is a task for system
administrators. Once you are sure that the code runs the way it should,
have a look at the time reports for 1, 2, 4 GPUs, with no OpenMP
threads. You should easily spot anomalies or bottlenecks. Note that at
Gamma point you have no "easy" parallelization levels to exploit.
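A quick way to collect those numbers is to repeat the run with 1, 2 and 4 MPI ranks (one GPU each) and pull the total time from the end of each output; a sketch, assuming the usual "PWSCF ... CPU ... WALL" line of the time report:

  export OMP_NUM_THREADS=1
  for n in 1 2 4; do
      srun --ntasks=$n pw.x -npool 1 -in aiida.in > aiida_${n}gpu.out
      grep 'PWSCF.*WALL' aiida_${n}gpu.out    # total CPU/WALL time of the run
  done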
Paolo
> Any insights would be greatly appreciated.
> Best,
> Xing
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216