<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

</head>

<body dir="ltr">

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Dear Paolo,</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Thank you very much for your kind reply.</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

> 72 OpenMP threads are too many, and in any case: is your compute node a 4*72=288 CPU one?</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Yes, my compute node has 288 CPUs.</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

>  1. how to convince your machine to run the code the way it should (and not the way it shouldn't)</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

I tested the same Slurm script for Si with 125 atoms and a 5x5x5 k-point grid. With "-npool 4" on 4 GPUs, it is ~15x faster than an MPI-only run on 128 CPUs. Very nice! I still saw the same warning message, "High GPU oversubscription detected. Are you sure this is what you want?", but the significant speedup suggests that the GPU setup is correct.</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

However, no such speedup is observed for the Gamma-point-only calculation, and performance gets worse with 2 GPUs than with 1 GPU. </div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

> Once you are sure that the code runs the way it should, have a look at the time reports for 1, 2, 4 GPUs, with no OpenMP threads.</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

I ran the tests without OpenMP threads, and they show trends similar to those with 72 OpenMP threads. Indeed, the OpenMP threads do not help much.</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<table>

<tr><th>GPUs</th><th>72 OpenMP threads</th><th>No OpenMP threads</th></tr>

<tr><td>1</td><td>11.9 s</td><td>13.8 s</td></tr>

<tr><td>2</td><td>7.0 s</td><td>18.0 s</td></tr>

<tr><td>4</td><td>9.7 s</td><td>10.9 s</td></tr>

</table>

</div>
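<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

As a rough sketch, the speedup and parallel efficiency relative to the 1-GPU run can be computed from the timings above. The names "timings" and "scaling" are only illustrative and are not part of Quantum ESPRESSO.</div>

<pre>
```python
# Times in seconds, copied from the timings above (number of GPUs to time).
timings = {
    "72 OpenMP threads": {1: 11.9, 2: 7.0, 4: 9.7},
    "no OpenMP threads": {1: 13.8, 2: 18.0, 4: 10.9},
}


def scaling(times):
    """Return {n_gpus: (speedup, efficiency)} relative to the 1-GPU run."""
    base = times[1]
    # speedup = t(1 GPU) / t(n GPUs); efficiency = speedup / n
    return {n: (base / t, base / (t * n)) for n, t in sorted(times.items())}


for label, times in timings.items():
    print(label)
    for n, (speedup, eff) in scaling(times).items():
        print(f"  {n} GPU(s): speedup {speedup:.2f}x, efficiency {eff:.0%}")
```
</pre>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

With these numbers, the 2-GPU run without OpenMP threads has a speedup below 1 (13.8/18.0, about 0.77), i.e. it is slower than the 1-GPU run.</div>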

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

> Note that at Gamma point you have no "easy" parallelization levels to exploit.</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Yes. I use multiple GPUs mainly to avoid the "out of memory" issue, but it turns out that this can harm the performance in the Gamma-point-only case.</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Best,</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Xing</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Scientist, Paul Scherrer Institute (PSI)</div>

<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div id="appendonsend"></div>

<hr style="display:inline-block;width:98%" tabindex="-1">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Paolo Giannozzi &lt;paolo.giannozzi@uniud.it&gt;<br>

<b>Sent:</b> Tuesday, May 13, 2025 7:58 PM<br>

<b>To:</b> Quantum ESPRESSO users Forum &lt;users@lists.quantum-espresso.org&gt;; Wang Xing &lt;xing.wang@psi.ch&gt;<br>

<b>Subject:</b> Re: [QE-users] Poor GPU scaling for Gamma-point-only calculation on multiple GPUs</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">

<div class="PlainText">On 13/05/2025 08:47, Wang Xing wrote:<br>

<br>

> #SBATCH --ntasks-per-node=4<br>

> #SBATCH --cpus-per-task=72<br>

<br>

72 OpenMP threads are too many, and in any case: is your compute node a <br>

4*72=288 CPU one?<br>

<br>

> srun pw.x -pd .true. -npool 1 -in aiida.in > aiida.out<br>

<br>

-pd has no effect for GPUs, I think<br>

<br>

> GPU acceleration is ACTIVE. 1 visible GPUs per MPI rank<br>

<br>

this doesn't look right to me: it should say 4 (but I don't know how <br>

reliable this message is)<br>

<br>

> GPU-aware MPI enabled<br>

> Message from routine print_cuda_info:<br>

>    High GPU oversubscription detected. Are you sure this is what you want?<br>

<br>

this also doesn't look right: it seems to indicate that all four MPI <br>

processes access a single GPU (but I don't know how reliable this <br>

message is)<br>

<br>

> Does anyone have experience optimizing Gamma-point-only calculations on <br>

> multiple GPUs? Is there a known bottleneck or best practice for using <br>

> multiple GPUs efficiently in such a case?<br>

<br>

there are two distinct aspects here:<br>

1. how to convince your machine to run the code the way it should (and <br>

not the way it shouldn't) and<br>

2. how to optimize the parallelization over GPUs.<br>

I can't say anything about the former point: it is a task for system <br>

administrators. Once you are sure that the code runs the way it should, <br>

have a look at the time reports for 1, 2, 4 GPUs, with no OpenMP <br>

threads. You should easily spot anomalies or bottlenecks. Note that at <br>

Gamma point you have no "easy" parallelization levels to exploit.<br>

<br>

Paolo<br>

<br>

> Any insights would be greatly appreciated.<br>

> Best,<br>

> Xing<br>

> <br>

> <br>

> _______________________________________________________________________________<br>

> The Quantum ESPRESSO Foundation stands in solidarity with all civilians worldwide who are victims of terrorism, military aggression, and indiscriminate warfare.<br>

> --------------------------------------------------------------------------------<br>

> Quantum ESPRESSO is supported by MaX (<a href="http://www.max-centre.eu">www.max-centre.eu</a>)<br>

> users mailing list users@lists.quantum-espresso.org<br>

> <a href="https://lists.quantum-espresso.org/mailman/listinfo/users">https://lists.quantum-espresso.org/mailman/listinfo/users</a><br>

<br>

-- <br>

Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,<br>

Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216<br>

<br>

</div>

</span></font></div>

</body>

</html>