<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Dear Paolo,</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thank you very much for your kind reply.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
> 72 OpenMP threads are too many, and in any case: is your compute node a 4*72=288 CPU one?</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Yes, my compute node has 288 CPUs.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
> 1. how to convince your machine to run the code the way it should (and not the way it shouldn't)</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I tested the same Slurm script for Si with 125 atoms and a 5x5x5 k-point grid. With "-npool 4" on 4 GPUs, it is ~15x faster than an MPI-only run on 128 CPUs. Very nice! I still see the same warning message, "High GPU oversubscription detected. Are you sure this is what you want?", but the significant speedup suggests that the GPU setup is correct. </div>
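<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
One way to check the rank-to-GPU mapping independently of the warning is to tally how many MPI ranks end up on each physical GPU. The following is only a minimal sketch: the function names and the GPU identifiers are hypothetical placeholders (in practice they could be the UUIDs that "nvidia-smi -L" prints inside each rank), and it says nothing about how pw.x itself decides to print the warning.</div>
<pre>
```python
from collections import Counter


def ranks_per_gpu(rank_to_gpu):
    """Count how many MPI ranks are attached to each physical GPU."""
    return Counter(rank_to_gpu.values())


def max_sharing(rank_to_gpu):
    """Largest number of ranks that share a single GPU."""
    return max(ranks_per_gpu(rank_to_gpu).values())


# Hypothetical identifiers, one entry per local MPI rank (assumption:
# collected by hand, e.g. from the UUIDs listed by `nvidia-smi -L`).
one_gpu_per_rank = {0: "GPU-a", 1: "GPU-b", 2: "GPU-c", 3: "GPU-d"}
all_on_one_gpu = {0: "GPU-a", 1: "GPU-a", 2: "GPU-a", 3: "GPU-a"}

print("one GPU per rank:", max_sharing(one_gpu_per_rank))
print("all ranks on one GPU:", max_sharing(all_on_one_gpu))
```
</pre>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
A maximum of 1 means every rank has its own GPU; a maximum equal to the number of ranks means all of them share one device, which is the situation the warning would be describing.</div>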
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
However, this speedup is not observed for the Gamma-point-only calculation, where using 2 GPUs instead of 1 can even worsen the performance. </div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
> Once you are sure that the code runs the way it should, have a look at the time reports for 1, 2, 4 GPUs, with no OpenMP threads.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I ran the tests without OpenMP threads, and they show trends similar to those with 72 OpenMP threads. Indeed, the OpenMP threads do not help much.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
GPUs | 72 OpenMP threads | no OpenMP threads |</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
1 | 11.9 secs | 13.8 secs |</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
2 | 7.0 secs | 18.0 secs |</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
4 | 9.7 secs | 10.9 secs |</div>
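<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
To make the scaling easier to read, the timings in the table can be turned into speedup and parallel efficiency relative to the 1-GPU run. This is a minimal sketch using only the numbers above; the names "timings" and "scaling" are illustrative, and efficiency is taken as speedup divided by the number of GPUs.</div>
<pre>
```python
# Timings in seconds from the table: GPUs -> (72 OpenMP threads, no OpenMP threads)
timings = {1: (11.9, 13.8), 2: (7.0, 18.0), 4: (9.7, 10.9)}


def scaling(times):
    """Return {n_gpus: (speedup, efficiency)} relative to the 1-GPU time."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


with_omp = scaling({n: t[0] for n, t in timings.items()})
no_omp = scaling({n: t[1] for n, t in timings.items()})

for n in sorted(timings):
    s_omp, e_omp = with_omp[n]
    s_no, e_no = no_omp[n]
    print(f"{n} GPU(s): 72 threads speedup {s_omp:.2f} (eff. {e_omp:.2f}); "
          f"no threads speedup {s_no:.2f} (eff. {e_no:.2f})")
```
</pre>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
With these numbers, the 2-GPU run without OpenMP threads has a speedup below 1 (about 0.77), and the 4-GPU efficiency is roughly 0.3 in both columns.</div>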
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
> Note that at Gamma point you have no "easy" parallelization levels to exploit.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Yes. I use multiple GPUs mainly to avoid the "out of memory" issue, but it turns out that this can harm the performance in the Gamma-point-only case.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Best,</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Xing</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Scientist, Paul Scherrer Institute (PSI)</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Paolo Giannozzi &lt;paolo.giannozzi@uniud.it&gt;<br>
<b>Sent:</b> Tuesday, May 13, 2025 7:58 PM<br>
<b>To:</b> Quantum ESPRESSO users Forum &lt;users@lists.quantum-espresso.org&gt;; Wang Xing &lt;xing.wang@psi.ch&gt;<br>
<b>Subject:</b> Re: [QE-users] Poor GPU scaling for Gamma-point-only calculation on multiple GPUs</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">On 13/05/2025 08:47, Wang Xing wrote:<br>
<br>
> #SBATCH --ntasks-per-node=4<br>
> #SBATCH --cpus-per-task=72<br>
<br>
72 OpenMP threads are too many, and in any case: is your compute node a <br>
4*72=288 CPU one?<br>
<br>
> srun pw.x -pd .true. -npool 1 -in aiida.in > aiida.out<br>
<br>
-pd has no effect for GPUs, I think<br>
<br>
> GPU acceleration is ACTIVE. 1 visible GPUs per MPI rank<br>
<br>
this doesn't look right to me: it should say 4 (but I don't know how <br>
reliable this message is)<br>
<br>
> GPU-aware MPI enabled<br>
> Message from routine print_cuda_info:<br>
> High GPU oversubscription detected. Are you sure this is what you want?<br>
<br>
this also doesn't look right: it seems to indicate that all four MPI <br>
processes access a single GPU (but I don't know how reliable this <br>
message is)<br>
<br>
> Does anyone have experience optimizing Gamma-point-only calculations on <br>
> multiple GPUs? Is there a known bottleneck or best practice for using <br>
> multiple GPUs efficiently in such a case?<br>
<br>
there are two distinct aspects here:<br>
1. how to convince your machine to run the code the way it should (and <br>
not the way it shouldn't) and<br>
2. how to optimize the parallelization over GPUs.<br>
I can't say anything about the former point: it is a task for system <br>
administrators. Once you are sure that the code runs the way it should, <br>
have a look at the time reports for 1, 2, 4 GPUs, with no OpenMP <br>
threads. You should easily spot anomalies or bottlenecks. Note that at <br>
Gamma point you have no "easy" parallelization levels to exploit.<br>
<br>
Paolo<br>
<br>
> Any insights would be greatly appreciated.<br>
> Best,<br>
> Xing<br>
> <br>
> <br>
> _______________________________________________________________________________<br>
> The Quantum ESPRESSO Foundation stands in solidarity with all civilians worldwide who are victims of terrorism, military aggression, and indiscriminate warfare.<br>
> --------------------------------------------------------------------------------<br>
> Quantum ESPRESSO is supported by MaX (<a href="http://www.max-centre.eu">www.max-centre.eu</a>)<br>
> users mailing list users@lists.quantum-espresso.org<br>
> <a href="https://lists.quantum-espresso.org/mailman/listinfo/users">https://lists.quantum-espresso.org/mailman/listinfo/users</a><br>
<br>
-- <br>
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,<br>
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216<br>
<br>
</div>
</span></font></div>
</body>
</html>