<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;

        mso-fareast-language:EN-US;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:#0563C1;

        text-decoration:underline;}

span.EmailStyle17

        {mso-style-type:personal-compose;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-family:"Calibri",sans-serif;

        mso-fareast-language:EN-US;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:72.0pt 72.0pt 72.0pt 72.0pt;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

</head>

<body lang="en-BE" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">

<div class="WordSection1">

<p style="margin:0cm"><span lang="EN-US">Dear QE users,<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">I need some help optimizing the different parallelization levels of my QE calculations. Unfortunately, our HPC center is going to start billing research groups for their calculations, so I'm working on making our QE calculations as efficient as possible to avoid large bills at the end of the year. We have two clusters at our disposal with different architectures and different billing rates, so I want to figure out which one to use and how many nodes to request per calculation.<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">The two clusters have the following architecture:<br>

- Cluster 1 (Leibniz): 152 compute nodes containing 2 Xeon E5-2680v4 CPUs @ 2.4 GHz (Broadwell), 14 cores each (28 cores per node in total)<br>

- Cluster 2 (Vaughan): 152 compute nodes containing 2 AMD Epyc 7452 CPUs @ 2.35 GHz (Rome), 32 cores each (64 cores per node in total)<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">From the QE documentation I understand that only a few parameters matter for the parallelization. These parameters are given below with their values for my systems:<br>

- No. of k-points = 2<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">- 3rd dimension in the smooth FFT grid = 405<br>

- 3rd dimension in the dense FFT grid = 720<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">- No. of KS states = 457<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">My current parallelization settings are rather arbitrary: I request 8 nodes (of 28 cores each) per calculation, set the k-point parallelization to 2 pools (i.e., -nk 2), and use the serial algorithm for subspace diagonalization (i.e., -nd 1), so that the calculation completes within a reasonable time.<o:p></o:p></span></p>
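
<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">To make these settings concrete, here is a minimal Python sketch of how many MPI ranks end up in each pool. It assumes one MPI rank per core and no OpenMP threads; the function name ranks_per_pool is only illustrative and is not part of QE.<o:p></o:p></span></p>

<pre>
```python
def ranks_per_pool(nodes: int, cores_per_node: int, npools: int) -> int:
    """MPI ranks in each k-point pool (assumption: one rank per core, no OpenMP)."""
    total = nodes * cores_per_node
    if total % npools != 0:
        raise ValueError("total ranks must be divisible by the number of pools")
    return total // npools


# Current setup: 8 Leibniz nodes of 28 cores, split into 2 pools (-nk 2)
print(ranks_per_pool(nodes=8, cores_per_node=28, npools=2))  # 112
```
</pre>

<p style="margin:0cm"><span lang="EN-US">So with my current settings each pool spans 4 nodes (112 ranks).<o:p></o:p></span></p>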

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">I've already read a lot about the parallelization implemented in QE, but I still have several questions relating to the different levels of parallelization:<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">k-point parallelization:<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">From what I understand, having only 2 k-points means that I can split the processors into at most 2 pools, since the number of pools cannot exceed the number of k-points, so that each pool handles a single k-point. If I used more pools, performance would suffer because multiple pools would handle a single k-point, resulting in heavy communication between these pools. Therefore, I am wondering whether it is also bad to request more than 2 nodes, given that I only have two k-points and split my processors into 2 pools. Requesting more than 2 nodes would mean that every pool contains processors spread across multiple nodes, so each pool would need inter-node communication to do its computations, which would slow down the calculation.<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">FFT parallelization:<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">The pw.x documentation states that the parallelization on PWs yields the best results when the number of processors in a pool is a divisor of the 3rd dimension of both the smooth (nr3s) and dense (nr3) FFT grids. Unfortunately, in my case the greatest common divisor of the two dimensions is 45, which is a bad match for the number of cores per node on either cluster (28 and 64, respectively). Therefore, I was wondering whether it is okay to manually set the third dimensions with nr3=X and nr3s=X so that the number of processors available is a common divisor of the third dimensions of both FFT grids?<o:p></o:p></span></p>
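
<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">The divisor arithmetic above can be checked with a short Python sketch. The helper names are hypothetical, and comparing against the cores of a single node is a simplification on my part, since what matters is the number of processors per pool.<o:p></o:p></span></p>

<pre>
```python
from math import gcd


def common_divisors(a: int, b: int) -> list[int]:
    """All positive integers that divide both a and b."""
    g = gcd(a, b)
    return [d for d in range(1, g + 1) if g % d == 0]


def largest_fitting(divisors: list[int], limit: int) -> int:
    """Largest divisor that does not exceed limit."""
    return max(d for d in divisors if limit >= d)


NR3S, NR3 = 405, 720  # 3rd dimensions of the smooth and dense FFT grids
divs = common_divisors(NR3S, NR3)
print(divs)  # [1, 3, 5, 9, 15, 45]

# Cores per node on the two clusters described earlier
for name, cores in {"Leibniz": 28, "Vaughan": 64}.items():
    print(name, largest_fitting(divs, cores))  # Leibniz 15, then Vaughan 45
```
</pre>

<p style="margin:0cm"><span lang="EN-US">With the default grids, only 15 of the 28 cores on a Leibniz node, or 45 of the 64 cores on a Vaughan node, would give a common divisor, which is why I am asking about changing nr3 and nr3s.<o:p></o:p></span></p>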

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">Bands and tasks parallelization:<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">If I'm not mistaken, I should not use band or task group parallelization: band parallelization is only useful with hybrid functionals (which I don't use), and task group parallelization is only necessary when the number of processors exceeds the number of FFT planes (which is not the case here unless I request an excessive number of nodes, which would already be detrimental due to the inter-node communication).<o:p></o:p></span></p>
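
<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">As a rough check of that last condition, the sketch below estimates how many nodes I would need before the processors in one pool outnumber the FFT planes. It assumes 2 pools, one MPI rank per core, and cores that split evenly over the pools; I am not sure which grid is the relevant one, so both are shown. The function name is illustrative only.<o:p></o:p></span></p>

<pre>
```python
def min_nodes_exceeding(planes: int, cores_per_node: int, npools: int = 2) -> int:
    """Smallest node count for which ranks per pool exceed the number of FFT planes.

    Assumes one MPI rank per core and cores_per_node divisible by npools.
    """
    ranks_per_node_per_pool = cores_per_node // npools
    return planes // ranks_per_node_per_pool + 1


for name, cores in {"Leibniz": 28, "Vaughan": 64}.items():
    smooth = min_nodes_exceeding(405, cores)  # nr3s
    dense = min_nodes_exceeding(720, cores)   # nr3
    print(name, smooth, dense)
# Leibniz 29 52
# Vaughan 13 23
```
</pre>

<p style="margin:0cm"><span lang="EN-US">Under these assumptions, my current 8-node jobs stay well below either threshold.<o:p></o:p></span></p>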

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">OpenMP parallelization:<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">OpenMP cannot be used to coordinate multi-node jobs, so if I wanted to use this level of parallelization, I would have to make sure that the number of processors in a pool is lower than or equal to the number of processors on a single node, right?<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">Any help on the subject would be greatly appreciated!<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US"> <o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">Thanks in advance,<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">Léon Luntadila Lufungula<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">Structural Chemistry Group<o:p></o:p></span></p>

<p style="margin:0cm"><span lang="EN-US">University of Antwerp, Belgium<o:p></o:p></span></p>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

</body>

</html>