[QE-users] efficient parallelization on a system without Infiniband

Michal Krompiec michal.krompiec at gmail.com
Wed May 27 21:01:22 CEST 2020


Dear Ye, Dear Paolo,
I re-ran the benchmarks for my test case: a single MD step of a smallish
supercell of a certain oxide semiconductor, with PBE and PAW (from PSlib).
The timings I reported previously were measured from the start of the MD run
until the end of the 1st SCF iteration of the 2nd MD step.

Interestingly, ELPA gave no advantage over ScaLAPACK, and
diago_david_ndim=2 made things significantly slower.
The ScaLAPACK build is QE 6.5; the ELPA build is the QE development version
from last month. Both were compiled with Intel 2020 compilers and Intel MPI.
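
For reference, the only input-file toggle varied between these runs is
diago_david_ndim in the &ELECTRONS namelist; the ELPA/ScaLAPACK choice is
made when pw.x is built (hence the two builds), not in the input file.
A minimal sketch of the relevant block, everything else left at its defaults:

    &ELECTRONS
      diagonalization  = 'david'  ! Davidson iterative solver (the pw.x default)
      diago_david_ndim = 4        ! dimension of the Davidson workspace; 2 is the other value tested
    /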

Here are the numbers:

MPI tasks/node  npool  nodes  solver     diago_david_ndim  time (s)  speedup vs 1 node
            56      4      1  ELPA                      4      1335                  -
            56      4      1  ELPA                      2      1931                  -
            56      4      1  ScaLAPACK                 4       976                  -
            56      4      1  ScaLAPACK                 2      1486                  -
            56      4      4  ELPA                      4       367           3.637602
            56      4      4  ELPA                      2       729           2.648834
            56      4      4  ScaLAPACK                 4       357           2.733894
            56      4      4  ScaLAPACK                 2       555           2.677477
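
For completeness, a sketch of how the 4-node / 4-pool layout above is
typically launched (assuming Intel MPI's mpirun, where -ppn sets ranks per
node; md.in and md.out are placeholder file names, and -nd simply makes the
7x7 diagonalization sub-group explicit instead of letting pw.x pick it):

    # 4 nodes x 56 ranks = 224 MPI tasks, one pool per k-point,
    # 49 = 7x7 tasks per pool used for the ScaLAPACK/ELPA diagonalization
    mpirun -np 224 -ppn 56 pw.x -nk 4 -nd 49 -in md.in > md.out
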
Best,
Michal


On Wed, 27 May 2020 at 15:47, Ye Luo <xw111luoye at gmail.com> wrote:

> 3.26x seems possible to me. It can be caused by load imbalance in the
> iterative solver among the 4 k-points.
> Could you list the times in seconds with 1 node and 4 nodes, i.e. the ones
> you used to calculate the 3.26x?
> Could you also try diago_david_ndim=2 under "&ELECTRONS" and provide the
> 1- and 4-node times in seconds?
>
> In addition, you may try ELPA, which usually gives better performance than
> ScaLAPACK.
>
> Thanks,
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> On Wed, May 27, 2020 at 9:27 AM Michal Krompiec <michal.krompiec at gmail.com>
> wrote:
>
>> Hello,
>> How can I minimize inter-node MPI communication in a pw.x run? My
>> system doesn't have InfiniBand, so inter-node MPI can easily become
>> the bottleneck.
>> Let's say I'm running a calculation with 4 k-points on 4 nodes, with
>> 56 MPI tasks per node. I would then use -npool 4 to create 4 pools for
>> the k-point parallelization. However, it seems that by default the
>> diagonalization is parallelized over only a sub-group of each pool's
>> processes (or am I misreading this?):
>>      Subspace diagonalization in iterative solution of the eigenvalue problem:
>>      one sub-group per band group will be used
>>      scalapack distributed-memory algorithm (size of sub-group:  7*  7 procs)
>> So far, the speedup on 4 nodes vs 1 node is 3.26x. Is that normal, or
>> does it look like it can be improved?
>>
>> Best regards,
>>
>> Michal Krompiec
>> Merck KGaA
>> Southampton, UK