[QE-users] efficient parallelization on a system without Infiniband

Ye Luo xw111luoye at gmail.com
Wed May 27 16:47:30 CEST 2020


A 3.26x speed-up seems plausible to me. The loss can be caused by load
imbalance in the iterative solver among the 4 k-point pools.
Could you list the wall times in seconds for the 1-node and 4-node runs,
i.e. the numbers you used to calculate the 3.26x?
Could you also try diago_david_ndim = 2 in the &ELECTRONS namelist and
report the 1-node and 4-node times in seconds for that case as well?
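
For reference, a minimal sketch of that change (diagonalization and
diago_david_ndim are standard &ELECTRONS keywords; the other values are
placeholders for whatever you already use):

  &ELECTRONS
    diagonalization  = 'david'   ! Davidson iterative solver (pw.x default)
    diago_david_ndim = 2         ! smaller Davidson workspace: less memory and
                                 ! less dense linear algebra per iteration
    conv_thr         = 1.0d-8    ! keep your own convergence threshold
  /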

In addition, you may try ELPA, which usually gives better performance than
ScaLAPACK.
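
Note that ELPA is selected when QE is built, not in the pw.x input. A rough
sketch of the build step (the paths and ELPA layout are assumptions for your
system; check ./configure --help of your QE version for the exact option
names):

  ./configure --with-scalapack=yes \
              --with-elpa-include=/path/to/elpa/include/elpa/modules \
              --with-elpa-lib=/path/to/elpa/lib/libelpa.a
  make pw

With ELPA enabled, the "Subspace diagonalization" lines in the pw.x output
should report the ELPA algorithm instead of the plain ScaLAPACK one.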

Thanks,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


On Wed, May 27, 2020 at 9:27 AM Michal Krompiec <michal.krompiec at gmail.com>
wrote:

> Hello,
> How can I minimize inter-node MPI communication in a pw.x run? My
> system doesn't have Infiniband and inter-node MPI can easily become
> the bottleneck.
> Let's say, I'm running a calculation with 4 k-points, on 4 nodes, with
> 56 MPI tasks per node. I would then use -npool 4 to create 4 pools for
> the k-point parallelization. However, it seems that the
> diagonalization is by default parallelized imperfectly (or isn't it?):
>      Subspace diagonalization in iterative solution of the eigenvalue
> problem:
>      one sub-group per band group will be used
>      scalapack distributed-memory algorithm (size of sub-group:  7*  7
> procs)
> So far, speedup on 4 nodes vs 1 node is 3.26x. Is it normal or does it
> look like it can be improved?
>
> Best regards,
>
> Michal Krompiec
> Merck KGaA
> Southampton, UK
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users
>
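
For the setup described in the question above (4 nodes x 56 MPI ranks, 4
k-point pools), a hedged sketch of a matching launch line (the mpirun syntax
and the input/output file names are assumptions; adapt to your MPI launcher
and scheduler):

  # 224 ranks split into 4 k-point pools of 56 ranks each; with block rank
  # placement each pool maps onto a single node, so pool-internal
  # communication stays intra-node. -ndiag 49 requests a 7x7 ScaLAPACK grid
  # per pool, the same size the "7*  7 procs" line reports as the default.
  mpirun -np 224 pw.x -npool 4 -ndiag 49 -input pw.scf.in > pw.scf.out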