[QE-users] efficient parallelization on a system without Infiniband

Michal Krompiec michal.krompiec at gmail.com
Wed May 27 16:26:49 CEST 2020


Hello,
How can I minimize inter-node MPI communication in a pw.x run? My
system doesn't have InfiniBand, and inter-node MPI can easily become
the bottleneck.
Let's say I'm running a calculation with 4 k-points, on 4 nodes, with
56 MPI tasks per node. I would then use -npool 4 to create 4 pools for
the k-point parallelization. However, it seems that by default the
subspace diagonalization does not use all the processors in each pool
(or does it?):
     Subspace diagonalization in iterative solution of the eigenvalue problem:
     one sub-group per band group will be used
     scalapack distributed-memory algorithm (size of sub-group:  7*  7 procs)
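
For concreteness, the launch command looks roughly like this (just a
sketch: the input/output file names are placeholders, and I'm assuming
a generic mpirun launcher):

     # 224 ranks = 4 nodes x 56 tasks; -npool 4 gives one k-point pool per node
     mpirun -np 224 pw.x -npool 4 -input pw.in > pw.out
     # the linear-algebra group could also be set explicitly with -ndiag,
     # e.g. -ndiag 49, matching the 7*7 sub-group reported above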
So far, the speedup on 4 nodes vs 1 node is 3.26x. Is this normal, or
does it look like it can be improved?

Best regards,

Michal Krompiec
Merck KGaA
Southampton, UK

