[QE-users] k-points parallelization very slow
Paolo Giannozzi
p.giannozzi at gmail.com
Fri Feb 12 09:58:52 CET 2021
Parallelization over k-points requires very little communication, but it is not
as effective as plane-wave parallelization at distributing memory. I have also
noticed that, on a typical multi-core processor, the performance of k-point
parallelization is often worse than that of plane-wave parallelization, and
sometimes much worse, for reasons that are not completely clear to me.
A factor to consider is how your machine distributes the pools across the
nodes: each of the 4 pools of 32 processors should sit entirely on a single
node, but I wouldn't be too sure that this is what is actually happening.
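A quick way to check is to print the rank-to-node mapping with the same mpirun
command you use for pw.x. Below is a minimal mpi4py sketch (my own, not part of
QE): it assumes mpi4py is available on the cluster and that pools are assigned
as contiguous blocks of MPI ranks, which I believe is what the default pool
splitting does; set NPOOL to the value you pass to -nk. With Intel MPI, I think
setting I_MPI_DEBUG=4 also prints the pinning of ranks to hosts.

# pool_map.py - show which hosts each k-point pool lands on.
# Assumption: pools are contiguous blocks of MPI ranks; NPOOL = value of -nk.
from collections import defaultdict
from mpi4py import MPI

NPOOL = 4                               # same value as -nk

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
host = MPI.Get_processor_name()
pool = rank // (size // NPOOL)          # pool index of this rank

info = comm.gather((pool, host), root=0)
if rank == 0:
    hosts = defaultdict(set)
    for p, h in info:
        hosts[p].add(h)
    for p in sorted(hosts):
        warn = "" if len(hosts[p]) == 1 else "   <-- pool spans several nodes"
        print(f"pool {p}: {sorted(hosts[p])}{warn}")

Run it as, e.g., "mpirun -np 128 python pool_map.py". If a pool spans more than
one node, the FFT communication inside that pool goes over the interconnect,
which can easily eat up the advantage of k-point parallelization.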
In your test, there is an anomaly, though: most of the time of "c_bands"
(computing the band structure) should be spent in "cegterg" (iterative
diagonalization). With 4*8 processors:
c_bands      : 14153.20s CPU  14557.65s WALL (     461 calls)
Called by c_bands:
init_us_2    :   102.63s CPU    105.55s WALL (    1952 calls)
cegterg      : 12700.70s CPU  13083.44s WALL (     943 calls)
only 10% of the time is spent somewhere else, while with 4*32 processors:
c_bands      : 18068.08s CPU  18219.06s WALL (     454 calls)
Called by c_bands:
init_us_2    :    26.53s CPU     27.06s WALL (    1924 calls)
cegterg      :  2422.03s CPU   2451.72s WALL
most of the time (roughly 85%, from the entries shown) is not accounted for.
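To see where it goes, one can sum everything listed under "Called by c_bands:"
in the full log and compare with the c_bands total. A rough sketch of that
bookkeeping (assuming the usual "name : X.XXs CPU Y.YYs WALL" lines of the
final pw.x timing report; the script name is just an example):

# qe_cbands_time.py - how much of c_bands is covered by its listed callees?
# Quick-and-dirty parser for the final timing report of a pw.x output file.
import re
import sys

pat = re.compile(r"^\s*(\S+)\s*:\s*([\d.]+)s CPU\s+([\d.]+)s WALL")

wall = {}        # routine name -> WALL-clock seconds
callees = []     # routines listed under "Called by c_bands:"
in_block = False

with open(sys.argv[1]) as f:
    for line in f:
        if line.strip().startswith("Called by"):
            in_block = "c_bands" in line
            continue
        m = pat.match(line)
        if m is None:
            if not line.strip():     # a blank line closes a "Called by" block
                in_block = False
            continue
        wall[m.group(1)] = float(m.group(3))
        if in_block:
            callees.append(m.group(1))

total = wall["c_bands"]
covered = sum(wall[c] for c in callees)
print(f"c_bands: {total:.1f}s WALL, listed callees: {covered:.1f}s "
      f"({100 * covered / total:.0f}%), unaccounted: {100 * (1 - covered / total):.0f}%")

Run it as "python qe_cbands_time.py <pw.x output file>". Whatever is not
covered by the listed callees is spent elsewhere in c_bands (or in waits), and
that is what looks anomalous in your 4*32 run.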
Paolo
On Fri, Feb 12, 2021 at 5:01 AM Christoph Wolf <wolf.christoph at qns.science>
wrote:
> Dear all,
>
> I tested k-point parallelization and I wonder whether the following results
> are normal or whether my cluster has some serious problems...
>
> the system has 74 atoms and a 2x2x1 k-point grid resulting in 4 k-points
>
> number of k points= 4 Fermi-Dirac smearing, width (Ry)= 0.0050
> cart. coord. in units 2pi/alat
> k( 1) = ( 0.0000000 0.0000000 0.0000000), wk = 0.2500000
> k( 2) = ( 0.3535534 -0.3535534 0.0000000), wk = 0.2500000
> k( 3) = ( 0.0000000 -0.7071068 0.0000000), wk = 0.2500000
> k( 4) = ( -0.3535534 -0.3535534 0.0000000), wk = 0.2500000
>
>
> 1) run on 1 node x 32 CPUs with -nk 4
> Parallel version (MPI), running on 32 processors
>
> MPI processes distributed on 1 nodes
> K-points division: npool = 4
> R & G space division: proc/nbgrp/npool/nimage = 8
> Fft bands division: nmany = 1
>
> PWSCF : 5h42m CPU 6h 3m WALL
>
>
> 2) run on 4 nodes x 32 CPUs with -nk 4
> Parallel version (MPI), running on 128 processors
>
> MPI processes distributed on 4 nodes
> K-points division: npool = 4
> R & G space division: proc/nbgrp/npool/nimage = 32
> Fft bands division: nmany = 1
>
> PWSCF : 6h32m CPU 6h36m WALL
>
> I compiled pw.x with Intel 19 (MKL, MPI and OpenMP). If I understand
> correctly, -nk parallelization should work well because there is not much
> communication between the nodes, but this does not seem to be the case for me at all...
> detailed timing logs are attached!
>
> TIA!
> Chris
>
> --
> IBS Center for Quantum Nanoscience
> Seoul, South Korea
>
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222