[QE-users] k-points parallelization very slow
Paolo Giannozzi
p.giannozzi at gmail.com
Fri Feb 12 09:58:52 CET 2021
Parallelization over k-points requires very little communication, but it is not
as effective as plane-wave parallelization at distributing memory. I have also
noticed that, on a typical multi-core processor, the performance of k-point
parallelization is often worse than that of plane-wave parallelization, and
sometimes much worse, for reasons that are not completely clear to me.
A factor to consider is how your machine distributes the pools across the
nodes: each of the 4 pools of 32 processors should sit entirely on a single
node, but I wouldn't be too sure that this is what is actually happening.
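A quick way to check is to print the rank-to-node mapping with the same mpirun
command you use for pw.x. Below is a minimal mpi4py sketch (my own, not part of
QE): it assumes mpi4py is available on the cluster and that pools are assigned
as contiguous blocks of MPI ranks, which I believe is what the default pool
splitting does; set NPOOL to the value you pass to -nk. With Intel MPI, I think
setting I_MPI_DEBUG=4 also prints the pinning of ranks to hosts.

# pool_map.py - show which hosts each k-point pool lands on.
# Assumption: pools are contiguous blocks of MPI ranks; NPOOL = value of -nk.
from collections import defaultdict
from mpi4py import MPI

NPOOL = 4                               # same value as -nk

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
host = MPI.Get_processor_name()
pool = rank // (size // NPOOL)          # pool index of this rank

info = comm.gather((pool, host), root=0)
if rank == 0:
    hosts = defaultdict(set)
    for p, h in info:
        hosts[p].add(h)
    for p in sorted(hosts):
        warn = "" if len(hosts[p]) == 1 else "   <-- pool spans several nodes"
        print(f"pool {p}: {sorted(hosts[p])}{warn}")

Run it as, e.g., "mpirun -np 128 python pool_map.py". If a pool spans more than
one node, the FFT communication inside that pool goes over the interconnect,
which can easily eat up the advantage of k-point parallelization.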
In your test, there is an anomaly, though: most of the time of "c_bands"
(computing the band structure) should be spent in "cegterg" (iterative
diagonalization). With 4*8 processors:
c_bands      : 14153.20s CPU  14557.65s WALL (     461 calls)
Called by c_bands:
init_us_2    :   102.63s CPU    105.55s WALL (    1952 calls)
cegterg      : 12700.70s CPU  13083.44s WALL (     943 calls)
only 10% of the time is spent somewhere else, while with 4*32 processors:
c_bands      : 18068.08s CPU  18219.06s WALL (     454 calls)
Called by c_bands:
init_us_2    :    26.53s CPU     27.06s WALL (    1924 calls)
cegterg      :  2422.03s CPU   2451.72s WALL
most of the time (roughly 85%, from the entries shown) is not accounted for.
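To see where it goes, one can sum everything listed under "Called by c_bands:"
in the full log and compare with the c_bands total. A rough sketch of that
bookkeeping (assuming the usual "name : X.XXs CPU Y.YYs WALL" lines of the
final pw.x timing report; the script name is just an example):

# qe_cbands_time.py - how much of c_bands is covered by its listed callees?
# Quick-and-dirty parser for the final timing report of a pw.x output file.
import re
import sys

pat = re.compile(r"^\s*(\S+)\s*:\s*([\d.]+)s CPU\s+([\d.]+)s WALL")

wall = {}        # routine name -> WALL-clock seconds
callees = []     # routines listed under "Called by c_bands:"
in_block = False

with open(sys.argv[1]) as f:
    for line in f:
        if line.strip().startswith("Called by"):
            in_block = "c_bands" in line
            continue
        m = pat.match(line)
        if m is None:
            if not line.strip():     # a blank line closes a "Called by" block
                in_block = False
            continue
        wall[m.group(1)] = float(m.group(3))
        if in_block:
            callees.append(m.group(1))

total = wall["c_bands"]
covered = sum(wall[c] for c in callees)
print(f"c_bands: {total:.1f}s WALL, listed callees: {covered:.1f}s "
      f"({100 * covered / total:.0f}%), unaccounted: {100 * (1 - covered / total):.0f}%")

Run it as "python qe_cbands_time.py <pw.x output file>". Whatever is not
covered by the listed callees is spent elsewhere in c_bands (or in waits), and
that is what looks anomalous in your 4*32 run.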
Paolo
On Fri, Feb 12, 2021 at 5:01 AM Christoph Wolf <wolf.christoph at qns.science>
wrote:
> Dear all,
>
> I tested k-point parallelization and I wonder whether the following results
> are normal or whether my cluster has some serious problems...
>
> the system has 74 atoms and a 2x2x1 k-point grid resulting in 4 k-points
>
> number of k points= 4 Fermi-Dirac smearing, width (Ry)= 0.0050
> cart. coord. in units 2pi/alat
> k( 1) = ( 0.0000000 0.0000000 0.0000000), wk = 0.2500000
> k( 2) = ( 0.3535534 -0.3535534 0.0000000), wk = 0.2500000
> k( 3) = ( 0.0000000 -0.7071068 0.0000000), wk = 0.2500000
> k( 4) = ( -0.3535534 -0.3535534 0.0000000), wk = 0.2500000
>
>
> 1) run on 1 node x 32 CPUs with -nk 4
> Parallel version (MPI), running on 32 processors
>
> MPI processes distributed on 1 nodes
> K-points division: npool = 4
> R & G space division: proc/nbgrp/npool/nimage = 8
> Fft bands division: nmany = 1
>
> PWSCF : 5h42m CPU 6h 3m WALL
>
>
> 2) run on 4 nodes x 32 CPUs with -nk 4
> Parallel version (MPI), running on 128 processors
>
> MPI processes distributed on 4 nodes
> K-points division: npool = 4
> R & G space division: proc/nbgrp/npool/nimage = 32
> Fft bands division: nmany = 1
>
> PWSCF : 6h32m CPU 6h36m WALL
>
> I compiled pw.x with Intel 19 (MKL, MPI and OpenMP). If I understand
> correctly, -nk parallelization should work well because there is not much
> communication between the nodes, but this does not seem to be the case for me at all...
> detailed timing logs are attached!
>
> TIA!
> Chris
>
> --
> IBS Center for Quantum Nanoscience
> Seoul, South Korea
>
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222