[Pw_forum] why k point parallelization -npool is so slow?
Paolo Giannozzi
p.giannozzi at gmail.com
Wed Sep 20 09:10:54 CEST 2017
It seems to me that scaling is quite good up to 16-20 processors for
plane-wave parallelization; it is not easy to obtain better results. The
effectiveness of k-point parallelization depends a lot on how much the
k-point-independent parts of the code weigh on the overall performance. In
this specific case, k-point parallelization is not as good as it could be.
Improving it requires working on the code.
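For example (the task counts and pool choice below are purely illustrative,
not tested on your input): with 10 k points, the two levels can be combined
so that each pool still keeps a reasonable number of tasks for the plane-wave
distribution, e.g. 2 pools of 10 plane-wave tasks each:

mpiexec.hydra -n 20 pw.x -npool 2 -in scf.in > scf.out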
Paolo
On Mon, Sep 18, 2017 at 12:02 PM, balabi <balabi at qq.com> wrote:
> Dear Paolo,
> Thank you so much for your reply.
> Sorry that my previous post was unclear. I will try to make my question
> clearer in this post.
> At the end of this post, I attached my scf.in file.
>
> First, I ran the scf calculation with different numbers of MPI processes, like this:
> mpiexec.hydra -n ${mpinum} pw.x -in scf.in > scf.out
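>
> For reference, a minimal shell loop for these runs (the per-run output
> names scf_${mpinum}.out are only illustrative; every run uses the same
> scf.in):
>
> for mpinum in 1 4 8 12 16 20 24 25 28 32; do
>     mpiexec.hydra -n ${mpinum} pw.x -in scf.in > scf_${mpinum}.out
> done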
>
> Then I collected the timing line at the end of scf.out for each number of
> MPI processes:
>
> 1-> PWSCF : 3m47.62s CPU 3m54.05s WALL
> 4-> PWSCF : 56.51s CPU 57.83s WALL
> 8-> PWSCF : 31.30s CPU 32.78s WALL
> 12-> PWSCF : 24.21s CPU 25.06s WALL
> 16-> PWSCF : 17.67s CPU 18.60s WALL
> 20-> PWSCF : 14.03s CPU 15.26s WALL
> 24-> PWSCF : 13.53s CPU 14.44s WALL
> 25-> PWSCF : 12.13s CPU 14.05s WALL
> 28-> PWSCF : 11.80s CPU 12.69s WALL
> 32-> PWSCF : 13.45s CPU 16.12s WALL
>
> The plot of CPU time vs. number of MPI processes is here:
> https://pasteboard.co/GKUXhL4.png
> Then I define total CPU time = CPU time x number of MPI processes; for
> example, for the 32-process run, the total CPU time is 32 x 13.45 s = 430.4 s.
> The plot of total CPU time vs. number of MPI processes is here:
> https://pasteboard.co/GKUYkD4.png
> We can see that the scaling is not good. With perfect linear scaling, the
> total CPU time curve should be a horizontal line, am I right?
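>
> As a worked number (using the CPU times from the table above), the parallel
> efficiency can be written as
>
>   efficiency(N) = T(1) / ( N x T(N) )
>
> so for N = 32: 227.62 s / (32 x 13.45 s) = 227.62 / 430.4 ~ 0.53, i.e. only
> about 53% efficiency, whereas perfect linear scaling would keep the total
> CPU time flat at T(1) = 227.62 s and the efficiency at 1.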
>
> So I thought that adding k-point parallelization might give better scaling.
> Since there are 10 k points, I tried the three cases below:
>
> mpiexec.hydra -n 30 pw.x -npool 2 -in scf.in > scf.out
> mpiexec.hydra -n 30 pw.x -npool 5 -in scf.in > scf.out
> mpiexec.hydra -n 30 pw.x -npool 10 -in scf.in > scf.out
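>
> For reference, with 30 MPI processes these choices correspond to 30/2 = 15,
> 30/5 = 6, and 30/10 = 3 plane-wave tasks per pool, respectively (the number
> of k points used by pw.x can be checked in scf.out, e.g. with something like
> grep 'number of k points' scf.out).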
>
> The timing results are:
> -npool 2 -> PWSCF : 14.89s CPU 15.88s WALL
> -npool 5 -> PWSCF : 27.45s CPU 28.95s WALL
> -npool 10 -> PWSCF : 0m53.52s CPU 1m 8.13s WALL
>
> Clearly, the scaling is much worse with npool parallelization. So what is
> wrong?
>
> best regards
>
> -----------------
> Below is the scf.in file:
>
> &CONTROL
> prefix='bi2se3_mpi',
> calculation='scf',
> restart_mode='from_scratch',
> wf_collect=.true.,
> verbosity='high',
> tstress=.true.,
> tprnfor=.true.,
> forc_conv_thr=1d-4,
> outdir='./qe_tmpdir',
> pseudo_dir = './pseudo',
> /
> &SYSTEM
> ibrav = 5,
> celldm(1)=18.59579532204d0,celldm(4)=0.9113725833268d0,
> nat = 5,ntyp = 3,
> ecutwfc = 40,ecutrho = 433,
> /
> &ELECTRONS
> conv_thr = 1.0d-10,
> /
> &IONS
> /
> &CELL
> press_conv_thr=0.1d0
> cell_dofree='all',
> /
> ATOMIC_SPECIES
> Bi 208.98040 Bi.pbe-dn-kjpaw_psl.0.2.2.UPF
> Se1 78.971 Se.pbe-n-kjpaw_psl.0.2.UPF
> Se2 78.971 Se.pbe-n-kjpaw_psl.0.2.UPF
> ATOMIC_POSITIONS crystal
> Bi 0.4008d0 0.4008d0 0.4008d0
> Bi 0.5992d0 0.5992d0 0.5992d0
> Se2 0.2117d0 0.2117d0 0.2117d0
> Se2 0.7883d0 0.7883d0 0.7883d0
> Se1 0.d0 0.d0 0.d0
> K_POINTS automatic
> 4 4 4 1 1 1
>
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222