[Pw_forum] timing vs QE version vs pools
Giovanni Cantele
Giovanni.Cantele at na.infn.it
Fri Mar 6 18:15:26 CET 2009
Dear all,

I just finished some timing tests on QE, stimulated by recent discussions in this forum. I would like to share these tests and ask a couple of simple (?) questions.

Up to now, even if not supported by systematic tests, my feeling has been that, for the kind of runs I'm doing on the kind of cluster(s) I'm using, the following rules hold:
i)   if using a relatively "small" number of CPUs, pools are rather ineffective; I often found that using pools actually increased the job execution time
ii)  diagonalization parallelism (-ndiag XX) is ineffective as well
iii) version 3.2.3 is "better" than 4.0.4

Bearing in mind that I well understand that this kind of issue is often related to the well-known problem arising between the keyboard and the chair, these are my results:
- system: graphene, 4x4 unit cell (only C atoms)

      ibrav     = 4
      celldm(1) = 18.6
      celldm(3) = 1.36250
      nat       = 32
      ntyp      = 1
      ecutwfc   = 30.D0
      ecutrho   = 180.D0
      nosym     = .true.

  FFT grid:    ( 80, 80, 120)
  smooth grid: ( 72, 72,  90)
- computational resources: cluster with dual-core two-processor nodes
each processor is an Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
Infiniband connection between nodes
compiler: ifort (INTEL) 10.1.012
libraries: mkl 10.0.1.014
My tests concerned all possible combinations of the following parameters:
- # k-points: 32 or 64
- QE version: 3.2.3 or 4.0.4
- ndiag: default or 1 (only for 4.0.4)
- ncpus: 16 or 32
- npools: 1, 2, 4, 8 (in just a few cases also 16)
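For reference, each combination was run by passing the parallelization flags on the pw.x command line, along the lines of the sketch below. The MPI launcher syntax and the file names are of course just placeholders depending on the local setup; this is only meant to show how the flags were combined:

   # 16 CPUs, 2 pools, default diagonalization (file names are hypothetical)
   mpirun -np 16 pw.x -npool 2 < graphene_4x4.in > graphene_4x4_16cpu_npool2.out

   # 32 CPUs, 4 pools, serial subspace diagonalization
   mpirun -np 32 pw.x -npool 4 -ndiag 1 < graphene_4x4.in > graphene_4x4_32cpu_npool4_ndiag1.out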
====================================================================
                           64 k points
====================================================================
 npools        CPU time       wall time
--------------------------------------------------------------------
 ncpus: 16   nodes: 4   QE 3.2.3
    1          8m49.79s        9m 5.89s
    2          9m45.15s        9m59.12s
    4         12m59.18s       13m14.34s
    8         16m 2.94s       16m44.26s
--------------------------------------------------------------------
 ncpus: 16   nodes: 4   QE 4.0.4
    1         10m51.52s       11m15.80s
    2         10m20.63s       10m43.21s
    4         13m21.63s       13m45.10s
    8         15m59.51s       16m54.57s
  with -ndiag 1:
    1          9m26.57s        9m41.97s
    2         10m22.68s       10m37.74s
    4         13m 6.11s       13m25.57s
    8         16m 3.49s       17m 3.31s
--------------------------------------------------------------------
 ncpus: 32   nodes: 8   QE 3.2.3
    1         15m36.93s       15m42.65s
    2          5m30.32s        5m37.54s
    4          5m 7.28s        5m14.20s
    8          6m54.27s        7m 1.82s
   16          9m 6.85s        9m28.26s
--------------------------------------------------------------------
 ncpus: 32   nodes: 8   QE 4.0.4
    1         17m38.71s       17m57.28s
    2          9m 7.36s        9m21.41s
    4          5m31.61s        5m43.00s
    8          7m 1.58s        7m13.86s
   16          9m 3.48s        9m33.50s
  with -ndiag 1:
    1         16m 9.52s       16m18.73s
    2          5m49.49s        5m58.75s
    4          5m21.05s        5m29.89s
    8          6m50.59s        7m 1.20s
   16          8m58.71s        9m29.50s
====================================================================
====================================================================
                           32 k points
====================================================================
 npools        CPU time       wall time
--------------------------------------------------------------------
 ncpus: 16   nodes: 4   QE 3.2.3
    1          4m44.63s        4m53.08s
    2          5m11.63s        5m19.66s
    4          7m 8.90s        7m17.45s
    8          8m56.42s        9m18.75s
--------------------------------------------------------------------
 ncpus: 16   nodes: 4   QE 4.0.4
    1          5m54.59s        6m 8.06s
    2          5m46.40s        5m56.81s
    4          7m15.19s        7m28.52s
    8          8m58.10s        9m26.16s
  with -ndiag 1:
    1          5m 3.20s        5m11.53s
    2          5m13.45s        5m23.68s
    4          7m 2.97s        7m14.12s
    8          8m55.15s        9m24.76s
--------------------------------------------------------------------
 ncpus: 32   nodes: 8   QE 3.2.3
    1          3m24.75s        3m28.51s
    2          2m34.29s        2m38.63s
    4          2m57.55s        3m 2.09s
    8          4m14.11s        4m19.09s
--------------------------------------------------------------------
 ncpus: 32   nodes: 8   QE 4.0.4
    1          4m36.23s        4m46.02s
    2          3m14.38s        3m20.97s
    4          3m18.11s        3m23.60s
    8          4m26.24s        4m32.57s
  with -ndiag 1:
    1          3m31.81s        3m37.10s
    2          2m45.79s        2m50.33s
    4          3m 2.72s        3m 7.82s
    8          4m18.45s        4m24.00s
====================================================================
comments:
- using pools is, in this case, effective only with ncpus=32. I attribute this to the FFT being the true bottleneck, so that the FFT grid parallelization should be satisfied first. However, the FFT parallelization is still effective going from 16 to 32 CPUs (maybe due to conflicts among cores on the same node when too many of that node's cores are used with ncpus=16), while pool parallelization is effective only with -npool 2 (a rough counting is sketched after these comments)
- it seems that QE 3.2.3 always performs a little bit better than 4.0.4: any hint on what (if anything) is wrong in what I'm doing?
- it seems that -ndiag 1 (serial algorithm for the iterative solution of the eigenvalue problem) always performs a little bit better than the code's default choice. I attribute this to the fact that parallel diagonalization should make a difference only for a VERY LARGE number of electrons: is that right? Note that the code default would switch on this kind of parallelization also for these runs.
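Just to make the first and third comments a bit more quantitative, this is the rough counting I have in mind (as far as I understand how pw.x distributes the work within a pool, so please correct me if I'm wrong):

   dense FFT grid: nr3 = 120 planes, distributed over the processors of each pool
     32 CPUs, -npool 1  ->  120 / 32 procs  =  3-4 planes per process
     32 CPUs, -npool 2  ->  120 / 16 procs  =  7-8 planes per process (32 or 16 k points per pool)
     32 CPUs, -npool 8  ->  120 /  4 procs  =  30  planes per process (only 8 or 4 k points per pool)
   electrons: 32 C atoms x 4 valence electrons = 128  ->  of the order of 64 occupied bands,
     so the subspace matrices handled by -ndiag have a linear dimension of a few hundred at most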
Thanks a lot for reading this message...
Giovanni
--
Dr. Giovanni Cantele
Coherentia CNR-INFM and Dipartimento di Scienze Fisiche
Universita' di Napoli "Federico II"
Complesso Universitario di Monte S. Angelo - Ed. 6
Via Cintia, I-80126, Napoli, Italy
Phone: +39 081 676910
Fax: +39 081 676346
E-mail: giovanni.cantele at cnr.it
giovanni.cantele at na.infn.it
Web: http://people.na.infn.it/~cantele
Research Group: http://www.nanomat.unina.it