[Pw_forum] Use of pool

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Wed Mar 4 03:51:22 CET 2009

On Tue, Feb 24, 2009 at 1:45 AM, Huiqun Zhou <hqzhou at nju.edu.cn> wrote:
> Dear list users:

hi all,

> I happened to test duration times of calculating the system I'm
> investigating against number of pools used. There are totally
> 36 k points. But the results surprised me quite a lot.
> no pool:  6m21.02s CPU time,     6m45.88s wall time
> 2 pools:  7m19.39s CPU time,     7m38.99s wall time
> 4 pools: 11m59.09s CPU time,    12m14.66s wall time
> 8 pools: 21m28.77s CPU time,    21m38.71s wall time
> The machine I'm using is an AMD box with 2 quad core shanghai.
> Is my understanding of usage of pool wrong?

sorry for replying to an old mail in this thread, but it has the
proper times to compare to. the input you sent me does not seem to be
exactly the same as the one you used for the benchmarks (it is a bit
larger), but i reduced the number of k-points to 36 and have some
numbers here. this is on dual intel quad-core E5430 @ 2.66GHz cpus
with 8GB of DDR2 ram. i also modified the input to set wfcdir to the
local scratch rather than my working directory (which is on an NFS
server), and tested with disk_io='high' and disk_io='low'.
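for reference, the two &CONTROL variables i changed look roughly like
this (the scratch path is a placeholder, not my actual directory):

```
&CONTROL
  ...
  disk_io = 'low'                ! or 'high' for comparison
  wfcdir  = '/scratch/username'  ! node-local scratch instead of NFS
/
```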
on a single node (always with 8 MPI tasks) i get:

1node-1pools-high.out:     PWSCF        : 18m55.62s CPU time,    26m 7.20s wall time
1node-2pools-high.out:     PWSCF        : 14m46.03s CPU time,    18m 0.26s wall time
1node-4pools-high.out:     PWSCF        : 14m 5.27s CPU time,    16m44.03s wall time
1node-8pools-high.out:     PWSCF        : 32m29.71s CPU time,    35m 0.35s wall time

1node-1pools-low.out:      PWSCF        : 18m36.88s CPU time,    19m24.71s wall time
1node-2pools-low.out:      PWSCF        : 15m 0.98s CPU time,    15m42.56s wall time
1node-4pools-low.out:      PWSCF        : 14m 6.97s CPU time,    14m55.57s wall time
1node-8pools-low.out:      PWSCF        : 31m51.68s CPU time,    32m46.77s wall time

so the result is not quite as drastic, but with 8 pools on one node
the machine is suffering. one can also see that disk_io='low' helps to
reduce the waiting time (disk_io='high' still writes files into the
working directory, which is on slow NFS). so for my machine it looks
as if 4 pools is the optimal compromise. to further investigate
whether pool or g-space parallelization is more efficient, i then ran
the same job across multiple nodes, using only 4 cores per node,
i.e. the total number of mpi tasks is still 8.
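before looking at the numbers, a quick sketch (mine, not from pw.x
itself) of the bookkeeping: pw.x splits the mpi tasks evenly among the
pools, and the g-space/planewave parallelization then happens inside
each pool, so the two levels trade off against each other.

```python
# split ntasks MPI tasks among npool pools (pw.x-style bookkeeping sketch);
# g-space work is shared only by the tasks inside one pool
def tasks_per_pool(ntasks, npool):
    assert ntasks % npool == 0, "number of tasks must be divisible by npool"
    return ntasks // npool

# the 8-task runs above and below:
for npool in (1, 2, 4, 8):
    print(f"{npool} pool(s) -> {tasks_per_pool(8, npool)} task(s) per pool")
```

with 8 pools every pool is a single task, so there is no g-space
parallelization left and each task works on full-size arrays; that is
consistent with the 8-pool slowdown in the tables.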

2node-1pools-high.out:     PWSCF        : 12m 0.88s CPU time,    17m42.01s wall time
2node-2pools-high.out:     PWSCF        :  8m42.96s CPU time,    11m44.88s wall time
2node-4pools-high.out:     PWSCF        :  6m26.72s CPU time,     8m54.83s wall time
2node-8pools-high.out:     PWSCF        : 12m47.61s CPU time,    15m18.67s wall time

2node-1pools-low.out:      PWSCF        : 10m53.87s CPU time,    11m35.94s wall time
2node-2pools-low.out:      PWSCF        :  8m37.37s CPU time,     9m23.17s wall time
2node-4pools-low.out:      PWSCF        :  6m22.87s CPU time,     7m11.22s wall time
2node-8pools-low.out:      PWSCF        : 13m 7.30s CPU time,    13m57.71s wall time

in the next test, i doubled the number of nodes again, but this time
kept 4 mpi tasks per node; also, from here on i'm only using
disk_io='low'.

4node-4pools-low.out:      PWSCF        :  4m52.92s CPU time,     5m38.90s wall time
4node-8pools-low.out:      PWSCF        :  4m29.73s CPU time,     5m17.86s wall time

interesting: now the striking difference between 4 pools and 8 pools
is gone. since i doubled the number of nodes, the memory consumption
per mpi task in the 8-pool case should have dropped to a similar level
as in the 4-pool case on 2 nodes. to confirm this, let's run the same
job with 16 pools:

4node-16pools-low.out:     PWSCF        : 10m54.57s CPU time,    11m53.59s wall time

bingo! the only explanation for this is cache memory: in this specific
case, up to about "half a wavefunction" of memory consumption per
node, the caching of the cpu is much more effective. so the "more
pools is better" rule has to be augmented by "unless it makes the cpu
cache less efficient".

since 36 k-points is evenly divisible by 6 but not by 8, now a test
with 6 nodes.
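the divisibility point can be made concrete with a small sketch (my
own hypothetical helper, not pw.x output): k-points are dealt out to
the pools as evenly as possible, and the busiest pool sets the pace.

```python
# number of k-points handled by the busiest pool when nk k-points
# are distributed as evenly as possible over npool pools
def kpts_in_busiest_pool(nk, npool):
    base, extra = divmod(nk, npool)
    return base + (1 if extra else 0)

for npool in (4, 6, 8, 12, 24):
    print(f"{npool:2d} pools -> busiest pool has {kpts_in_busiest_pool(36, npool)} k-points")
```

with 8 pools, four pools carry 5 k-points and four carry 4, so half
the pools idle for part of the k-point loop; 6 and 12 divide 36
exactly, while 24 pools leave half the pools with 2 k-points and half
with 1.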

6node-4pools-low.out:      PWSCF        :  3m41.65s CPU time,     4m25.15s wall time
6node-6pools-low.out:      PWSCF        :  3m40.12s CPU time,     4m23.33s wall time
6node-8pools-low.out:      PWSCF        :  3m14.13s CPU time,     3m57.76s wall time
6node-12pools-low.out:     PWSCF        :  3m37.96s CPU time,     4m25.91s wall time
6node-24pools-low.out:     PWSCF        : 10m55.18s CPU time,    11m47.87s wall time

so 6 pools is more efficient than 4, but 8 is even more efficient than
6 or 12, even though the latter two should lead to a better
distribution of the work. so the modified "rule" from above seems to
hold. ok, can we get any faster? ~4 min walltime for a 21-scf-cycle
single point run is already pretty good, and the serial overhead (and
wf_collect=.true.) should kick in. so now with 8 nodes and 32 mpi
tasks.

8node-4pools-low.out:      PWSCF        :  3m22.02s CPU time,     4m 7.06s wall time
8node-8pools-low.out:      PWSCF        :  3m14.52s CPU time,     3m58.86s wall time
8node-16pools-low.out:     PWSCF        :  3m36.18s CPU time,     4m24.21s wall time

hmmm, not much better. but now for the final test: since we have 36
k-points and we need at least two mpi tasks per pool to get good
performance, let's try 18 nodes with 4 mpi tasks each:

18node-9pools-low.out:     PWSCF        :  1m57.06s CPU time,     3m37.31s wall time
18node-18pools-low.out:    PWSCF        :  2m 2.62s CPU time,     2m45.51s wall time
18node-36pools-low.out:    PWSCF        :  2m45.61s CPU time,     3m33.00s wall time
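the arithmetic behind these three 18-node runs, as a small sketch of
my own (not pw.x output):

```python
# 18 nodes x 4 MPI tasks = 72 tasks, 36 k-points; for each pool count,
# compute tasks per pool and k-points in the busiest pool
def layout(ntasks, nk, npool):
    # -(-nk // npool) is ceiling division: k-points in the busiest pool
    return ntasks // npool, -(-nk // npool)

for npool in (9, 18, 36):
    tasks, kpts = layout(72, 36, npool)
    print(f"{npool:2d} pools: {tasks} tasks/pool, {kpts} k-point(s) in the busiest pool")
```

with 36 pools each pool is down to 2 tasks and a single k-point,
i.e. exactly the "at least two tasks per pool" limit mentioned above.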

not spectacular scaling, but still improving. it looks like writing
the final wavefunction costs about 45 seconds or more, as indicated by
the difference between cpu time and walltime.

at this level, you had better not use disk_io='high', as that will put
a _severe_ disk load on the machine that hosts the working directory
(particularly bad for NFS servers): in this case the code will
generate and continuously rewrite 144 files, and the
walltime-to-cputime ratio quickly rises (a factor of 5 in my case, so
i stopped the job before the NFS server died).

in summary, it is obviously getting more complicated to define a
"rule" for what gives the best performance. some experimentation is
always required, and sometimes there will be surprises. i have not
touched on the issue of network speed (all tests were done across a
4xDDR infiniband network).

i hope this little benchmark excursion was as interesting and
thought-provoking for you as it was for me. thanks to everybody who
gave their input to this discussion.


p.s.: perhaps at some point it might be interesting to organize a
workshop on "post-compilation optimization" for pw.x for different
types of jobs and hardware.

> Huiqun Zhou
> @Nanjing University, China
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum

Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
  Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
If you make something idiot-proof, the universe creates a better idiot.
