[Pw_forum] Use of pool

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Tue Feb 24 16:20:11 CET 2009


On Tue, 24 Feb 2009, Gabriele Sclauzero wrote:

dear gabriele,

GS> Contrary to what Axel usually says (but my experience is far more 
GS> limited than his, so he is in a position to correct me) I do not 
GS> believe that increasing the number of pools always gives better 
GS> timing for a given system. It depends on the system you are 
GS> computing (of course...): how big your supercell is, how many 
GS> electrons, how many k-points, how many bands...

there are _no_ absolute truths in benchmarking and optimizing performance. 
remember the saying: there are lies, damn lies, and benchmarks. ;)
my experience is mostly based on using nodes with two single-core cpus.
the recent change to multi- (and soon many-) core cpus with large caches
but low memory bandwidth will change "the rules".

i still remember the case of some benchmarks from eduardo, where the
threading in MKL (and using no MPI) was more efficient than MPI with
an unthreaded MKL.

it will become more important to minimize memory and communication 
bandwidth requirements and to control the channels of communication 
carefully. 

also, the algorithm that is more efficient in theory may not be the
best solution in practice, if it doesn't parallelize well or scales
badly with system size.

to give an example: at the last ICTP HPC school a few participants 
ran some tests on multi-threading a simple classical MD code. 
here is some pseudocode (with f/force being vectors in x,y,z).

do i=1,n-1
   do j=i+1,n                   ! each pair (i,j) is visited only once
     force = compute_force(i,j)
     f(i)  = f(i) + force
     f(j)  = f(j) - force       ! newton's third law: reuse the pair force
   end do
end do

do i=1,n                        ! every pair is computed twice here;
   do j=1,n                     ! compute_force(i,i) is assumed to be zero
     f(i) = f(i) + compute_force(i,j)
   end do
end do

the first version should be twice as efficient, but once you use
threading, you run into problems with load imbalance (the amount of
work in the inner loop changes with i) and with concurrent writes
to f(j), which require either mutexes or per-thread caching of
intermediate results and thus extra computation.

once we went to more than 8 threads, the second version won, because
it has less overhead and you can use static scheduling (which improves
data locality and thus cache efficiency), despite it being in theory
much less efficient.

of course there are even better ways to handle this,
but it is hopefully a simple enough example to see
my point.
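
to make this concrete, here is a minimal openmp sketch of the second
version (this is not the code from the school, just an illustration;
the subroutine name, array layout and the toy pair_force function are
made up for the example):

! openmp sketch: with schedule(static) each thread owns a fixed block
! of i values and only ever writes to its own part of f, so no locking
! or per-thread force buffers are needed.
subroutine total_forces(n, x, f)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: x(3,n)
  real(8), intent(out) :: f(3,n)
  integer :: i, j

  !$omp parallel do schedule(static) private(j)
  do i = 1, n
     f(:,i) = 0.0d0
     do j = 1, n
        if (j /= i) f(:,i) = f(:,i) + pair_force(x(:,i), x(:,j))
     end do
  end do
  !$omp end parallel do

contains

  pure function pair_force(xi, xj) result(fij)
    ! toy pair interaction, only there to keep the sketch self-contained
    real(8), intent(in) :: xi(3), xj(3)
    real(8) :: fij(3)
    fij = xj - xi
  end function pair_force

end subroutine total_forces

the static schedule means every thread keeps touching the same
contiguous chunk of f and x, which is where the data locality and
cache efficiency mentioned above come from. the first version would
instead need atomic updates or per-thread copies of f to avoid the
races on f(j).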

GS> My experience is that, when working with big supercells, it is 
GS> better to use one pool until you have decent scaling and THEN start 
GS> using more pools. If you have a smaller cell (and consequently many 
GS> more k-points) and not many electrons (like a slab geometry with 
GS> small periodicity), then using pools would be more beneficial.

let's put it this way: using pools should help the most when your
scaling is determined by communication. with pw.x you always have to 
consider a second important contribution to performance: disk i/o.
one easy way to check this would be to run an equivalently sized input
in cp.x, which does not write to the disk until a restart is written.
with a multi/many core cpu node, concurrent disk access can be a
big problem. using pools makes it worse, since you need more memory
and this reduces the amount of memory available to the disk cache.
experimenting with disk_io settings (e.g. 'low' or 'none') using
a non-NFS scratch partition can have a significant impact. 
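
for example, something along these lines in the &control namelist of
the pw.x input (prefix and outdir values here are just placeholders;
the point is that outdir should sit on a node-local, non-NFS scratch
partition):

&control
   calculation = 'scf'
   prefix      = 'test'
   outdir      = '/scratch/test'
   disk_io     = 'low'
/

setting disk_io = 'none' goes one step further and, as far as i
remember, keeps the wavefunctions in memory, so watch the per-pool
memory footprint when you combine it with many pools.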

the rule of thumb that "more pools are almost always better" is also
based on the fact that using multiple nodes means multiple independent
scratch partitions, which is essentially equivalent to a raid-0 scenario
(actually even better, as you have independent i/o busses, too). this is
no longer true on a single node with quad-cores.

a final remark on AMD cpus: those need special care to get maximum
performance. since you have a NUMA architecture, where effectively each
cpu has its _own_ memory that can be "lent" to the other, it is very
important for good performance to enable both memory and processor
affinity. but again, that also has implications for how much memory is
available to jobs. ...and in some cases it may be even more efficient
on quad-core cpus to not use all cores, but only half of them.
this is especially true on the intel core2 architecture, where pairs of
cores share an L2 cache: with an overloaded memory bus, you gain more
from doubling the cache per process than from doubling the number of
(local) cores.

HTH,
   axel.

GS> However these timings may also depend on the configuration of your 
GS> machine, i.e. how well optimized the algebra and fft libraries are 
GS> and how fast the communication is.
GS> 
GS> To have a more precise idea of what's going on you should have a 
GS> detailed look at the timings at the end of the pw output.
GS> 
GS> Last thing, you should be careful that by increasing the number of 
GS> pools you're increasing the memory requirement on your node (as well 
GS> as the memory traffic in RAM and caches, I suppose), and that may 
GS> also be a severe bottleneck for performance.
GS> 
GS> Regards
GS> 
GS> GS
GS> 
GS> Huiqun Zhou wrote:
GS> > Dear list users:
GS> > 
GS> > I happened to test the duration of calculating the system I'm 
GS> > investigating against the number of pools used. There are 36
GS> > k-points in total. But the results surprised me quite a lot. 
GS> > 
GS> > no pool:  6m21.02s CPU time,     6m45.88s wall time
GS> > 2 pools:  7m19.39s CPU time,     7m38.99s wall time
GS> > 4 pools: 11m59.09s CPU time,    12m14.66s wall time
GS> > 8 pools: 21m28.77s CPU time,    21m38.71s wall time
GS> > 
GS> > The machine I'm using is an AMD box with 2 quad core shanghai.
GS> > 
GS> > Is my understanding of usage of pool wrong?
GS> > 
GS> > Huiqun Zhou
GS> > @Nanjing University, China
GS> > _______________________________________________
GS> > Pw_forum mailing list
GS> > Pw_forum at pwscf.org
GS> > http://www.democritos.it/mailman/listinfo/pw_forum
GS> > 
GS> 
GS> 

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.


