[Pw_forum] abysmal parallel performance of the CP code

Axel Kohlmeyer axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Thu Sep 22 12:07:20 CEST 2005

On Wed, 21 Sep 2005, Konstantin Kudin wrote:

KK>  Hi,


KK>  I've done some parallel benchmarks for the CP code so I thought I'd
KK> share them with the rest of the group. The system we have is a cluster
KK> of dual Opterons 2.0 GHz with 1Gbit ethernet.

please keep in mind that for reasonable scaling with car-parrinello
MD you usually need a better interconnect than gigabit ethernet.

some time ago i summarized some test results for the CPMD code at:
the general issues apply to the espresso CP codes just as to CPMD.
with gigabit ethernet (or any TCP/IP transport) you suffer most from
the very high latencies. this is especially bad for the all-to-all
communications needed for the parallel FFTs, and it shows already on
dual-cpu nodes; on quad-cpu nodes you should additionally be hitting
the bandwidth limit. how much of this becomes visible depends a lot
on the size of the job. we have a cluster of dual-opteron 246 (2.0GHz)
nodes with two separate gigabit networks (data and MPI) and it usually
does not pay to run jobs across more than 3-4 nodes. even then you
already 'waste' about 20% of your cpu power. the only saving grace is
that a better interconnect will cost much more than one wasted node.
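to see why the latency hurts so much here, a toy alpha-beta cost model
of one all-to-all exchange (the communication pattern behind the
parallel FFT transposes). the latency and bandwidth numbers are
illustrative assumptions for TCP over gigabit, not measurements:

```python
# toy alpha-beta model of one all-to-all exchange.  the assumed numbers
# (~50 microseconds latency, ~100 MB/s bandwidth) are ballpark figures
# for TCP over gigabit ethernet, not measured values.
def alltoall_time(nprocs, total_bytes, latency=50e-6, bandwidth=100e6):
    """each process sends nprocs-1 messages of total_bytes/nprocs bytes."""
    msg = total_bytes / nprocs
    return (nprocs - 1) * (latency + msg / bandwidth)

# the (nprocs-1)*latency term keeps growing while the per-message data
# shrinks, which is why adding nodes stops paying off so quickly.
for p in (2, 4, 8, 16):
    print(f"{p:2d} procs: {alltoall_time(p, 1e6) * 1e3:.2f} ms")
```

note how between 8 and 16 processes the modeled time still goes up
even though each process sends less data.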

KK>  I looked at 2 different measures of time, CPU time, and wall time
KK> computed as the difference between "This run was started" and "This run
KK> was terminated". By the way, such wall time could probably be printed
KK> by the code directly to be readily available.

probably, but you can get the number just as easily by using
the 'time' command to launch the jobs.
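and computing it from the two timestamps the code already prints is a
few lines anyway. a sketch; the timestamps below are made up and the
format string is an assumption, adapt it to what the output actually
prints after "This run was started"/"This run was terminated":

```python
from datetime import datetime

# hypothetical timestamps in the style of the CP output -- adjust the
# strings and the format to your actual "This run was started" /
# "This run was terminated" lines.
started    = "21Sep2005 10:00:12"
terminated = "21Sep2005 11:24:05"
fmt = "%d%b%Y %H:%M:%S"

wall = datetime.strptime(terminated, fmt) - datetime.strptime(started, fmt)
print(wall)  # 1:23:53
```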

KK>  The system is a reasonably sized simulation cell with 20 CP
KK> (electronic+ionic) steps total.
KK>  The compiler is IFC 9.0, GOTO library is for BLAS, and mpich 1.2.6
KK> used for the MPI. The CP version is the CVS from Aug. 20, 2005.
KK>  What is crazy is that even for 2 cpus sitting in the same box there is
KK> lots of cpu time just lost somewhere. The strange thing is that the
KK> quad we have at 2.2 Ghz seems to lose just as much wall time as 2 duals
KK> talking across the network. And note how 4 cpus are barely better than
KK> 2x compared to single cpu performance if the wall clock time is
KK> considered.

please check whether your MPI library uses shared memory communication
properly and whether your kernel supports setting CPU and memory
affinity (and that you actually set it). i have seen numbers where
this makes more than a 20% difference on a dual machine, and i would
expect it to matter even more on quad machines.
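a quick way to check from a script whether any pinning is in effect at
all (linux only; just a diagnostic sketch, run one instance per MPI
rank):

```python
import os

# linux-only: sched_getaffinity reports the set of cpus this process is
# allowed to run on.  if the MPI library or batch system pins ranks
# properly, each rank should report a restricted set, not all cpus of
# the node.
cpus = os.sched_getaffinity(0)
print("allowed cpus:", sorted(cpus))
```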

KK>  I know Nicola Marzari has done some parallel benchmarks, but I do not
KK> think that wall times were being paid attention to ...
KK>  Kostya
KK> P.S. Any suggestions what might be going on here?

you also have to take into account that when you are running a
gamma-point-only calculation, you are missing the most efficient
parallelization (across k-points), the one that lets, e.g., pw.x
run rather efficiently even on 'not so high'-performance networks.
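for what it's worth, the wall times in the table quoted below translate
into the following speedups and efficiencies (a quick python sketch
using the quoted numbers, converted to seconds):

```python
# speedup and parallel efficiency from the benchmark wall times quoted
# in the table below (1h24m, 57m13s, 44m21s, 43m18s, 45m56s).
wall = {1: 84*60, 2: 57*60 + 13, 4: 44*60 + 21, 6: 43*60 + 18, 8: 45*60 + 56}
for n, t in wall.items():
    s = wall[1] / t
    print(f"{n} cpus: speedup {s:.2f}, efficiency {s/n:.0%}")
```

8 cpus come out below 25% efficiency, and 6 cpus are actually faster
than 8 in wall time.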

best regards,

KK> Ncpu	CPU time	Wall time
KK> 1	1h22m		1h24m
KK> 2	45m33.41s	57m13s
KK> 4	27m30.80s	44m21s
KK> 6	18m22.71s	43m18s
KK> 8	14m53.91s	45m56s
KK> 4(quad) 37m18.56s	45m32s
KK> _______________________________________________
KK> Pw_forum mailing list
KK> Pw_forum at pwscf.org
KK> http://www.democritos.it/mailman/listinfo/pw_forum


Dr. Axel Kohlmeyer   e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie          Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53         Fax:   ++49 (0)234/32-14045
D-44780 Bochum  http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/
If you make something idiot-proof, the universe creates a better idiot.
