[Pw_forum] abysmal parallel performance of the CP code
axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Thu Sep 22 12:07:20 CEST 2005
On Wed, 21 Sep 2005, Konstantin Kudin wrote:
KK> I've done some parallel benchmarks for the CP code, so I thought I'd
KK> share them with the rest of the group. The system we have is a cluster
KK> of dual Opterons at 2.0 GHz with 1 Gbit ethernet.
Please keep in mind that for reasonable scaling with Car-Parrinello
MD you usually need a better interconnect than gigabit ethernet.
Some time ago I summarized some test results for the CPMD code;
the general issues apply to the espresso CP codes as well as to CPMD.
With gigabit (or TCP/IP in general) you suffer most from the very high
latencies. This is especially bad for the all-to-all communications that
are needed for the FFTs, and is more visible on dual-CPU nodes; for
quad-CPU nodes you should also be hitting the bandwidth limit. How much
this shows up depends a lot on the size of the job. We have a cluster of
dual Opteron 246 (2.0 GHz) nodes with two gigabit networks (data and MPI
separated), and it usually does not pay to run jobs across more than
3-4 nodes; even then you already 'waste' about 20% of your CPU power.
The only saving grace is that a better interconnect will cost much more
than one wasted node.
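To see why latency dominates here, a toy model of the all-to-all exchange used in the parallel FFT helps; the latency and bandwidth figures below are rough, assumed values for illustration, not measurements:

```python
# Toy model: in an all-to-all among P ranks, every rank sends to every
# other, so the per-step cost scales like P*(P-1) messages, each costing
# roughly (latency + message_size / bandwidth).
def alltoall_time(nproc, bytes_per_msg, latency_s, bandwidth_Bps):
    msgs = nproc * (nproc - 1)  # every rank sends to every other rank
    return msgs * (latency_s + bytes_per_msg / bandwidth_Bps)

small = 8 * 1024  # assumed 8 kB per message, typical for small FFT slabs
gige = alltoall_time(8, small, 50e-6, 100e6)  # ~TCP/gigabit figures
fast = alltoall_time(8, small, 5e-6, 800e6)   # ~low-latency fabric
print(f"gigabit: {gige * 1e3:.2f} ms   fast interconnect: {fast * 1e3:.2f} ms")
```

For small messages the latency term dominates, which is why adding nodes on gigabit stops paying off so quickly.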
KK> I looked at 2 different measures of time, CPU time, and wall time
KK> computed as the difference between "This run was started" and "This run
KK> was terminated". By the way, such wall time could probably be printed
KK> by the code directly to be readily available.
Probably, but you can get the number just as easily by starting
the jobs with the 'time' command.
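The same wall-vs-CPU distinction can be sketched directly in Python (this is an illustration of the two clocks, not part of the CP code):

```python
import time

# Wall time vs CPU time for this process: a large gap between the two
# means time spent waiting (e.g. on MPI communication) rather than
# computing -- exactly the loss visible in the benchmark table.
wall_start = time.perf_counter()
cpu_start = time.process_time()
sum(i * i for i in range(10**6))  # stand-in for real work
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall {wall:.3f}s  cpu {cpu:.3f}s")
```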
KK> The system is a reasonably sized simulation cell with 20 CP
KK> (electronic+ionic) steps total.
KK> The compiler is IFC 9.0, the GOTO library is used for BLAS, and mpich
KK> 1.2.6 for the MPI. The CP version is the CVS from Aug. 20, 2005.
KK> What is crazy is that even for 2 CPUs sitting in the same box there is
KK> lots of CPU time just lost somewhere. The strange thing is that the
KK> quad we have at 2.2 GHz seems to lose just as much wall time as 2 duals
KK> talking across the network. And note how 4 CPUs are barely better than
KK> 2x compared to single-CPU performance if the wall clock time is considered.
Please check whether your MPI library uses shared-memory
communication properly, and that your kernel supports setting
CPU and memory affinity (and that you actually set it). I have seen
numbers where this makes over a 20% difference on a dual machine, and
I would expect it to matter even more on quad machines.
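As a Linux-only sketch of what "setting affinity" means (this is what tools like taskset or numactl, or the MPI launcher itself, would do for each rank):

```python
import os

# Pin this process to CPU 0, then restore the original mask.
# os.sched_setaffinity is Linux-specific.
original = os.sched_getaffinity(0)   # remember the original CPU mask
os.sched_setaffinity(0, {0})         # pin this process to CPU 0
pinned = os.sched_getaffinity(0)
print("pinned to:", pinned)
os.sched_setaffinity(0, original)    # restore the original mask
```

Without pinning, ranks can migrate between CPUs and lose cache and local-memory locality, which is where that 20% can go.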
KK> I know Nicola Marzari has done some parallel benchmarks, but I do not
KK> think that wall times were being paid attention to ...
KK> P.S. Any suggestions what might be going on here?
You also have to take into account that when you are running
a gamma-point-only calculation, you are missing the most efficient
parallelization (across k-points), which is what lets, e.g., pw.x run
rather efficiently on 'not so high'-performance networks.
KK> Ncpu      CPU time    Wall time
KK> 1         1h22m       1h24m
KK> 2         45m33.41s   57m13s
KK> 4         27m30.80s   44m21s
KK> 6         18m22.71s   43m18s
KK> 8         14m53.91s   45m56s
KK> 4 (quad)  37m18.56s   45m32s
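To put numbers on the 'wasted' CPU power: speedup and parallel efficiency computed from the wall times posted above (just arithmetic on the table; the quad entry is left out since it duplicates the 4-CPU count):

```python
# Wall times from the benchmark table, converted to minutes.
walls = {1: 84.0, 2: 57 + 13/60, 4: 44 + 21/60, 6: 43 + 18/60, 8: 45 + 56/60}
base = walls[1]
for ncpu, wall in walls.items():
    s = base / wall  # speedup relative to the single-CPU run
    print(f"{ncpu:2d} cpus: speedup {s:.2f}x, efficiency {s / ncpu:.0%}")
```

The efficiency drops below 50% already at 4 CPUs, which matches the picture of a latency-bound interconnect.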
Dr. Axel Kohlmeyer e-mail: axel.kohlmeyer at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie Phone: ++49 (0)234/32-26673
Ruhr-Universitaet Bochum - NC 03/53 Fax: ++49 (0)234/32-14045
D-44780 Bochum http://www.theochem.ruhr-uni-bochum.de/~axel.kohlmeyer/
If you make something idiot-proof, the universe creates a better idiot.