[Pw_forum] abysmal parallel performance of the CP code
konstantin_kudin at yahoo.com
Thu Sep 22 00:51:55 CEST 2005
Indeed, with a busy network, writing restart files does take a while;
even on a single cpu there are a couple of extra minutes in the wall time.
However, this is nothing like the difference on 2 or more cpus.
We tracked down the slow speed on the quad to NUMA being turned
off, in contrast to the duals, so those differences are explainable by the ...
Still, an interesting question is why CP is such a dog already on 2
cpus, as seen in the huge difference between the wall and cpu times. 2
cpus in a box never talk across the network except for the occasional ...
I am curious if other clusters behave in the same way as far as wall
vs. cpu times are concerned.
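A minimal Python sketch of the two measures being compared (not how CP itself times a run): `time.process_time()` counts only the CPU time used by the process, while `time.perf_counter()` follows the wall clock, so any interval spent blocked widens the gap between them. `run_step` and `timed_run` are hypothetical names, and `time.sleep()` is only an assumed stand-in for waiting on disk or network.

```python
import time


def run_step(compute_iters: int, wait_s: float) -> float:
    """Hypothetical stand-in for one MD step: some arithmetic, then a blocking wait."""
    acc = 0.0
    for i in range(compute_iters):
        acc += i * 0.5  # busy work: counted as CPU time
    time.sleep(wait_s)  # blocked (assumed stand-in for I/O or messages): wall time only
    return acc


def timed_run(n_steps: int, compute_iters: int, wait_s: float) -> tuple[float, float]:
    """Return (cpu_seconds, wall_seconds) spent in n_steps calls to run_step."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    for _ in range(n_steps):
        run_step(compute_iters, wait_s)
    return time.process_time() - cpu0, time.perf_counter() - wall0


if __name__ == "__main__":
    cpu, wall = timed_run(n_steps=20, compute_iters=100_000, wait_s=0.05)
    print(f"CPU time {cpu:.2f} s, wall time {wall:.2f} s, "
          f"not on CPU {100 * (1 - cpu / wall):.0f}%")
```

Printing both numbers from inside the program avoids reconstructing the wall time from the start and end timestamps afterwards.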
--- Silviu Zilberman <silviu at Princeton.EDU> wrote:
> Hi Kostya,
> I am not sure if that's the case, but I also noticed similar problems in
> the past. My impression then was that some of the difference is due to
> the fact that the reported wall time includes the time it takes to read
> the initial and write the final restart files to the disk, in contrast
> to the reported CPU time. In some cases, if the cluster network is
> loaded, it may take several minutes (!) to write big files (hundreds of
> MB). In CP the restart file is not partitioned as in PW, so there is a
> lot of traffic in collecting data from all the nodes and then
> writing it to the disk. For long runs you don't see the effect of the
> additional last disk write, but with only 20 MD steps it may
> become dominant. Were you also writing intermediate restart files during
> the 20 steps of the benchmark?
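A rough sketch of the estimate behind the point about restart files, under a purely hypothetical model: the data is spread evenly over `ncpu` processes on separate nodes, every share except the writer's is sent to one process, and the whole file is then written over the same link to a network disk. The 300 MB file size and the 2 MB/s loaded-link figure are illustrative assumptions, not measurements; 125 MB/s is simply the nominal rate of a 1 Gbit/s link.

```python
# Nominal payload rate of a 1 Gbit/s link in MB/s, ignoring protocol overhead.
GBIT_ETHERNET_MB_S = 1000 / 8


def gathered_restart_seconds(size_mb: float, ncpu: int, link_mb_s: float) -> float:
    """Rough time to gather a restart file on one process and write it to a network disk.

    Hypothetical model: data is spread evenly over ncpu processes on
    separate nodes, the (ncpu - 1)/ncpu share held by the other processes
    is sent to the writer, and the full file then crosses the same link
    to the file server. Latency is ignored; contention enters only
    through a lower link_mb_s.
    """
    gather_mb = size_mb * (ncpu - 1) / ncpu
    return gather_mb / link_mb_s + size_mb / link_mb_s


if __name__ == "__main__":
    size_mb = 300.0  # hypothetical restart-file size ("hundreds of MB")
    for label, link in [("idle 1 Gbit link", GBIT_ETHERNET_MB_S),
                        ("heavily loaded link (assumed)", 2.0)]:
        t = gathered_restart_seconds(size_mb, ncpu=8, link_mb_s=link)
        print(f"{label}: {t:.1f} s")
```

Under these assumptions the 8-process write takes about 4.5 s on an idle link but close to 5 minutes at 2 MB/s, the kind of spread that could dominate a 20-step benchmark.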
> Konstantin Kudin wrote:
> > Hi,
> > I've done some parallel benchmarks for the CP code, so I thought I'd
> >share them with the rest of the group. The system we have is a cluster
> >of dual Opterons at 2.0 GHz with 1 Gbit Ethernet.
> > I looked at 2 different measures of time: CPU time, and wall time
> >computed as the difference between "This run was started" and "This run
> >was terminated". By the way, such a wall time could probably be printed
> >by the code directly so that it is readily available.
> > The system is a reasonably sized simulation cell with 20 CP
> >(electronic+ionic) steps total.
> > The compiler is IFC 9.0, the GOTO library is used for BLAS, and mpich 1.2.6
> >is used for MPI. The CP version is the CVS from Aug. 20, 2005.
> > What is crazy is that even for 2 cpus sitting in the same box there
> >is lots of cpu time just lost somewhere. The strange thing is that the
> >quad we have at 2.2 GHz seems to lose just as much wall time as 2
> >duals talking across the network. And note how 4 cpus barely reach a 2x
> >speedup over single-cpu performance when wall clock time is considered.
> > I know Nicola Marzari has done some parallel benchmarks, but I do not
> >think that wall times were being paid attention to ...
> > Kostya
> >P.S. Any suggestions what might be going on here?
> >Ncpu      CPU time    Wall time
> >1         1h22m       1h24m
> >2         45m33.41s   57m13s
> >4         27m30.80s   44m21s
> >6         18m22.71s   43m18s
> >8         14m53.91s   45m56s
> >4 (quad)  37m18.56s   45m32s
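The table can be reduced to a few derived numbers. This Python sketch (hypothetical helper names; the timings are copied from the table) computes the wall-clock speedup relative to the 1-cpu run, the corresponding parallel efficiency, and the fraction of wall time not reported as CPU time. It assumes the reported CPU time refers to a single process, which the benchmark does not state.

```python
import re

# Timings copied from the benchmark table: (label, ncpu, CPU time, wall time).
TIMINGS = [
    ("1", 1, "1h22m", "1h24m"),
    ("2", 2, "45m33.41s", "57m13s"),
    ("4", 4, "27m30.80s", "44m21s"),
    ("6", 6, "18m22.71s", "43m18s"),
    ("8", 8, "14m53.91s", "45m56s"),
    ("4(quad)", 4, "37m18.56s", "45m32s"),
]


def to_seconds(t: str) -> float:
    """Convert strings such as '1h22m' or '45m33.41s' to seconds."""
    m = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:([\d.]+)s)?", t)
    if m is None or not t:
        raise ValueError(f"unrecognised time: {t!r}")
    h, mins, s = (float(g) if g else 0.0 for g in m.groups())
    return 3600 * h + 60 * mins + s


def summarize(rows):
    """Yield (label, wall speedup vs. 1 cpu, wall efficiency, fraction of wall time not on CPU)."""
    wall_1 = to_seconds(rows[0][3])  # serial wall time is the reference
    for label, ncpu, cpu, wall in rows:
        cpu_s, wall_s = to_seconds(cpu), to_seconds(wall)
        speedup = wall_1 / wall_s
        yield label, speedup, speedup / ncpu, 1.0 - cpu_s / wall_s


if __name__ == "__main__":
    for label, speedup, eff, idle in summarize(TIMINGS):
        print(f"{label:>8}  speedup {speedup:4.2f}  efficiency {eff:4.0%}  not on CPU {idle:4.0%}")
```

With these numbers the 8-cpu run is only about 1.8x faster than the serial run, and about two thirds of its wall time does not show up as CPU time.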
> >Pw_forum mailing list
> >Pw_forum at pwscf.org
> Zilberman Silviu
> 213 Frick Laboratory, Department of Chemistry
> Princeton University
> Princeton, NJ 08544
> phone: 609-258-1834
> fax: 609-258-6746
> silviu at Princeton.EDU