[Pw_forum] abysmal parallel performance of the CP code
konstantin_kudin at yahoo.com
Wed Sep 21 19:48:32 CEST 2005
I've done some parallel benchmarks for the CP code so I thought I'd
share them with the rest of the group. The system we have is a cluster
of dual 2.0 GHz Opterons with 1 Gbit Ethernet.
I looked at two measures of time: CPU time, and wall time computed as
the difference between the "This run was started" and "This run was
terminated" lines of the output. By the way, the code could probably
print this wall time directly so that it is readily available.
The system is a reasonably sized simulation cell with 20 CP
(electronic+ionic) steps total.
The compiler is IFC 9.0, the GOTO library provides BLAS, and MPICH 1.2.6
is used for MPI. The CP version is from CVS as of Aug. 20, 2005.
What is crazy is that even for 2 CPUs sitting in the same box there is
a lot of CPU time just lost somewhere. The strange thing is that the
quad we have at 2.2 GHz seems to lose just as much wall time as 2 duals
talking across the network. And note how 4 CPUs give barely a 2x
speedup over single-CPU performance if the wall clock time is considered.
I know Nicola Marzari has done some parallel benchmarks, but I do not
think that wall times were being paid attention to ...
P.S. Any suggestions what might be going on here?
Ncpu      CPU time     Wall time
1         1h22m        1h24m
2         45m33.41s    57m13s
4         27m30.80s    44m21s
6         18m22.71s    43m18s
8         14m53.91s    45m56s
4(quad)   37m18.56s    45m32s
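
To put numbers on the scaling in the table above, here is a minimal Python sketch that converts the timings to seconds and computes the wall-clock speedup relative to the 1-CPU run, the parallel efficiency (speedup divided by Ncpu), and the ratio of CPU time to wall time. The names `to_seconds` and `scaling` are made up for this illustration and are not part of the CP code; the timings are copied from the table as posted.

```python
import re

# Timings copied from the table in this post:
# (label, number of CPUs, CPU time, wall time)
TIMINGS = [
    ("1", 1, "1h22m", "1h24m"),
    ("2", 2, "45m33.41s", "57m13s"),
    ("4", 4, "27m30.80s", "44m21s"),
    ("6", 6, "18m22.71s", "43m18s"),
    ("8", 8, "14m53.91s", "45m56s"),
    ("4(quad)", 4, "37m18.56s", "45m32s"),
]

# Optional hours, minutes and (possibly fractional) seconds, e.g. "1h22m".
_TIME_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?")


def to_seconds(text):
    """Convert strings such as '1h22m' or '45m33.41s' to seconds."""
    match = _TIME_RE.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"unrecognized time string: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


def scaling(timings):
    """Return speedup, efficiency and CPU/wall ratio for each row.

    The first row is taken as the single-CPU reference.
    """
    base_wall = to_seconds(timings[0][3])
    rows = []
    for label, ncpu, cpu, wall in timings:
        cpu_s, wall_s = to_seconds(cpu), to_seconds(wall)
        speedup = base_wall / wall_s
        rows.append({
            "label": label,
            "speedup": speedup,             # wall-clock speedup vs. 1 CPU
            "efficiency": speedup / ncpu,   # 1.0 would be ideal scaling
            "cpu_over_wall": cpu_s / wall_s,  # share of wall time spent on CPU
        })
    return rows


ROWS = scaling(TIMINGS)
for row in ROWS:
    print(f"{row['label']:<8} speedup {row['speedup']:.2f}  "
          f"efficiency {row['efficiency']:.0%}  "
          f"CPU/wall {row['cpu_over_wall']:.0%}")
```

By this measure the wall-clock speedup stays below 2x for every row in the table, and the CPU/wall ratio shows how much of each run is spent outside the CPU time that the code reports.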