[Pw_forum] abysmal parallel performance of the CP code

Wed Sep 21 19:48:32 CEST 2005

 Hi,

 I've run some parallel benchmarks of the CP code and thought I'd
share them with the rest of the group. Our machine is a cluster of
dual 2.0 GHz Opterons connected by 1 Gbit Ethernet.

 I looked at two measures of time: CPU time, and wall time computed as
the difference between the "This run was started" and "This run was
terminated" timestamps. By the way, the code could probably print this
wall time directly so that it is readily available.

 The test case is a reasonably sized simulation cell, run for 20 CP
(electronic+ionic) steps in total.

 The compiler is IFC 9.0, BLAS comes from the GOTO library, and MPI is
mpich 1.2.6. The CP version is the CVS snapshot from Aug. 20, 2005.

 What is crazy is that even for 2 CPUs sitting in the same box, a lot
of time is simply lost somewhere between CPU time and wall time. The
strange thing is that the quad we have at 2.2 GHz ends up with about
the same wall time as 2 duals talking across the network. And note that
4 CPUs are only about 1.9x faster than a single CPU if the wall clock
time is considered.

 I know Nicola Marzari has done some parallel benchmarks, but I do not
think wall times were looked at there ...

 Kostya

P.S. Any suggestions as to what might be going on here?

Ncpu      CPU time     Wall time
1         1h22m        1h24m
2         45m33.41s    57m13s
4         27m30.80s    44m21s
6         18m22.71s    43m18s
8         14m53.91s    45m56s

4 (quad)  37m18.56s    45m32s
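
To put numbers on the scaling, here is a small Python sketch that turns the cluster rows of the table into wall-clock speedup, parallel efficiency, and the fraction of wall time that does not show up as CPU time. The names to_seconds, scaling, and runs are only illustrative, and I am assuming that the reported CPU time is per process rather than summed over all processes. The quad row is left out because it is a different machine with no single-CPU baseline.

```python
import re

def to_seconds(text):
    """Convert strings such as '1h22m' or '45m33.41s' to seconds."""
    units = {"h": 3600.0, "m": 60.0, "s": 1.0}
    return sum(float(value) * units[unit]
               for value, unit in re.findall(r"([\d.]+)([hms])", text))

# (Ncpu, CPU time, wall time) copied from the cluster rows of the table
runs = [
    (1, "1h22m", "1h24m"),
    (2, "45m33.41s", "57m13s"),
    (4, "27m30.80s", "44m21s"),
    (6, "18m22.71s", "43m18s"),
    (8, "14m53.91s", "45m56s"),
]

def scaling(runs):
    """Return (ncpu, speedup, efficiency, lost) for each run.

    speedup    = single-CPU wall time / wall time on ncpu CPUs
    efficiency = speedup / ncpu
    lost       = fraction of wall time not accounted for as CPU time
                 (assumes CPU time is reported per process)
    """
    base_wall = to_seconds(runs[0][2])
    rows = []
    for ncpu, cpu, wall in runs:
        wall_s = to_seconds(wall)
        speedup = base_wall / wall_s
        efficiency = speedup / ncpu
        lost = 1.0 - to_seconds(cpu) / wall_s
        rows.append((ncpu, speedup, efficiency, lost))
    return rows

print("Ncpu  speedup  efficiency  lost")
for ncpu, speedup, efficiency, lost in scaling(runs):
    print(f"{ncpu:4d}  {speedup:7.2f}  {efficiency:10.2f}  {lost:4.2f}")
```

With these numbers the wall-clock speedup never reaches 2: it is roughly 1.9 at 4 and 6 CPUs and drops again at 8 CPUs, where the efficiency is a bit over 20%.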
