[Pw_forum] abysmal parallel performance of the CP code

Wed Sep 28 19:15:20 CEST 2005

Dear Kostya,

a long time ago I collated some of our tests on parallel
performance. Parallelism over gigabit ethernet
used to scale very well up to 4 nodes, and then flatten out.
All the details are here:

http://nnn.mit.edu/ESPRESSO/CP90_tests/

The "large" system has ~500 bands, so it's a reasonably challenging
task. All timings are wall-clock times.

We have more recent tests in the group (maybe someone could post them)
on dual Xeons at 3.4 GHz with an 800 MHz FSB. Our detailed experience there,
with a lot of help from Intel, was that the second CPU, living on the
same FSB, provides at best a 25%-30% performance boost. Opterons,
on the other hand, achieve excellent scaling across the two CPUs on the same
motherboard, but have worse performance at the single CPU level (I
believe due to the extensive use of MKL by the CP code).

My own feeling is that the best platform for CP is nodes of 4 or 6 PIVs,
with the fastest memory and FSB available. Gigabit will do fine
for this small number of nodes. No idea about the recent dual-core chips.

Anyone would be more than welcome to provide updated numbers for the
tests above (Axel did! And I haven't collated them yet...) - do
keep in mind that for the numbers to be accurate and faithful,
the machines need to be completely empty (i.e. no other jobs running).
Also, running on 4 cpus across 4 dual Xeons, where the second CPU on
each node is idle, is very different from running on 4 cpus on
two dual Xeons, all running.  OMP_NUM_THREADS is always set to 1.
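
For anyone collecting new numbers, here is a minimal sketch of a timing
wrapper that pins the thread counts before launching a run and records
wall-clock time. It is only an illustration: the function name timed_run
and the placeholder command are made up, and whether a given BLAS build
actually honours OMP_NUM_THREADS or GOTO_NUM_THREADS depends on how that
library was compiled.

```python
import os
import subprocess
import sys
import time


def timed_run(cmd, blas_threads=1):
    """Run cmd with thread counts pinned; return (wall_seconds, return_code)."""
    env = dict(os.environ)
    # Variable names are the ones mentioned in this thread. Setting them is an
    # assumption about the BLAS in use, not a guarantee that it obeys them.
    env["OMP_NUM_THREADS"] = str(blas_threads)
    env["GOTO_NUM_THREADS"] = str(blas_threads)
    start = time.perf_counter()          # wall clock, not CPU time
    result = subprocess.run(cmd, env=env)
    wall = time.perf_counter() - start
    return wall, result.returncode


if __name__ == "__main__":
    # Placeholder command: substitute the real launcher line for your cluster.
    demo = [sys.executable, "-c",
            "import os; print(os.environ['OMP_NUM_THREADS'])"]
    wall, code = timed_run(demo)
    print(f"wall = {wall:.2f} s, exit code = {code}")
```

The wrapper cannot check that the nodes are otherwise empty; that still
has to be verified by hand before trusting the timing.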

Let us know,

			nicola

Konstantin Kudin wrote:

>  Thanks to Axel and Alexander for suggestions on this issue! Some of
> them were quite helpful.
> 
>  I have investigated things further, in a more careful way. What
> happens is that the GOTO library uses its own threads, and that causes
> the CP90 time to be underestimated while the wall time stays roughly
> the same. This also makes 1cpu times look small, since the 2nd GOTO
> thread is quietly computing whenever the 2nd cpu is available.
> 
>  I used my own test case, with water and some organic stuff in it. The
> run does 20 CP steps starting from a restart file on a local disk. I
> eliminated the nfs issues by writing the stuff directly to the local
> disk on the head node. Also, for runs with 1 thread the 2nd cpu on the
> dual node was kept idle.
> 
>  For mpich1 it seems that there is little difference in wall times
> whether shared memory is used or not. However, CP90 times appear much
> smaller for sockets. For mpich2 shared memory just hangs, while sockets
> work. Also, mpich2 is worse in bad situations (more than 4 cpus) than
> mpich1. The launcher was mpiexec from Ohio SC, compiled accordingly, in
> all cases.
> 
>  Below are some actual numbers. 1 or 2 at the end of the job name means
>  that the GOTO_NUM_THREADS was either 1 or 2.
> 
>  It looks like 2 dual nodes with 1Gbit is the fastest, and the speedup
> is about 3x. Beyond that things go downhill.
> 
>  Kostya
> 
> #########################
> mpich1 with shared memory
> job name  G  N_thread   Wall    CP90   
> sh1.out.0.1     1       106.5m  1h46m
> sh1.out.0.2     1       85.8m   1h24m
> sh1.out.1.1     2       62.6m   1h
> sh1.out.1.2     2       59.4m   48m
> sh1.out.2.1     4       34.8m   33m51.78s
> sh1.out.2.2     4       34.6m   27m46.72s
> sh1.out.3.1     6       42.7m   37m29.63s
> sh1.out.3.2 (hangs)
> sh1.out.4.1     8       45.3m   37m43.38s
> sh1.out.4.2     8       45.6m   35m31.07s
> 
> mpich1 with sockets
> job name  G  N_thread   Wall    CP90 
> so1.out.0.1     1       108.1m  1h47m
> so1.out.0.2     1       84.6m   1h23m
> so1.out.1.1     2       61.7m   1h
> so1.out.1.2     2       62.9m   50m32.96s
> so1.out.2.1     4       35.6m   33m10.21s
> so1.out.2.2     4       35.6m   27m20.13s
> so1.out.3.1     6       42.6m   22m59.73s
> so1.out.3.2     6       42.5m   18m50.57s
> so1.out.4.1     8       45.1m   18m31.35s
> so1.out.4.2     8       43.4m   14m47.97s
> 
> mpich2 with sockets
> job name  G  N_thread   Wall    CP90
> so2.out.0.1     1       105.3m  1h45m
> so2.out.0.2     1       84.5m   1h23m
> so2.out.1.1     2       60.7m   1h
> so2.out.1.2     2       59.4m   47m31.77s
> so2.out.2.1     4       34.6m   32m26.08s
> so2.out.2.2     4       34.3m   26m10.19s
> so2.out.3.1     6       53.5m   23m
> so2.out.3.2     6       53.0m   18m
> so2.out.4.1     8       54.0m   17m56.94s
> so2.out.4.2     8       53.2m   14m44.32s
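
The "about 3x" figure can be checked directly from the wall times. The
sketch below is not part of the original posts: it copies the
"mpich1 with sockets" rows with one GOTO thread, reads the N_thread
column as the number of CPUs (my interpretation), and computes speedup,
parallel efficiency, and the fraction of wall time that the CP90 timer
accounts for. The helper names to_minutes and summarise are made up.

```python
import re


def to_minutes(text):
    """Convert strings such as '106.5m', '1h46m' or '33m51.78s' to minutes."""
    parts = re.fullmatch(r"(?:([\d.]+)h)?(?:([\d.]+)m)?(?:([\d.]+)s)?", text)
    if parts is None or not any(parts.groups()):
        raise ValueError(f"unrecognised time: {text!r}")
    hours, minutes, seconds = (float(p) if p else 0.0 for p in parts.groups())
    return 60.0 * hours + minutes + seconds / 60.0


# (CPUs, wall, CP90) copied from the so1.out.*.1 rows of the table above.
rows = [
    (1, "108.1m", "1h47m"),
    (2, "61.7m", "1h"),
    (4, "35.6m", "33m10.21s"),
    (6, "42.6m", "22m59.73s"),
    (8, "45.1m", "18m31.35s"),
]


def summarise(rows):
    """Return (cpus, speedup, efficiency, cp90_over_wall) for each row."""
    base = to_minutes(rows[0][1])        # 1-cpu wall time is the reference
    out = []
    for ncpu, wall, cp90 in rows:
        w = to_minutes(wall)
        speedup = base / w
        out.append((ncpu, speedup, speedup / ncpu, to_minutes(cp90) / w))
    return out


for ncpu, speedup, eff, ratio in summarise(rows):
    print(f"{ncpu} cpus: speedup {speedup:.2f}, "
          f"efficiency {eff:.2f}, CP90/wall {ratio:.2f}")
```

On these rows the wall-clock speedup peaks at 4 cpus, at a little over 3,
and drops beyond that. The CP90/wall ratio falls below one half at 8 cpus,
which is the gap between the reported CP90 time and the wall time
described in the message.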
> 

-- 
---------------------------------------------------------------------
Prof Nicola Marzari   Department of Materials Science and Engineering
13-5066   MIT   77 Massachusetts Avenue   Cambridge MA 02139-4307 USA
tel 617.4522758  fax 617.2586534  marzari at mit.edu  http://nnn.mit.edu