[Pw_forum] abysmal parallel performance of the CP code

Konstantin Kudin konstantin_kudin at yahoo.com
Wed Sep 28 18:34:48 CEST 2005

 Thanks to Axel and Alexander for suggestions on this issue! Some of
them were quite helpful.

 I have investigated things further, in a more careful way. What
happens is that the GOTO library uses its own threads, and that causes
the CP90 time to be underestimated while the wall time stays roughly
the same. This also makes the 1-cpu times look small, since the 2nd GOTO
thread is quietly computing whenever the 2nd cpu is available.
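
 To illustrate the kind of discrepancy described above, here is a minimal
Python sketch. It is only an analogy and rests on an assumption not confirmed
here: that the reported time counts CPU time of the calling thread only. The
names busy_work and measure are made up for the example. The work is done in
a helper thread, so the main thread's CPU timer barely moves while the wall
clock covers the whole computation.

```python
import threading
import time


def busy_work(n=3_000_000):
    """CPU-bound loop standing in for work done by a library thread."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def measure():
    """Return (wall seconds, CPU seconds charged to the main thread)."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time of the calling thread only

    worker = threading.Thread(target=busy_work)
    worker.start()
    worker.join()  # main thread blocks here and burns almost no CPU

    main_cpu = time.thread_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, main_cpu


wall, main_cpu = measure()
print(f"wall: {wall:.3f}s  main-thread CPU: {main_cpu:.3f}s")
```

 A per-thread timer like this misses the helper thread's work, which is one
way a reported CPU time can end up well below the wall time.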

 I used my own test case, with water and some organic stuff in it. The
run does 20 CP steps starting from a restart file on a local disk. I
eliminated the nfs issues by writing the stuff directly to the local
disk on the head node. Also, for runs with 1 thread the 2nd cpu on the
dual node was kept idle.

 For mpich1 it seems that there is little difference in wall times
whether shared memory is used or not. However, CP90 times appear much
smaller for sockets. For mpich2 shared memory just hangs, while sockets
work. Also, mpich2 is worse than mpich1 in the bad situations (more than
4 cpus). In all cases the launcher was mpiexec from Ohio SC, compiled
accordingly.

 Below are some actual numbers. The 1 or 2 at the end of the job name means
that GOTO_NUM_THREADS was set to 1 or 2.

 It looks like 2 dual nodes (4 cpus) with 1Gbit is the fastest
configuration, with a speedup of about 3x. Beyond that things go downhill.
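
 The 3x figure can be checked from the wall times. A small Python sketch,
using the mpich1 shared memory wall times from this post for the runs with
GOTO_NUM_THREADS=1; the parallel efficiency column (speedup divided by the
number of cpus) is an addition for illustration, and speedup_table is a
made-up helper name.

```python
# Wall times in minutes, keyed by number of cpus
# (mpich1 with shared memory, GOTO_NUM_THREADS=1 runs).
wall_minutes = {1: 106.5, 2: 62.6, 4: 34.8, 6: 42.7, 8: 45.3}


def speedup_table(wall):
    """Map cpu count -> (speedup vs. 1 cpu, parallel efficiency)."""
    base = wall[1]
    return {n: (base / t, base / t / n) for n, t in sorted(wall.items())}


table = speedup_table(wall_minutes)
for n, (speedup, efficiency) in table.items():
    print(f"{n} cpus: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

 With these numbers the speedup peaks at 4 cpus and drops again at 6 and 8.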


mpich1 with shared memory
job name  G  N_thread   Wall    CP90   
sh1.out.0.1     1       106.5m  1h46m
sh1.out.0.2     1       85.8m   1h24m
sh1.out.1.1     2       62.6m   1h
sh1.out.1.2     2       59.4m   48m
sh1.out.2.1     4       34.8m   33m51.78s
sh1.out.2.2     4       34.6m   27m46.72s
sh1.out.3.1     6       42.7m   37m29.63s
sh1.out.3.2     6       (hangs)
sh1.out.4.1     8       45.3m   37m43.38s
sh1.out.4.2     8       45.6m   35m31.07s

mpich1 with sockets
job name  G  N_thread   Wall    CP90 
so1.out.0.1     1       108.1m  1h47m
so1.out.0.2     1       84.6m   1h23m
so1.out.1.1     2       61.7m   1h
so1.out.1.2     2       62.9m   50m32.96s
so1.out.2.1     4       35.6m   33m10.21s
so1.out.2.2     4       35.6m   27m20.13s
so1.out.3.1     6       42.6m   22m59.73s
so1.out.3.2     6       42.5m   18m50.57s
so1.out.4.1     8       45.1m   18m31.35s
so1.out.4.2     8       43.4m   14m47.97s

mpich2 with sockets
job name  G  N_thread   Wall    CP90
so2.out.0.1     1       105.3m  1h45m
so2.out.0.2     1       84.5m   1h23m
so2.out.1.1     2       60.7m   1h
so2.out.1.2     2       59.4m   47m31.77s
so2.out.2.1     4       34.6m   32m26.08s
so2.out.2.2     4       34.3m   26m10.19s
so2.out.3.1     6       53.5m   23m
so2.out.3.2     6       53.0m   18m
so2.out.4.1     8       54.0m   17m56.94s
so2.out.4.2     8       53.2m   14m44.32s
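
 To compare the CP90 column with the wall column directly, the mixed time
strings have to be converted to minutes first. A possible Python sketch
follows; to_minutes is a made-up helper, and the three sample rows are copied
from the mpich1 sockets table above.

```python
import re

# Matches strings such as "1h46m", "1h", "23m" or "33m51.78s".
_TIME_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?$")


def to_minutes(text):
    """Convert a time string in the table's h/m/s notation to minutes."""
    match = _TIME_RE.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized time string: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0) + float(seconds or 0) / 60


# (job name, wall time in minutes, CP90 time as printed)
runs = [
    ("so1.out.0.1", 108.1, "1h47m"),
    ("so1.out.2.1", 35.6, "33m10.21s"),
    ("so1.out.4.1", 45.1, "18m31.35s"),
]

for name, wall, cp90 in runs:
    ratio = to_minutes(cp90) / wall
    print(f"{name}: CP90/wall = {ratio:.2f}")
```

 For these rows the ratio is close to 1 on 1 cpu and falls well below 1 on
8 cpus, which is the gap between CP90 and wall times discussed above.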

