[Pw_forum] abysmal parallel performance of the CP code

Alexander Shaposhnikov shaposh at isp.nsc.ru
Thu Sep 29 10:17:56 CEST 2005


(shorter version)
Dear Nicola,

I was somewhat worried about the low performance of Opterons you found,
because we switched from Xeons to Opterons recently.
So I went to my home computer and ran some tests with AgI.small.j
and CP 2.1.5. Here are the details:

System : dual dual-core Opteron 265, 1.8 GHz (overclocked a bit to 1.96 GHz;
by the way, it works perfectly at 2.35 GHz, tested many times with
Prime95, Linpack, etc.), 4 GB DDR400 (running at 430).

Compilers : Intel icc 9.0 and ifort 9.0, EM64T.
Flags : icc -O3 -xP ;
 ifort -O2 -xP for all .f90 files except cprstart.o ;
 ifort -O2 -xW for cprstart.o (to avoid the CPU-check error at runtime)

BLAS : ACML 2.7.0
MPI : MPICH2, ch3:shm device
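
A side note on the cprstart workaround: the point is simply to compile that one
file with -xW instead of -xP, since -xP code performs a run-time CPU check that
aborts on non-Intel (Opteron) processors. A minimal sketch of the two compile
commands (paths and module/include flags omitted, so this is illustrative only,
not the exact build line used by the espresso makefiles):

    # every other .f90 file:
    ifort -O2 -xP -c some_module.f90

    # the main program, built without the Intel-only CPU dispatch check:
    ifort -O2 -xW -c cprstart.f90 -o cprstart.o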

I tested with 1 and 4 MPI processes.
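
For concreteness, the runs were launched along these lines. This is only a
sketch, assuming the MPD process manager that ships with MPICH2 and a
hypothetical input file name (the AgI.small.j job script from the CP tests may
wrap this differently):

    mpdboot -n 1                    # start the MPD daemon on the local node
    export OMP_NUM_THREADS=1        # keep any threaded library to one thread
    mpiexec -n 1 ./cp.x < AgI.small.in > AgI.out.1.10   # 1 process
    mpiexec -n 4 ./cp.x < AgI.small.in > AgI.out.4.10   # 4 processes, one per core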
			
Timings:

  MPI procs   10 iterations   5 iterations
      1           11.50           7.35
      4            4.0            2.21

With cp.x from Espresso 2.1 I get close results (a little faster, though).
See also the attached .out files.
 
So, much faster even on much slower CPUs! Opterons are not that bad. In
fact, in most cases they are significantly faster than Xeons with
comparable model numbers.


On Wed, 2005-09-28 at 13:15 -0400, Nicola Marzari wrote:
> 
> Dear Kostya,
> 
> 
> a long time ago I had collated some of our tests on parallel
> performance - parallelism over gigabit ethernet
> used to go very well up to 4 nodes, and then flatten out.
> All the details are in here:
> 
> http://nnn.mit.edu/ESPRESSO/CP90_tests/
> 
> The "large" system has ~500 bands, so it's a reasonably challenging
> task. All the timings are always wall clock time.
> 
> We have more recent tests in the group (maybe someone could post them)
> on dual Xeon 3.4 GHz with an 800 MHz FSB. Our detailed experience there,
> with a lot of help from Intel, was that the second CPU, living on the
> same FSB, provides at best a 25%-30% performance boost. Opterons, on the
> other hand, achieve excellent scaling across the two CPUs on the same
> motherboard, but have worse performance at the single-CPU level (I
> believe because of the extensive use of Intel's MKL by the CP code).
> 
> My own feeling is that the best platform for CP is nodes of 4 or 6 PIVs,
> with the fastest memory and FSB available. Gigabit will do fine
> for this small number of nodes. No idea about the recent dual cores.
> 
> Anyone would be more than welcome to provide updated numbers for the
> tests above (Axel did! And I haven't collated them yet...) - do
> keep in mind that, in order to be accurate and faithful,
> the machines need to be completely empty (i.e. no other jobs running).
> Also, it makes a big difference whether you run on 4 CPUs across 4 dual
> Xeons, with the second CPU on each box idle, or on 4 CPUs on two dual
> Xeons, all running. OMP_NUM_THREADS is always set to 1.
> 
> 
> 
> Let us know,
> 
> 			nicola
> 
> 
> 
> Konstantin Kudin wrote:
> 
> >  Thanks to Axel and Alexander for suggestions on this issue! Some of
> > them were quite helpful.
> > 
> >  I have investigated things further, in a more careful way. What
> > happens is that the GOTO library uses its own threads, and that causes
> > the CP90 time to be underestimated while the wall time stays roughly
> > the same. This also makes the 1-CPU times look small, since the 2nd GOTO
> > thread is quietly computing whenever the 2nd CPU is available.
> > 
> >  I used my own test case, with water and some organic stuff in it. The
> > run does 20 CP steps starting from a restart file on a local disk. I
> > eliminated the NFS issues by writing everything directly to the local
> > disk on the head node. Also, for runs with 1 thread the 2nd CPU on the
> > dual node was kept idle.
> > 
> >  For mpich1 it seems that there is little difference in wall times
> > whether shared memory is used or not. However, CP90 times appear much
> > smaller for sockets. For mpich2 shared memory just hangs, while sockets
> > work. Also, mpich2 is worse than mpich1 in bad situations (more than 4
> > CPUs). The launcher was mpiexec from Ohio SC, compiled accordingly, in
> > all cases.
> > 
> >  Below are some actual numbers. A 1 or 2 at the end of the job name means
> > that GOTO_NUM_THREADS was either 1 or 2.
> > 
> >  It looks like 2 dual nodes over 1 Gbit is the fastest configuration, with
> > a speedup of about 3x. Beyond that, things go downhill.
> > 
> >  Kostya
> > 
> > #########################
> > mpich1 with shared memory
> > job name        CPUs    Wall    CP90
> > sh1.out.0.1     1       106.5m  1h46m
> > sh1.out.0.2     1       85.8m   1h24m
> > sh1.out.1.1     2       62.6m   1h
> > sh1.out.1.2     2       59.4m   48m
> > sh1.out.2.1     4       34.8m   33m51.78s
> > sh1.out.2.2     4       34.6m   27m46.72s
> > sh1.out.3.1     6       42.7m   37m29.63s
> > sh1.out.3.2 (hangs)
> > sh1.out.4.1     8       45.3m   37m43.38s
> > sh1.out.4.2     8       45.6m   35m31.07s
> > 
> > mpich1 with sockets
> > job name        CPUs    Wall    CP90
> > so1.out.0.1     1       108.1m  1h47m
> > so1.out.0.2     1       84.6m   1h23m
> > so1.out.1.1     2       61.7m   1h
> > so1.out.1.2     2       62.9m   50m32.96s
> > so1.out.2.1     4       35.6m   33m10.21s
> > so1.out.2.2     4       35.6m   27m20.13s
> > so1.out.3.1     6       42.6m   22m59.73s
> > so1.out.3.2     6       42.5m   18m50.57s
> > so1.out.4.1     8       45.1m   18m31.35s
> > so1.out.4.2     8       43.4m   14m47.97s
> > 
> > mpich2 with sockets
> > job name        CPUs    Wall    CP90
> > so2.out.0.1     1       105.3m  1h45m
> > so2.out.0.2     1       84.5m   1h23m
> > so2.out.1.1     2       60.7m   1h
> > so2.out.1.2     2       59.4m   47m31.77s
> > so2.out.2.1     4       34.6m   32m26.08s
> > so2.out.2.2     4       34.3m   26m10.19s
> > so2.out.3.1     6       53.5m   23m
> > so2.out.3.2     6       53.0m   18m
> > so2.out.4.1     8       54.0m   17m56.94s
> > so2.out.4.2     8       53.2m   14m44.32s
> > 
> > 
> > 
> 
-------------- next part --------------
Attachments (gzipped CP output files):
  AgI.out.1.10.gz (5474 bytes): <http://lists.quantum-espresso.org/pipermail/users/attachments/20050929/b05a38be/attachment.bin>
  AgI.out.1.5.gz  (4928 bytes): <http://lists.quantum-espresso.org/pipermail/users/attachments/20050929/b05a38be/attachment-0001.bin>
  AgI.out.4.10.gz (5567 bytes): <http://lists.quantum-espresso.org/pipermail/users/attachments/20050929/b05a38be/attachment-0002.bin>
  AgI.out.4.5.gz  (5055 bytes): <http://lists.quantum-espresso.org/pipermail/users/attachments/20050929/b05a38be/attachment-0003.bin>

