[Pw_forum] Insufficient Virtual Memory
Axel Kohlmeyer
akohlmey at cmm.chem.upenn.edu
Sat Feb 28 19:23:35 CET 2009
On Fri, Feb 27, 2009 at 6:34 PM, Vo, Trinh <trinh.vo at jpl.nasa.gov> wrote:
> Dear Axel,
>
> Thanks for clarification.
>
> About the benchmarks, I just simply to see how well is the performance of
> the cluster we bought in term of scaling with QE. I sent some plots to
> you, but the email did not go thru because of the restriction of the size
> (larger than 40K).
>
> Currently, I am not happy at the fact that the difference in CPU time and
> wall time is too large. When I run a longer job, which took ~2h CPU time
> long, the wall time was ~7h when I run from the head node, and ~4h when I
that probably means you ran a job that was too big for the machine and
was thus swapping all the time.
for your reference, here are some numbers from one of our local clusters.
the machine has: 2x Intel Xeon E5430 @ 2.66GHz and 8GB per node
and a 2xDDR infiniband interconnect.
this first block is from runs on four nodes with different -npernode values:
h2o-32-4x2.out: CP : 18.33s CPU time, 18.79s wall time
h2o-32-4x4.out: CP : 16.75s CPU time, 17.50s wall time
h2o-32-4x8.out: CP : 25.31s CPU time, 25.94s wall time
h2o-64-4x1.out: CP : 2m50.18s CPU time, 3m18.88s wall time
h2o-64-4x2.out: CP : 1m29.72s CPU time, 1m33.60s wall time
h2o-64-4x4.out: CP : 1m12.42s CPU time, 1m13.70s wall time
h2o-64-4x8.out: CP : 1m19.53s CPU time, 1m20.86s wall time
as you can see, same as with cp2k, using 8 cores per node is hurting
performance, especially for smaller jobs, and using 4 cores per node is a
much better choice.
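to compare such runs without reading the numbers by eye, here is a minimal python sketch that parses the four-node timing lines above and picks the -npernode value with the lowest wall time for each system. the regular expression and the helper names (to_seconds, wall_times, best_npernode) are assumptions made for this illustration and only cover the line format shown here.

```python
import re

# timing lines copied verbatim from the four-node runs listed above
RUNS = """\
h2o-32-4x2.out: CP : 18.33s CPU time, 18.79s wall time
h2o-32-4x4.out: CP : 16.75s CPU time, 17.50s wall time
h2o-32-4x8.out: CP : 25.31s CPU time, 25.94s wall time
h2o-64-4x1.out: CP : 2m50.18s CPU time, 3m18.88s wall time
h2o-64-4x2.out: CP : 1m29.72s CPU time, 1m33.60s wall time
h2o-64-4x4.out: CP : 1m12.42s CPU time, 1m13.70s wall time
h2o-64-4x8.out: CP : 1m19.53s CPU time, 1m20.86s wall time
"""

# assumed pattern: matches only the "<system>-4x<npernode>.out: CP : ..." lines above
LINE = re.compile(
    r"(?P<system>h2o-\d+)-4x(?P<npernode>\d+)\.out:\s*CP\s*:\s*"
    r"(?P<cpu>\S+)\s+CPU time,\s*(?P<wall>\S+)\s+wall time"
)


def to_seconds(stamp):
    """Convert a stamp such as '18.33s' or '2m50.18s' to seconds."""
    m = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return 60 * int(m.group(1) or 0) + float(m.group(2))


def wall_times(text):
    """Return {system: {npernode: wall time in seconds}}."""
    table = {}
    for m in LINE.finditer(text):
        runs = table.setdefault(m["system"], {})
        runs[int(m["npernode"])] = to_seconds(m["wall"])
    return table


def best_npernode(table):
    """Return the npernode value with the lowest wall time per system."""
    return {system: min(runs, key=runs.get) for system, runs in table.items()}


table = wall_times(RUNS)
best = best_npernode(table)
print(best)
```

for both the 32 and the 64 water systems this selects 4 tasks per node, in line with the numbers above.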
and here are the corresponding single node times (run on the frontend):
h2o-32-np1.out: CP : 2m24.38s CPU time, 2m39.38s wall time
h2o-32-np2.out: CP : 1m24.22s CPU time, 1m42.09s wall time
h2o-32-np4.out: CP : 48.92s CPU time, 51.58s wall time
h2o-32-np8.out: CP : 41.89s CPU time, 42.72s wall time
h2o-64-np2.out: CP : 6m39.17s CPU time, 7m49.54s wall time
h2o-64-np4.out: CP : 4m19.69s CPU time, 5m14.73s wall time
h2o-64-np8.out: CP : 4m12.16s CPU time, 4m24.57s wall time
the saturation of the memory bandwidth becomes apparent (little gain going
from 4 mpi tasks to 8 mpi tasks). keep in mind that on the intel quad cores
the difference between using 4 cores and 8 cores is especially drastic,
as the cpus share each L2 cache between two cores, so with 4 cores i have
effectively double the L2 cache per core compared to 8 cores. it would be
interesting to see somebody do a similar test with AMD quad cores, since
those are true quad cores.
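to put a number on "little gain", here is a small python sketch that computes the wall-time speedup when doubling the number of mpi tasks on a single node. the wall times are the single node values listed above, converted to seconds by hand; the function name gain is just a label chosen for this sketch.

```python
# single node wall times in seconds, converted by hand from the list above
WALL = {
    "h2o-32": {1: 159.38, 2: 102.09, 4: 51.58, 8: 42.72},
    "h2o-64": {2: 469.54, 4: 314.73, 8: 264.57},
}


def gain(runs, fewer, more):
    """Wall-time speedup when going from `fewer` to `more` MPI tasks."""
    return runs[fewer] / runs[more]


for system, runs in WALL.items():
    g = gain(runs, 4, 8)
    # perfect scaling would give a factor of 2 when doubling the task count
    print(f"{system}: 4 -> 8 tasks speeds up wall time by {g:.2f}x (ideal: 2.00x)")
```

both systems come out at roughly 1.2x for the step from 4 to 8 tasks, far from the factor of 2 that perfect scaling would give.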
you should also note that those timings contain some non-parallel overhead
that happens when starting a job. for testing production speed you should
run a 20 step and a 10 step job and then subtract the time for the 10 step
job from the time for the 20 step job to get the timing for 10 steps.
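the subtraction can be sketched in python as follows. the two wall times in the example are hypothetical placeholders, not measurements, and the function name steady_state is made up for this illustration. it assumes the startup overhead is the same in both runs and that every step costs about the same.

```python
def steady_state(t_short, t_long, n_short=10, n_long=20):
    """Separate per-step cost from startup overhead using two run lengths.

    Assumes both runs pay the same one-time startup cost and that each
    step takes roughly the same time.
    """
    per_step = (t_long - t_short) / (n_long - n_short)
    startup = t_short - n_short * per_step
    return per_step, startup


# hypothetical wall times in seconds for a 10 step and a 20 step job
t10 = 30.0
t20 = 50.0

per_step, startup = steady_state(t10, t20)
print(f"time for 10 production steps: {t20 - t10:.1f}s")
print(f"time per step: {per_step:.2f}s, startup overhead: {startup:.1f}s")
```

with these made-up inputs, 10 production steps cost 20s and the remaining 10s of the short run is startup overhead that would otherwise distort a scaling comparison.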
HTH,
axel.
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.