[Pw_forum] scaling on clusters with different communication types
akohlmey at cmm.chem.upenn.edu
Sun Dec 10 01:55:05 CET 2006
On Sat, 9 Dec 2006, Kara, Abdelkader wrote:
KK> Thanks Axel for your valuable reply.
KK> I am actually looking for a quantitative comparison.
as i wrote before, this depends a _lot_ on the kind of
jobs you intend to run. please provide a representative
input file, and ask people to run it on hardware you are
interested in. quantitative data that is irrelevant to
your interests is worthless.
KK> I want to know if by using the infiniband or myrinet one can push
KK> the good (quasilinear) scaling to a higher number of CPUs. If
KK> someone did any benchmarking, I will appreciate your input.
this is _exactly_ what i wrote: once you run out of k-points and/or
NEB images to parallelize over, you _can_ parallelize the G/R-space
part. over gigabit ethernet this works for a few more cpu cores
(4-10 depending on system size). if you want to push further you _have_
to have a low latency network (i.e. myrinet or infiniband or dolphin/SCI
or Quadrics or NUMA-link or SeaStar or ...) since the G-space parallelization
has a lot of latency sensitive all-to-all communication. how _far_ you
can push it then depends on the system size, the relative cpu/memory
speed, your input, and many, many other details. note that
different programs and/or algorithms in the quantum espresso
package are parallelized to a different degree, and amdahl's law
dictates that you can only scale as much as the amount of parallelism
in the code allows, regardless of the speed of the interconnect.
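to see what amdahl's law implies in numbers, here is the textbook relation (not anything specific to quantum espresso -- the parallel fractions below are made-up illustrative values, not measured ones):

```python
# amdahl's law: if a fraction p of the work parallelizes perfectly,
# the speedup on n cores is s(n) = 1 / ((1 - p) + p / n),
# which saturates at 1 / (1 - p) no matter how many cores you add.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# even 99% parallel code tops out at 100x, and gets ~39x on 64 cores
for p in (0.90, 0.99, 0.999):
    print(f"p={p}: s(64) = {amdahl_speedup(p, 64):5.1f}, "
          f"limit = {1.0 / (1.0 - p):6.0f}")
```

this is why a faster interconnect alone cannot rescue a job whose serial fraction is already the bottleneck.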
e.g. a while ago, i've been able to run a 272 atom, 560 electron
pw.x job with a 4x4x4 MP-kpoint mesh across 768 processors on a
cray xt3 (using some special tricks to reduce file i/o since that
machine has no local disk at all). but i don't expect most typical
pw.x jobs to scale that far; it only works because the cray xt3
has a special 3d-torus network and a lightweight kernel that provide
high aggregate bandwidth and avoid the kinds of 'OS noise' that
usually limit scaling on PC clusters. there have been reports of
decent speedups across even more cpus (e.g. by andrea). quite a
while ago, carlo sbraccia posted some benchmark inputs for pw.x
and asked people to run them and post the results. i don't know
how applicable those numbers will be to you (see above, and also
because of optimizations to the code and newer hardware since then).
please check the mailing list archives.
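to see why the latency-sensitive all-to-all in the g-space parallelization punishes gigabit ethernet so hard, here is a back-of-the-envelope cost model (my own sketch, not taken from the quantum espresso code; the latency/bandwidth figures are rough ballpark assumptions for the hardware of the day):

```python
# toy model: in a naive all-to-all each of the p ranks exchanges a
# message with the other p-1 ranks, so the per-rank time is roughly
#   t = (p - 1) * (latency + message_bytes / bandwidth)
def alltoall_time(p, msg_bytes, latency_s, bandwidth_bps):
    return (p - 1) * (latency_s + msg_bytes / bandwidth_bps)

# assumed ballpark numbers: gigabit ethernet ~50 us latency, ~100 MB/s;
# infiniband ~5 us latency, ~1 GB/s.  for small (few-KiB) messages the
# latency term dominates, which is exactly the g-space regime.
for name, lat, bw in (("gigE", 50e-6, 100e6), ("infiniband", 5e-6, 1e9)):
    t = alltoall_time(64, 4096, lat, bw)
    print(f"{name}: 64-rank all-to-all of 4 KiB messages ~ {t * 1e3:.2f} ms")
```

real MPI libraries use smarter collective algorithms, but the qualitative conclusion survives: small-message all-to-all cost is set by latency, not bandwidth, which is why a low-latency interconnect is mandatory to push g-space parallelization far.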
KK> Thanks again
KK> Kader Kara
KK> From: Axel Kohlmeyer [mailto:akohlmey at cmm.chem.upenn.edu]
KK> Sent: Fri 12/8/2006 8:47 PM
KK> To: Kara, Abdelkader
KK> Cc: pw_forum at pwscf.org
KK> Subject: Re: [Pw_forum] scaling on clusters with different communication types
KK> On Fri, 8 Dec 2006, Kara, Abdelkader wrote:
KK> AK> Dear all,
KK> AK> Greetings.
KK> AK> I will appreciate it very much if you can share with me your experience
KK> AK> of running pwscf on clusters with different communication hardware.
KK> AK> I am interested in the scaling with the number of CPUs for the following 3
KK> AK> different communication types:
KK> AK> 1)gigabit ethernet
KK> AK> 2) myrinet
KK> AK> 3) InfiniBand
KK> scaling depends a lot on the kind of jobs you intend to run.
KK> pw.x scales almost independently and very well across NEB
KK> images and k-points even with gigabit ethernet. on top of
KK> that you can parallelize across g-space, which is much more
KK> demanding in terms of communication bandwidth and latency.
KK> in this case scaling across gigabit is limited to a few
KK> nodes. in-node performance is governed by available memory
KK> bandwidth, which results in hyper-threading being counter-productive,
KK> multi-core cpus having reduced efficiency (depending on job
KK> size, i.e. cache efficiency), and opteron cpus scaling better
KK> than intel (xeon) due to their dedicated per-cpu memory busses. only
KK> very recent intel (woodcrest) xeon cpus have been demonstrated
KK> to have a somewhat better performance and price/performance ratio.
KK> please note that these are general trends observed from
KK> some usage patterns and may not translate to your needs.
KK> the presence or absence of a per-node local scratch area
KK> also impacts performance. using an NFS filesystem for temporary
KK> storage usually results in degraded performance.
KK> basically, the larger your systems and the fewer k-points
KK> you need to use, the more important a fast interconnect becomes.
KK> compared with gigabit, infiniband and myrinet solutions perform
KK> more or less equivalently to each other, so using older/obsolete
KK> hardware can be a bargain.
KK> AK> Thank you very much for your input on this matter
KK> AK> Kader Kara
KK> AK> Physics Department
KK> AK> University of Central Florida
KK> AK> _______________________________________________
KK> AK> Pw_forum mailing list
KK> AK> Pw_forum at pwscf.org
KK> AK> http://www.democritos.it/mailman/listinfo/pw_forum
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
If you make something idiot-proof, the universe creates a better idiot.