[Pw_forum] FFT & MPI - issues with latency

Axel Kohlmeyer akohlmey at vitae.cmm.upenn.edu
Sun Jan 29 21:02:50 CET 2006


On Sun, 29 Jan 2006, Konstantin Kudin wrote:

KK>  Hi all,

hi kostya,

KK>  I am wondering if anybody has investigated systematically the
KK> performance of different MPI implementations with the QE package. Was
KK> anyone successful in using Open MPI, which is the project spawned off
KK> by the LAM-MPI people?

you are opening a _big_ can of worms. not only do the different 
implementations matter, there are also several tunable parameters
for each of them that may or may not help.

KK>  The issue seems to be that FFTs are very latency driven, and here are
KK> some tests from the past
KK> http://www.hpc.sfu.ca/bugaboo/fft-performance.html which seem to
KK> indicate that MPICH and LAM-MPI differ significantly with respect to
KK> their performance as it applies to the FFTW library.

it is not the fft itself, but a _distributed_ fft, where the 
performance basically boils down to the performance of the 
MPI_Alltoall implementation. the observation in the URL is still
consistent with my more recent experiments, IFF you are using
ethernet communication. there the TCP/IP overhead is very high
and LAM-MPI has the better collective communication algorithms.
still, the extremely large TCP/IP overhead limits the 
scaling significantly.
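
to make this concrete, here is a minimal sketch of that transpose step
in plain MPI/C. buffer size and setup are made up for illustration and
this is not code taken from QE or CPMD; only the use of MPI_Alltoall
reflects what the distributed fft actually does:

  /* transpose step of a slab-decomposed 3d fft: every process has to
   * exchange one chunk of the grid with every other process before it
   * can do the 1d ffts along the distributed direction locally.      */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, nproc;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nproc);

      const int chunk = 4096;  /* doubles per pair of processes (made up) */
      double *sendbuf = malloc((size_t)chunk * nproc * sizeof(double));
      double *recvbuf = malloc((size_t)chunk * nproc * sizeof(double));
      for (int i = 0; i < chunk * nproc; ++i) sendbuf[i] = (double)rank;

      /* this single collective is the communication step the whole
       * discussion is about; in a real fft the per-pair chunks shrink
       * as more processes are added, so latency ends up dominating.  */
      MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                   recvbuf, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }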

the picture changes, however, in the case of a myrinet interconnect,
where the currently released LAM/MPI version is still not optimal
and has higher latencies compared to MPICH-gm.

KK>  The experience we have here with QE on dual Opterons is that there is
KK> no gain in the cpu time at all if one goes above 2 nodes with 2 cpus
KK> using Gigabit. CPMD on this Opteron cluster behaved pretty much
KK> identically to CP, thus indicating that the problem transcends slight

note that this is only true for the G/R-space decomposition.
the scaling in that case is indeed very limited, but for large
enough problems you may be able to go up to 6-8 nodes. in the
case of CPMD there is an additional improvement when you cache
the G-space to R-space fourier transforms using REAL SPACE WFN KEEP
and thus limit the dependence on the all-to-all.
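
for reference, that keyword goes into the &CPMD section of the input;
a minimal sketch (the section layout is the standard CPMD input format,
and the surrounding keyword is just a placeholder, not a recommendation):

  &CPMD
    MOLECULAR DYNAMICS CP
    REAL SPACE WFN KEEP
  &END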

KK> differences in implementations. Nicola's benchmarks with LAM-MPI [
KK> http://nnn.mit.edu/ESPRESSO/CP90_tests/CP90.timings.large ] seemed to
KK> indicate that there was some improvement after 4 cpus.
KK> 
KK>  So now I am starting to think that MPICH may be responsible for no
KK> improvement at all above 4 cpus.

to get reasonable scaling you first have to get low-latency
communication. this is most easily (but also at a high cost) achieved
by buying hardware (myrinet, infiniband, dolphin-SCI, quadrics), but
there are also some MPI implementations that support communication
bypassing the TCP layer, e.g. scali and parastation. see:
http://www.scali.com
http://www.cluster-competence-center.com/ccc2.php?page=parastation&lang=en

the latter is supposedly available at no cost for academic institutions 
now. BTW: both packages started as modified MPICH versions.
i did a test with parastation when it was fully commercial and
found some improvements, but at that time not enough to warrant the 
price compared to the gain from buying faster communication hardware.
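
whether a given setup actually delivers low latency is easy to check
with the textbook ping-pong microbenchmark; a minimal sketch in plain
MPI/C (message size and repeat count are arbitrary), to be run with
two processes placed on two different nodes:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      char msg = 0;
      const int reps = 10000;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < reps; ++i) {
          if (rank == 0) {          /* bounce a 1-byte message back and forth */
              MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double t1 = MPI_Wtime();

      if (rank == 0)   /* half of the round-trip time = one-way latency */
          printf("latency: %.2f us\n", 0.5e6 * (t1 - t0) / reps);

      MPI_Finalize();
      return 0;
  }

the gap between plain gigabit/TCP and the low-latency interconnects
shows up directly in this number, and it is exactly the gap the
all-to-all feels.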

if you want to get to _extreme_ scaling, latency is all that matters.
then even the operating system, which runs a lot of stuff in the
background, can become a problem. extremely scalable machines like BG/L
or the Cray XT3 eliminate this by running parallel jobs in a transputer 
style, i.e. by having a modified kernel on the compute nodes that will 
run only one process: your computation. e.g., on a cray xt3 i was 
recently able to run a pw.x job across 768 processors (12 k-points, 
i.e. 64 PEs per k-point), where the performance was only limited by
the wavefunction i/o. if we had an option diskio='none', that
would probably give much better performance and scale even better.
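
(for the curious: such a run distributes the k-points over 'pools' of
processors; with the pool option of pw.x the launch would look roughly
like

  mpirun -np 768 pw.x -npool 12 < job.in > job.out

where the exact spelling of the option and of the launcher command
depend on the installation, and job.in is just a placeholder name.)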

on the same machine, using CPMD with all available tricks enabled, one
could easily scale a 100 ry / 32-water system to 1024 processors, where
the time spent on i/o would limit further scaling.

KK> 
KK>  Ideas?

my conclusion is: if you have the money, buy the hardware. 

in that case you should also keep an eye on the topology. the more 
nodes you want to hook up, the larger the advantage of switchless, 
2-d or 3-d torus communication (dolphin-sci, cray-xt3, BG/L) over 
switch-based systems like myrinet and infiniband. for many nodes 
you have to build Clos networks and thus get additional latencies 
and suffer from backplane bandwidth limitations.

if you don't have the money, support people working on TCP-bypass
MPI implementations and hope for the best.

for the (comparatively small) differences between current MPI 
implementations, i would prefer correctness and convenience 
over speed. if a package is difficult to use, people will 
waste more time on failed jobs than you'd gain from the 
faster library. also, improving the QE code itself to reduce 
or avoid those 'costly' operations is probably more useful.

best regards,
     axel.

KK> 
KK>  Kostya

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.



