[Pw_forum] openmp vs mpich performance with MKL 10.x

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Tue May 6 21:27:24 CEST 2008


On Tue, 6 May 2008, Nicola Marzari wrote:

NM> 
NM> Dear Eduardo,


hi nicola,

NM> 1) no improvements with the Intel fftw2 wrapper, as opposed to fftw2
NM> Q-E sources, when using mpi. I also never managed to successfully run
NM> with the Intel fftw3 wrapper (or with fftw3 - that probably says 
NM> something about me).

no, it doesn't.

NM> 2) great improvements of a serial code (different from Q-E) when using
NM> the automatic parallelism of MKL in quad-cores.

nod. i just yesterday made some tests with different BLAS/LAPACK 
implementations, and it turns out that the 10.0 MKL is pretty
efficient at parallelizing tasks like DGEMM. through the use of 
SSE2/3/4 and multi-threading you can easily get a factor of 6
improvement on a 4-core node.
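
a quick way to check this yourself is to run the same serially compiled,
MKL-linked binary with different thread counts and compare the wall times
(binary name, input file and thread counts below are just placeholders):

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
time ./pw.x < scf.in > scf.out.1thread

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
time ./pw.x < scf.in > scf.out.4threads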

NM> 
NM> 3) btw, MPICH has always been for us the slower protocol, compared with
NM> LAMMPI or OpenMPI
NM> 
NM> I actually wonder if the best solution on a quad-core would be, say,
NM> to use two cores for MPI, and the other two for the openmp threads.

this is a _very_ tricky issue. usually for plane wave pseudopotential
codes, the distributed-data parallelization is pretty efficient, except
for the 3d fourier transforms across the whole data set, which are very
sensitive to network latencies. for jobs using k-points, you also have
the option to parallelize over k-points, which is very efficient, even on
not-so-fast networks. with the CVS versions, you have another level of
parallelism added (parallelization over functions instead of data = task
groups). thus, given an ideal network, you first want to exploit MPI 
parallelism maximally; what is left over is then rather small, and - sadly 
- OpenMP doesn't work very efficiently on that. the overhead of 
spawning, synchronizing and joining threads is too high compared to
the gain through parallelism. 
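
just to illustrate how those MPI levels are selected at run time: the
-npool and -ntg flags below are the names used in later QE releases (the
CVS version may use different spellings), and the numbers are only an
example:

mpiexec -n 16 pw.x -npool 4 -ntg 2 < scf.in > scf.out

this splits the 16 MPI tasks into 4 k-point pools of 4 tasks each, and
groups the tasks within a pool into 2 task groups for the 3d FFTs.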

but we live in the real world, and there are unexpected side effects and
non-ideal machines and networks. when using nodes with many cores, 
say a two-socket quad-core, you have to "squeeze" a lot of communication 
through just one network card (be it infiniband, myrinet or ethernet), 
which will serialize communication and add unwanted conflicts and latencies. 
i've seen this happen particularly when using a very large number 
of nodes, where you can run out of (physical) memory simply because
of the way the low-level communication is implemented.

in that case you may indeed be better off using only half or a 
quarter of the cores with MPI and then setting OMP_NUM_THREADS to 2, 
or even keeping it at 1 (because that will, provided you have an
MPI with processor affinity and optimal job placement, double the
cpu cache available to each task). it is particularly interesting to look 
at this from the perspective of multi-core nodes connected by a high-latency
TCP/IP network (e.g. gigabit ethernet). here, with one MPI task per
node you reach the limit of scaling pretty fast, and using 
multiple MPI tasks per node mostly multiplies the latencies, 
which does not help. under those circumstances the data set is still 
rather large, and then OpenMP parallelism can help to get the most
out of a given machine. as noted before, it would be _even_ better
if OpenMP directives were added to the time-critical and multi-threadable
parts of QE. i have experienced this in CPMD, where i managed to
get about 80% of the MPI performance with the latest (extensively 
threaded) development sources and a fully multi-threaded toolchain 
on a single node. however, running across multiple nodes quickly 
reduces the effectiveness of the OpenMP support; with just two nodes
you are down to 60%.
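
as a concrete (hypothetical) example of such a hybrid setup on a
two-socket quad-core node, assuming an MPI that does sensible process
placement and honors processor affinity:

export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=2
mpiexec -n 4 pw.x < scf.in > scf.out

i.e. only 4 MPI tasks on the 8-core node, so that each task has a spare
core for its second thread (and effectively twice the cache per task).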

now, deciding on the best combination of options is a very tricky 
multi-dimensional optimization problem. you have to 
consider the following:

- the size of the typical problem and job type
- whether you can benefit from k-point parallelism
- whether you prefer faster execution over cost 
  efficiency and throughput.
- the total amount of money you want to spend
- the skillset of people that have to run the machine
- how many people have to share the machine.
- how I/O bound the jobs are.
- how much memory you need and how much money you 
  are willing to invest in faster memory.
- failure rates and the level of service
  (gigabit equipment is easily available).

also, some of those parameters are (non-linearly) coupled,
which makes the decision-making process even nastier.

cheers,
   axel.

NM> 
NM> I eagerly await Axel's opinion.
NM> 
NM> 			nicola
NM> 
NM> Eduardo Ariel Menendez Proupin wrote:
NM> > Hi,
NM> > I have noted recently that I am able to obtain faster binaries of pw.x 
NM> > using the OpenMP parallelism implemented in the Intel MKL libraries 
NM> > of version 10.xxx, than using MPICH, on the Intel cpus. Previously I had 
NM> > always gotten better performance using MPI. I would like to know about 
NM> > other experiences on how to make the machines faster. Let me explain in 
NM> > more detail.
NM> > 
NM> > Compiling using MPI means using mpif90 as linker and compiler, linking 
NM> > against mkl_ia32 or mkl_em64t, and using the link flags -i-static -openmp. 
NM> > This is just what appears in the make.sys after running configure 
NM> > in version 4cvs.
NM> > 
NM> > At runtime, I set
NM> > export OMP_NUM_THREADS=1
NM> > export MKL_NUM_THREADS=1
NM> > and run using
NM> > mpiexec -n $NCPUs pw.x <input >output
NM> > where NCPUs  is the number of cores available in the system.
NM> > 
NM> > The second choice is
NM> > ./configure --disable-parallel
NM> > 
NM> > and at runtime
NM> > export OMP_NUM_THREADS=$NCPU
NM> > export MKL_NUM_THREADS=$NCPU
NM> > and run using 
NM> > pw.x <input >output
NM> > 
NM> > I have tested it on quad-cores (NCPU=4) and with an old dual Xeon B.C. 
NM> > (before cores) (NCPU=2).
NM> > 
NM> > Before April 2007, the first choice had always worked faster. After 
NM> > that, when I came to use MKL 10.xxx, the second choice started working 
NM> > faster. I have found no significant difference between versions 3.2.3 and 
NM> > 4cvs.
NM> > 
NM> > A special comment is for the FFT library. The MKL has a wrapper to the 
NM> > FFTW that must be compiled after installation (it is very easy). This 
NM> > creates additional libraries named like libfftw3xf_intel.a and 
NM> > libfftw2xf_intel.a.
NM> > This improves the performance in the second choice, especially 
NM> > with libfftw3xf_intel.a.
NM> > 
NM> > Using MPI, libfftw2xf_intel.a is as fast as using the FFTW source 
NM> > distributed with espresso, i.e., there is no gain in using 
NM> > libfftw2xf_intel.a. With libfftw3xf_intel.a and MPI, I have never been 
NM> > able to run pw.x successfully; it just aborts.
NM> > 
NM> > I would like to hear of your experiences.
NM> >  
NM> > Best regards
NM> > Eduardo Menendez
NM> > University of Chile
NM> > 
NM> > 
NM> 
NM> 
NM> 

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.


