[Pw_forum] openmp vs mpich performance with MKL 10.x

Tue May 6 21:39:52 CEST 2008

On Tue, 6 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAMP> > there are two issues that need to be considered.
EAMP> >
EAMP> > 1) how large are your test jobs? if they are not large enough, timings are
EAMP> > pointless.

EAMP> about 15 minutes in Intel Quadcore. 66 atoms: Cd_30Te_30O_6. 576 
EAMP> electrons in total. My test may be very particular. If you have

hmmm... that is pretty large. would you mind sending me the input?
i'd like to make some verifications on my machine (two-socket dual core).

EAMP> a balanced benchmark, I would like to run it.

i've only done this kind of benchmark systematically with CPMD,
and only a few confirmation tests with QE. in general the G-space
parallelization is comparable. while the individual performance for
a specific problem can be quite different (QE is far superior with
ultra-soft and k-points, CPMD outruns cp.x with norm-conserving
pseudos), the scaling behavior was always quite similar for small
to medium numbers of nodes.

EAMP> > 2) it is most likely that you are still being tricked by the
EAMP> >   auto-parallelization of intel MKL. the export of OMP_NUM_THREADS
EAMP> >   will usually only work for the _local_ copy, and for some
EAMP> >   MPI startup mechanisms not at all. thus your MPI jobs will
EAMP> >   be slowed down.
EAMP> 
EAMP> I am using only SMP. Sorry, I still haven't a cluster of Quadcores.

that still does not mean that the environment is exported. some
MPICH versions have pretty awkward ways of starting MPI jobs
that do not always forward the environment.
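
one way to see what each process actually receives is to start a tiny
script through the same launcher as the real job. the following is only
a minimal sketch: the function names are made up for illustration, and
it assumes that OMP_NUM_THREADS and MKL_NUM_THREADS are the variables
of interest.

```python
import os

# variables that commonly control threading in OpenMP runtimes and in MKL
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")


def thread_env_report(environ=None):
    """Return {name: value} for THREAD_VARS; unset variables map to None."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in THREAD_VARS}


def format_report(report, pid):
    """Render one line per process so the output of all ranks can be compared."""
    parts = [
        f"{name}={'<unset>' if value is None else value}"
        for name, value in report.items()
    ]
    return f"pid {pid}: " + " ".join(parts)


if __name__ == "__main__":
    # each process started by the launcher prints what it sees
    print(format_report(thread_env_report(), os.getpid()))
```

started with the same mpirun command line as the real job (with the
script in place of the executable), every process prints one line. a
line showing <unset> means the setting never reached that process.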

EAMP> >   to make certain that you only link the serial version of
EAMP> >   MKL with your MPI executable, please replace -lmkl_em64t
EAMP> >   in your make.sys file with
EAMP> >   -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
EAMP> 
EAMP> 
EAMP> Yes, I also tried that. The test runs in 14m2s. Using only -lmkl_em64t it
EAMP> runs in 14m31s. Using serial compilations it ran in 12m20s.

you should also compare the parallel executable
run with -np 1 against the serial executable.
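
to put the three timings quoted above in proportion, here is a small
sketch that converts them to seconds and expresses each one relative to
the serial build. the labels and helper names are only illustrative;
the numbers are the ones from the message.

```python
import re


def to_seconds(stamp):
    """Convert a wall-clock string such as '14m2s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time format: {stamp!r}")
    minutes, seconds = map(int, match.groups())
    return 60 * minutes + seconds


def slowdown_percent(stamp, reference_seconds):
    """Percentage by which `stamp` exceeds the reference time."""
    return 100.0 * (to_seconds(stamp) / reference_seconds - 1.0)


# timings exactly as quoted in the message above
timings = {
    "sequential MKL link line": "14m2s",
    "-lmkl_em64t only": "14m31s",
    "serial compilation": "12m20s",
}

if __name__ == "__main__":
    reference = to_seconds(timings["serial compilation"])
    for label, stamp in timings.items():
        pct = slowdown_percent(stamp, reference)
        print(f"{label:26s} {to_seconds(stamp):4d} s  {pct:+.1f}%")
```

with these numbers the two other builds come out roughly 14% and 18%
slower than the serial compilation.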

depending on your hardware (memory speed), and given that the
10.0 MKL has about 20% speed improvement on recent cpus, that is
quite possible. since your problem is quite large, i guess that
a lot of time is spent in the libraries.

with a single quad-core cpu you also have the maximum amount of
memory contention when running 4 individual MPI processes, whereas
using multi-threading may take better advantage of data locality
and reduce the load on the memory bus.
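
for a single quad-core there are only a few ways to divide the cores
between MPI processes and threads, so all of them can be timed. a
minimal sketch that enumerates them (the function name is made up for
illustration):

```python
def rank_thread_splits(cores):
    """List every (mpi_processes, threads_per_process) pair that fills `cores`."""
    return [
        (ranks, cores // ranks)
        for ranks in range(1, cores + 1)
        if cores % ranks == 0
    ]


if __name__ == "__main__":
    for ranks, threads in rank_thread_splits(4):
        print(f"{ranks} MPI process(es) x {threads} thread(s)")
```

for 4 cores this gives 1 x 4, 2 x 2 and 4 x 1; the first is the pure
multi-threading case and the last the pure MPI case discussed above.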

cheers,
    axel.

EAMP> 
EAMP> 
EAMP> 
EAMP> Thanks,
EAMP> Eduardo
EAMP> 

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.