[Pw_forum] openmp vs mpich performance with MKL 10.x

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Tue May 6 20:51:51 CEST 2008


On Tue, 6 May 2008, Eduardo Ariel Menendez Proupin wrote:

EAMP> Hi,
EAMP> I have noticed recently that I am able to obtain faster binaries of pw.x
EAMP> using the OpenMP parallelism implemented in the Intel MKL libraries of
EAMP> version 10.x than using MPICH, on Intel CPUs. Previously I had always
EAMP> gotten better performance using MPI. I would like to hear about other
EAMP> experiences with making these machines faster. Let me explain in more
EAMP> detail.
EAMP> 
EAMP> Compiling with MPI means using mpif90 as compiler and linker, linking
EAMP> against mkl_ia32 or mkl_em64t, and using the link flags -i-static
EAMP> -openmp. This is just what appears in make.sys after running configure
EAMP> in version 4cvs.
EAMP> 
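
(for reference, a make.sys excerpt of the kind configure generates for
such a build might look as follows; the exact flags, preprocessor
defines, and library names here are illustrative and depend on the MKL
version and installation path:)

    # illustrative make.sys fragment for an em64t MPI build with MKL
    MPIF90      = mpif90
    DFLAGS      = -D__INTEL -D__FFTW -D__MPI -D__PARA
    LDFLAGS     = -i-static -openmp
    BLAS_LIBS   = -lmkl_em64t
    LAPACK_LIBS = -lmkl_em64t
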
EAMP> At runtime, I set
EAMP> export OMP_NUM_THREADS=1
EAMP> export MKL_NUM_THREADS=1
EAMP> and run using
EAMP> mpiexec -n $NCPU pw.x <input >output
EAMP> where NCPU is the number of cores available in the system.

EAMP> 
EAMP> The second choice is
EAMP> ./configure --disable-parallel
EAMP> 
EAMP> and at runtime
EAMP> export OMP_NUM_THREADS=$NCPU
EAMP> export MKL_NUM_THREADS=$NCPU
EAMP> and run using
EAMP> pw.x <input >output
EAMP> 
EAMP> I have tested it on quad-cores (NCPU=4) and on an old dual Xeon B.C.
EAMP> (before cores) (NCPU=2).
EAMP> 
EAMP> Before April 2007, the first choice had always worked faster. After
EAMP> that, when I came to use MKL 10.x, the second choice started working
EAMP> faster. I have found no significant difference between versions 3.2.3
EAMP> and 4cvs.
EAMP> 
EAMP> A special comment concerns the FFT library. MKL has a wrapper to FFTW
EAMP> that must be compiled after installation (it is very easy). This creates
EAMP> additional libraries named like libfftw3xf_intel.a and
EAMP> libfftw2xf_intel.a. This improves the performance of the second choice,
EAMP> especially with libfftw3xf_intel.a.
EAMP> 
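(the wrapper sources ship with MKL itself; on a typical MKL 10.x
installation the build is along these lines, where the make target
depends on the architecture and MKLROOT points to the MKL install:)

    # build the fftw3 wrapper library from the MKL interface sources
    cd $MKLROOT/interfaces/fftw3xf
    make libem64t               # produces libfftw3xf_intel.a
    # the fftw2 wrapper builds the same way from interfaces/fftw2xf
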
EAMP> Using MPI, libfftw2xf_intel.a is as fast as the FFTW source distributed
EAMP> with espresso, i.e., there is no gain in using libfftw2xf_intel.a. With
EAMP> libfftw3xf_intel.a and MPI, I have never been able to run pw.x
EAMP> successfully; it just aborts.
EAMP> 
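(a possible explanation for the fftw3 failures, offered as a guess
rather than a verified diagnosis: espresso selects its FFT interface at
compile time, so an executable built with -D__FFTW uses the fftw-2
calling convention and cannot simply be handed an fftw-3 style library;
a build meant for libfftw3xf_intel.a would need the fftw-3 define in
make.sys instead, e.g.:)

    # illustrative make.sys change for an fftw3-based build
    DFLAGS   = -D__INTEL -D__FFTW3 -D__MPI -D__PARA
    FFT_LIBS = /opt/intel/mkl/lib/em64t/libfftw3xf_intel.a
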
EAMP> I would like to hear of your experiences.

eduardo,

there are two issues that need to be considered.

1) how large are your test jobs? if they are not large enough, the timings are dominated by overhead and thus pointless.

2) it is most likely that you are still being tricked by the
   auto-parallelization of the intel MKL. the exported OMP_NUM_THREADS
   setting will usually only affect the _local_ process, and with
   some MPI startup mechanisms it has no effect at all. thus your
   MPI jobs will be slowed down by running more threads than cores.
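
   (whether the environment reaches the remote processes depends on
   the MPI launcher; with an MPICH-style mpiexec the variables can
   usually be forced onto every rank explicitly, e.g.:)

       # pass the thread settings to every MPI rank, not just rank 0
       mpiexec -env OMP_NUM_THREADS 1 -env MKL_NUM_THREADS 1 \
               -n $NCPU pw.x <input >output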

   to make certain that you link only the serial version of
   MKL with your MPI executable, please replace -lmkl_em64t
   in your make.sys file with 
   -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
   you may have to add:
   -Wl,-rpath,/opt/intel/path/to/your/mkl
   to make your executable find the libraries at runtime.
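
   (concretely, the relevant make.sys lines would change along these
   lines; the rpath below is an example path that must match the
   actual MKL installation:)

       # before: single library, threaded by default in MKL 10.x
       BLAS_LIBS   = -lmkl_em64t
       # after: explicitly sequential layered MKL libraries
       BLAS_LIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
       LDFLAGS     = -i-static -Wl,-rpath,/opt/intel/mkl/lib/em64t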

   with those executables you can try again, and i would be
   _very_ surprised if using MPI were slower than serial with
   multi-threading. i have run tests with the intel FFT vs. FFTW
   in a number of plane wave codes, and the intel FFT was
   always slower.

cheers,
    axel.

EAMP> 
EAMP> Best regards
EAMP> Eduardo Menendez
EAMP> University of Chile
EAMP> 

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
