[QE-users] QE 6.4 - slower with intel fftw? how to properly benchmark

Ye Luo xw111luoye at gmail.com
Sat Mar 2 22:59:18 CET 2019


1. You are looking at the wrong numbers; compare the WALL times, not the
CPU times.
2. With 2 threads per core you are using hardware threads (HT), which
share the resources of a physical core. On a few architectures HT does
boost QE performance, but on most it makes no difference or even hurts,
because the hardware threads compete for the shared resources. The basic
strategy is simply to try it: keep HT if you gain something, otherwise
don't use it (see the sketch below).
3. OpenMP in QE is implemented with physical cores in mind. I'm not saying
you cannot run OpenMP threads on hardware threads, but any performance
difference there is mostly a hardware effect, not a software one.
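
A minimal way to make that comparison (a sketch only; it assumes Intel
MPI, a node with 30 physical cores as in your run, and an input file
named scf.in; adjust names and counts to your machine):

# MPI only on the 30 physical cores, no hardware threads
export OMP_NUM_THREADS=1
mpirun -np 30 pw.x -in scf.in > out.noHT

# same 30 ranks, 2 OpenMP threads per rank so both hardware threads
# of each core are used; pin the threads to their physical core
export OMP_NUM_THREADS=2
export OMP_PLACES=cores
export OMP_PROC_BIND=close
mpirun -np 30 pw.x -in scf.in > out.HT

# compare the WALL column (point 1), not CPU
grep "PWSCF.*WALL" out.noHT out.HT
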
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


Christoph Wolf <wolf.christoph at qns.science> wrote on Fri, Mar 1, 2019 at 4:14 AM:

> Dear all,
>
> please forgive this "beginner" question, but I am facing a weird problem.
> When compiling qe-6.4 (Intel compiler, Intel MPI + OpenMP) with or without
> Intel's FFTW libraries, I find that with OpenMP and 2 threads per core the
> Intel FFTW version is roughly "twice as slow" as the internal one:
>
> "internal"
>      General routines
>      calbec       :      2.69s CPU      2.70s WALL (     382 calls)
>      fft          :      0.47s CPU      0.47s WALL (     122 calls)
>      ffts         :      0.05s CPU      0.05s WALL (      12 calls)
>      fftw         :     49.97s CPU     50.12s WALL (   14648 calls)
>
>      Parallel routines
>
>      PWSCF        :  1m45.03s CPU     1m46.59s WALL
>
> "intel fftw"
>      General routines
>      calbec       :      6.36s CPU      3.20s WALL (     382 calls)
>      fft          :      0.93s CPU      0.47s WALL (     121 calls)
>      ffts         :      0.10s CPU      0.05s WALL (      12 calls)
>      fftw         :    109.63s CPU     55.23s WALL (   14648 calls)
>
>      Parallel routines
>
>      PWSCF        :   3m18.32s CPU   1m41.01s WALL
>
> As a benchmark I am running a perovskite with 120 k-points on 30
> processors (one node). There is no (noticeable) difference if I export
> OMP_NUM_THREADS=1 (MPI only), so I guess I made some mistake with the
> libraries during the build.
>
> Build process is as below
>
> module load intel19/compiler-19
>
> module load intel19/impi-19
>
>
> export FFT_LIBS="-L$MKLROOT/intel64"
>
> export LAPACK_LIBS="-lmkl_blacs_intelmpi_lp64"
>
> export CC=icc FC=ifort F77=ifort MPIF90=mpiifort MPICC=mpiicc
>
>
> ./configure --enable-parallel --with-scalapack=intel --enable-openmp
>
>
> This detects BLAS_LIBS, LAPACK_LIBS, SCALAPACK_LIBS and FFT_LIBS.
>
> I am not experienced with benchmarking so if my benchmark is garbage
> please suggest a suitable system!
>
> Thanks in advance!
> Chris
>
> --
> Postdoctoral Researcher
> Center for Quantum Nanoscience, Institute for Basic Science
> Ewha Womans University, Seoul, South Korea
>
> _______________________________________________
> users mailing list
> users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users
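
Regarding the build described above: one quick way to check which FFT
backend configure actually selected (a sketch; these are the macro names
used by the QE 6.x FFTXlib) is to look at make.inc after running
configure:

grep -E "^DFLAGS|^FFT_LIBS" make.inc
# -D__DFTI  : MKL's DFTI interface is used for the FFTs
# -D__FFTW3 : an external FFTW3 library (or MKL's FFTW3 wrappers)
# -D__FFTW  : QE's internal FFTW fallback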