[QE-users] Sub optimal performance on 32 core AMD machine

Tue Nov 17 19:24:04 CET 2020

Michal

I have a very similar use-case and looked into many of the same issues when
I got my Threadripper 3960X system at the beginning of the year to
supplement my old dual-Xeon setup. In the past few days I've been
revisiting compilation as I got hold of a Quadro GV100 for GPU acceleration
of my optimizations.

Basically it seems as though binaries built for Zen2 with both MPI and
OpenMP enabled either don't run at all, or perform poorly even when they
do run.
Best performance for pw.x on v6.5 (I've been playing with GIPAW and there's
no 6.6 compatible version yet) has been with a simple gcc OpenMPI
compilation without OpenMP threading and with about 20 MPI processes on my
24-core CPU. Compiling with the GCC or PGI compilers made little difference,
although only the more recent PGI compilers will have Zen2 optimization.
I get little benefit from Intel MKL over openblas/lapack/fftw3 even with
the debug tweaks, etc.
Puget Systems numbers with other programs suggest that OpenMP-only runs
perform better than OpenMPI on Threadripper, but I find the opposite with QE.
I did try disabling hyperthreading in the BIOS but that made no difference
to the performance.
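
For reference, an MPI-only launch of the kind described above can be
sketched as below. This is only an illustration, not the exact command used
here: the helper name mpi_only_launch and the input file scf.in are
hypothetical, and it assumes mpirun and pw.x are on the PATH. Setting
OMP_NUM_THREADS=1 keeps each MPI process single-threaded.

```python
import os


def mpi_only_launch(n_ranks, input_file, pw_binary="pw.x"):
    """Return (argv, env) for an MPI-only pw.x run with OpenMP threading off.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = "1"  # one thread per MPI process
    argv = ["mpirun", "-np", str(n_ranks), pw_binary, "-in", input_file]
    return argv, env


# 20 MPI processes on a 24-core CPU; "scf.in" is a placeholder file name.
argv, env = mpi_only_launch(20, "scf.in")
print(" ".join(argv))
```

The returned pair can be passed to subprocess.run(argv, env=env, check=True)
on a machine where the binaries are actually installed.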

GPU compilation really shows the issue with MPI/OpenMP clashing. With the
Xeons I could compile code with MKL that would run well on a Quadro K6000
while offloading to the CPU with MPI when needed. It could still be a
compiler issue (have to use PGI with the GPU version) but it just doesn't
work with the 3960X, and some things don't thread well with pure OpenMP
(e.g. dftd3 versus dftd2), so I'll still need to use separately compiled
versions of 6.5 for different problems.

BTW with a dual CPU system you may benefit from pinning threads to
particular CPUs - it works on the dual Xeon in any case. My Threadripper
balances the load across the cores in a pretty dynamic manner and that's on
a single socket.
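
One possible way to pin processes, assuming Open MPI's mpirun (other MPI
launchers use different options), is to ask it to bind each rank to a core
and report the result. The helper name pinned_launch and the file name
relax.in are hypothetical; this is a sketch rather than the setup used on
the dual Xeon.

```python
def pinned_launch(n_ranks, input_file, pw_binary="pw.x"):
    """Build an Open MPI command line that pins each rank to one core.

    Hypothetical helper; assumes Open MPI's mpirun is the launcher.
    """
    return [
        "mpirun", "-np", str(n_ranks),
        "--bind-to", "core",     # pin each MPI rank to a single core
        "--report-bindings",     # print the resulting bindings at startup
        pw_binary, "-in", input_file,
    ]


# "relax.in" is a placeholder input file name.
print(" ".join(pinned_launch(16, "relax.in")))
```

The bindings report makes it easy to check whether ranks ended up spread
over both sockets or crowded onto one.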

Best regards
Pam Whitfield

Independent Consultant

Message: 1
Date: Mon, 16 Nov 2020 15:19:04 +0100
From: Michal Husak <Michal.Husak at vscht.cz>
To: Quantum ESPRESSO users Forum <users at lists.quantum-espresso.org>
Subject: [QE-users] Sub optimal performance on 32 core AMD machine
Message-ID: <fe59d3a8-2ace-4f66-a5c6-e83b01387c61 at cln92.vscht.cz>
Content-Type: text/plain; charset="UTF-8"; format=flowed

I purchased a new PC with 2x 16-core AMD EPYC processors: 64 logical
cores with hyperthreading ...
I was hoping my QM programs (Quantum Espresso, CASTEP) would run faster on
the new system than on my old 4-core Intel i7 machine (8 years old) ....

To my great surprise, the opposite is almost true :-(.
My main task is scf and geometry optimization of medium-sized organic
molecular crystals (about 100 C, H, N atoms per unit cell) ...

I tried changing the OpenMPI/OpenMP setup ...
I tried the secret MKL_DEBUG_CPU_TYPE=5 parameter
(meant to work around Intel MKL compiled code running slowly on AMD) ...

Nothing helps; the best speed is obtained when I use only 4 cores
(OpenMPI or OpenMP, with similar results) ...
Using 16 or 32 cores gives almost no benefit ...
The CPU load for runs on 1/4/8/16/32 cores corresponds to the number of
CPUs set, so they are trying to do something ...
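
A scaling test like this can be summarized as speedup and parallel
efficiency per core count. The sketch below is only illustrative: the
function name scaling_table is hypothetical and the timings are placeholder
values, not measurements from this machine.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows from wall times.

    timings maps core count -> wall time in seconds. Speedup is taken
    relative to the run with the fewest cores.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup * base_cores / cores
        rows.append((cores, speedup, efficiency))
    return rows


# Placeholder wall times in seconds, NOT real measurements.
example = {1: 100.0, 4: 25.0, 8: 25.0}
for cores, speedup, efficiency in scaling_table(example):
    print(f"{cores:3d} cores  speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```

Efficiency that collapses beyond a few cores while the CPU load stays high
is consistent with a bottleneck outside the CPU itself, such as memory
access or communication.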

Any idea what I should check or try to optimize?

Maybe the bottleneck is memory access, not CPU power (I have 128 GB
of RAM, almost unused)?

Michal Husak

UCT Prague