[Pw_forum] Time dependent execution time using 32 core on single node

Edoardo Mosconi edoblasco at gmail.com
Thu Jun 22 13:47:49 CEST 2017


Dear QE developers,

I have a problem running pw.x. The architecture below belongs to our new
machine in Perugia, on which we are testing QE. This is the first time I am
using these 32-core Xeon processors.

I tested versions 6.0 and 6.1 and obtained the same problem with both.

Machine: 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz with 256 GB of RAM,
16 cores per processor, for a total of 32 cores per node.

Scientific Linux release 6.9 (Carbon)

Compiler: Intel 14.0.2 with MKL

MPI = mvapich2/1.9 or 2.1

SCALAPACK= scalapack/2.0.2 or intel

OFED 318.


Running the same input file more than 20 times on 32 cores on a single
node, I obtain inconsistent execution times.

The same behavior occurs with different inputs: large systems (1600
electrons), small systems (100 electrons), a single water molecule, a
perovskite, isolated molecules…

I tried several systems, different input files, linked different libraries,
and tested several compilation options, but sometimes the same calculation
takes 3, 4, or 5 times longer.

So the problem does not depend on the system or the input file.

Performing the same calculation with 16, 8, 4, or 1 processes on the same
machine gives perfectly consistent timings. Only runs using 32 cores on the
same node sometimes show the problem.

The problem appears only when no QE-level parallelization is used, for
example a Gamma-point calculation with -npool 1 on a single node:

mpirun_rsh -rsh -np 32 -hostfile ./HOST pw.x -i <input.file>


grep "electrons :" scatter_*

scatter_01: electrons : 81.41s CPU 83.85s WALL ( 1 calls)
scatter_02: electrons : 81.01s CPU 83.35s WALL ( 1 calls)
scatter_03: electrons : 81.05s CPU 83.50s WALL ( 1 calls)
scatter_04: electrons : 81.00s CPU 83.45s WALL ( 1 calls)
scatter_05: electrons : 81.00s CPU 83.36s WALL ( 1 calls)
scatter_06: electrons : 80.85s CPU 83.22s WALL ( 1 calls)
scatter_07: electrons : 80.60s CPU 82.95s WALL ( 1 calls)
scatter_08: electrons : 81.05s CPU 83.65s WALL ( 1 calls)
scatter_09: electrons : *134.38*s CPU 136.73s WALL ( 1 calls)
scatter_10: electrons : 82.78s CPU 85.22s WALL ( 1 calls)
scatter_11: electrons : 80.45s CPU 82.79s WALL ( 1 calls)
scatter_12: electrons : 82.62s CPU 85.09s WALL ( 1 calls)
scatter_13: electrons : 81.52s CPU 84.06s WALL ( 1 calls)
scatter_14: electrons : *360.48*s CPU 491.89s WALL ( 1 calls)
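To quantify the spread, the "electrons" lines of the timing report can be
parsed and compared; a minimal sketch in Python (using the slowest-to-fastest
ratio as the spread measure is my own choice, not a QE convention):

```python
import re

# pw.x prints a final timing report; the "electrons" line looks like:
#   electrons    :     81.41s CPU     83.85s WALL (       1 calls)
TIMING_RE = re.compile(r"electrons\s*:\s*([\d.]+)s CPU\s+([\d.]+)s WALL")

def wall_time(line):
    """Wall-clock seconds of the 'electrons' step, or None if absent."""
    m = TIMING_RE.search(line)
    return float(m.group(2)) if m else None

def spread(times):
    """Slowest-to-fastest ratio; ~1.0 means consistent runs."""
    return max(times) / min(times)

# Two of the runs reported above: the fastest one and the worst outlier.
lines = [
    "scatter_07: electrons : 80.60s CPU 82.95s WALL ( 1 calls)",
    "scatter_14: electrons : 360.48s CPU 491.89s WALL ( 1 calls)",
]
times = [wall_time(l) for l in lines]
print(times)                    # [82.95, 491.89]
print(round(spread(times), 1))  # 5.9, i.e. almost 6x slower
```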


I also tried running the same input with QE compiled only against its
internal libraries: the same problem seems to appear with a smaller spread,
but the calculation is very slow (more than 10 times slower):

PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2799.tccw.mpd: electrons :  800.04s CPU  802.52s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2800.tccw.mpd: electrons : 1223.21s CPU 1298.01s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2801.tccw.mpd: electrons :  822.31s CPU  825.85s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2802.tccw.mpd: electrons :  794.77s CPU  797.40s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2966.tccw.mpd: electrons :  824.84s CPU  827.58s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2967.tccw.mpd: electrons :  818.59s CPU  821.14s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2968.tccw.mpd: electrons :  797.59s CPU  800.17s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2974.tccw.mpd: electrons :  798.30s CPU  800.69s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2975.tccw.mpd: electrons :  827.15s CPU  829.68s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2983.tccw.mpd: electrons : 1049.27s CPU 1140.57s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2984.tccw.mpd: electrons :  895.84s CPU  899.82s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2985.tccw.mpd: electrons :  799.89s CPU  802.22s WALL ( 1 calls)
PbI.ROT-DYN.2x2.383_I_ap.scf.out_ace_2986.tccw.mpd: electrons :  796.46s CPU  799.25s WALL ( 1 calls)

When I have k-points and I use -npool 1, the problem remains.

I can avoid the problem by using -npool parallelization when it is possible
(i.e. when I have k-points): with -npool 2 or more, the execution time is
consistent (40 runs give the same execution time), e.g.

mpirun_rsh -rsh -np 32 -hostfile ./HOST pw.x -npool 2 -i <input.file>
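The consistency claim can be made quantitative with a simple check over the
recorded wall times; a minimal sketch (the 5% tolerance on the relative
spread is my own assumption, not a QE convention):

```python
import statistics

def consistent(wall_times, tolerance=0.05):
    """True if the relative spread (stdev / mean) of the run times
    is below the given tolerance."""
    mean = statistics.fmean(wall_times)
    return statistics.stdev(wall_times) / mean < tolerance

# Wall times taken from the runs reported above: the tightly clustered
# runs pass the check, the runs containing the outliers do not.
print(consistent([83.85, 83.35, 83.50, 83.45]))    # True
print(consistent([83.65, 136.73, 85.22, 491.89]))  # False
```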


So the problem appears in exactly the case I plan to compute: medium and
large systems at the Gamma point.


This sounds like something “time dependent” that occurs when using 32
processes with -npool 1. I also tried -nt (task-group) parallelization, but
it does not help.


For many years I used QE on our old cluster with 2 x Intel(R) Xeon(R) CPU
E5-2670 0 @ 2.60GHz (2 x 8 cores per machine, Intel compiler 11, Intel MKL
and ScaLAPACK), and I never encountered this problem. So the issue seems
related only to going to 32 processes per node.


Do you have any suggestions to fix this problem?


Thank you in advance,


Edoardo Mosconi

CNR-ISTM Perugia. Via Elce di Sotto, 8. 06123 Perugia (Italy)