[QE-users] Large (and seemingly random) differences between CPU and WALL time

Julien Barbaud julien_barbaud at sjtu.edu.cn
Thu Jun 6 13:21:33 CEST 2019


Dear users,

 

I am still struggling to understand the parallel performance of QE on my
university's cluster. I should say right away that this problem might have
more to do with the job scheduling on our cluster than with QE itself.
However, after many discussions with the people responsible for the cluster,
they do not see where the problem would be on their side. So I want to check
whether this could be a more common problem, and whether you have any
suggestions about it.

 

The problem in a nutshell: the performance of a pw.x run seems completely
random on our cluster. Launching the same job on the same number of procs
can result in calculation times differing by a factor of 5 or more. This is
of course a huge issue when planning how many cores to use, or just when
trying to get a clue of what is going on.

When a run is particularly slow, it shows up as a WALL time much higher than
the CPU time.

To illustrate, here is the same job run on 3, 6 and 9 cores, with the
corresponding CPU and WALL times:

Procs    CPU time     WALL time
-----    ---------    ---------
3        6m56.69s     28m33.48s    --> big difference: bad parallelization
6        4m 9.56s     4m20.65s     --> good parallelization
9        5m42s        21m13.10s    --> bad parallelization
 

The huge difference between CPU time and WALL time is an issue. But even
looking at the CPU time alone, it does not seem to scale well, as I would not
expect 9 cores to be slower than 6 (but I lack experience on this).

If I launch the job again on 6 cores right afterwards, I get something much
slower. This pattern shows up for different inputs, so it does not seem to be
directly related to the input. The example is from a vc-relax run stopped
after 4 iterations.

 

This all feels very random. Do you have any idea why this would happen?
Am I doing something wrong?

 

Here is another example, a run of 3 iterations on 3, 6 and 9 procs, repeated
twice to show the "random" variation between two runs:

 

Procs    CPU time     WALL time
-----    ---------    ---------
3        6m25.61s     16m17.82s
6        3m18.12s     7m16.88s
9        2m31.85s     6m32.46s 10s

 

 

Procs    CPU time     WALL time
-----    ---------    ---------
3        7m17.83s     22m53.90s
6        3m42.18s     3m50.74s
9        5m38.31s     9m21.52s
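
To make the run-to-run variation explicit, the WALL times of the two
repeated runs above can be compared directly. The Python sketch below is
only an illustration; the names to_seconds and spread are hypothetical, the
durations are copied from the two tables, and the stray "10s" that follows
one of the 9-proc values is ignored, which is an assumption about what that
entry was meant to say.

```python
import re

# Seconds per unit for duration strings such as "16m17.82s".
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0}

def to_seconds(stamp):
    """Convert a duration like '16m17.82s' to seconds."""
    parts = re.findall(r"([\d.]+)\s*([hms])", stamp)
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in parts)

def spread(first, second):
    """Ratio of the slower WALL time to the faster one (1.0 means identical)."""
    a, b = to_seconds(first), to_seconds(second)
    return max(a, b) / min(a, b)

# WALL times of (first run, second run) per number of procs, from the tables.
wall_times = {
    3: ("16m17.82s", "22m53.90s"),
    6: ("7m16.88s", "3m50.74s"),
    9: ("6m32.46s", "9m21.52s"),  # assumption: trailing "10s" in the post ignored
}

spreads = {procs: spread(*pair) for procs, pair in wall_times.items()}
for procs, value in spreads.items():
    print(f"{procs} procs: slower/faster WALL = {value:.2f}")
```

With these numbers the same input differs by a factor of roughly 1.4 to 1.9
in WALL time between two consecutive runs on the same number of procs.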

 

 

 

Thanks in advance, 

Julien

-------------- next part --------------
A non-text attachment was scrubbed...
Name: FAPbI2SCN.vc-relax.test.in
Type: application/octet-stream
Size: 7249 bytes
Desc: not available
URL: <http://lists.quantum-espresso.org/pipermail/users/attachments/20190606/6f044098/attachment.obj>

