[QE-users] Large (and seemingly random) differences between CPU and WALL time

Thu Jun 6 14:58:59 CEST 2019

Hello

This is strange behavior, and it does not depend on the program. There 
may be many reasons, and it is very hard to guess which one applies.

Starting from the most trivial things:

* Could some other application be using the same processors as you at 
the same time?

* You may be using a file system that is not very efficient, e.g. the 
home filesystem instead of the scratch disk, or something of the kind.

* You may be using multithreading without having enough processors for 
it. Try setting the environment variable OMP_NUM_THREADS to 1 before 
running.
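
The last check can be scripted. Below is a minimal sketch, not a definitive recipe: it runs a command with OMP_NUM_THREADS set to 1 and compares the CPU time of the child processes with the elapsed wall time. The function name `run_single_threaded` and the placeholder command are assumptions made for illustration. Note also that the CPU time measured here is summed over all child processes that were waited for, so it is not the same number as the per-process CPU time that pw.x prints, and `os.times()` only reports child times on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def run_single_threaded(cmd):
    """Run cmd with OMP_NUM_THREADS=1; return (cpu_seconds, wall_seconds)."""
    # Copy the current environment and force one OpenMP thread per process.
    env = dict(os.environ, OMP_NUM_THREADS="1")

    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)  # raises if the command fails
    wall = time.perf_counter() - start
    after = os.times()

    # User + system CPU time of the child processes that were waited for.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return cpu, wall


if __name__ == "__main__":
    # Placeholder command (hypothetical): pass your own launch line instead.
    cmd = sys.argv[1:] or [sys.executable, "-c", "sum(range(10**6))"]
    cpu, wall = run_single_threaded(cmd)
    print(f"CPU {cpu:.2f}s  WALL {wall:.2f}s  CPU/WALL {cpu / wall:.2f}")
```

To use it, pass your usual pw.x launch line as arguments to the script. If the summed CPU time stays far below the wall time multiplied by the number of processes, the processes are mostly waiting (for other jobs on the same cores, or for the disk) rather than computing.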

I hope this helps

Pietro

On 06/06/19 13:21, Julien Barbaud wrote:
>
> Dear users,
>
> I am still struggling to understand the parallel performance of QE on 
> the cluster of my university. I have to say right off the bat that 
> this problem might have more to do with the parallel scheduling in our 
> cluster. However, after many discussions with the people responsible 
> for the cluster, they don’t seem to see where the problem would be on 
> their side. So I want to check if that could be a more common problem 
> and if you would have some suggestions about it.
>
> The problem in a nutshell: the performance of a pw.x run seems 
> completely random on our cluster. Launching the same job on the same 
> number of procs can result in calculation times differing by a factor 
> of 5 or more. This is of course a huge issue in planning how many 
> cores I want to use, or just trying to have a clue of what’s going on.
>
> When the speed is particularly low, it shows up as a WALL time much 
> higher than the CPU time.
>
> To illustrate, here is the same code run on 3, 6 and 9 cores, with the 
> corresponding CPU and WALL time:
>
> Procs   CPU time    WALL time
> -----   ---------   ---------
> 3       6m56.69s    28m33.48s   -> big difference: bad parallelization
> 6       4m 9.56s    4m20.65s    -> good parallelization
> 9       5m42s       21m13.10s   -> bad parallelization
>
> The huge difference between CPU time and WALL time is an issue. But 
> even looking at the CPU time alone, it doesn’t seem to scale well, as 
> I would not expect the 9 cores to be slower than the 6 (but I lack 
> experience on this).
>
> If I launch the job again right after on 6 cores, I get something much 
> slower. This pattern shows up for different inputs, so it does not seem 
> to be related to that directly. The example is from a vc-relax run 
> stopped after 4 iterations.
>
> This all feels very random, but do you have an idea why this would 
> happen ? Am I doing something wrong ?
>
> Another example with a run on 3 iterations, for 3,6,9 procs, repeated 
> twice to show the “random” variations between 2 runs:
>
> Procs   CPU time    WALL time
> -----   ---------   ---------
> 3       6m25.61s    16m17.82s
> 6       3m18.12s    7m16.88s
> 9       2m31.85s    6m32.46s 10s
>
> Procs   CPU time    WALL time
> -----   ---------   ---------
> 3       7m17.83s    22m53.90s
> 6       3m42.18s    3m50.74s
> 9       5m38.31s    9m21.52s
>
> Thanks in advance,
>
> Julien
>
>
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users