[QE-users] Large (and seemingly random) differences between CPU and WALL time
Pietro Delugas
pdelugas at sissa.it
Thu Jun 6 14:58:59 CEST 2019
Hello
this is strange behavior, and it does not seem to depend on the program
itself. There may be many reasons and it is very hard to guess, so,
starting from the most trivial things:
* Could some other application be using the same processors as you at
the same time?
* Are you using a file system that is not very efficient, e.g. the home
filesystem instead of the scratch disk, or something of the kind?
* Are you using multithreading without enough processors to support it?
Try setting the environment variable OMP_NUM_THREADS to 1 before
running.
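
As an illustration of the last point, here is a minimal sketch that builds an MPI launch of pw.x with OMP_NUM_THREADS pinned to 1 and checks whether MPI ranks x OpenMP threads would exceed the available cores. It assumes an `mpirun`-style launcher, and the input file name `vc-relax.in` is hypothetical. On a batch system the launcher may differ, and `os.cpu_count()` reports the cores of the machine running the script, not the scheduler's allocation.

```python
import os
import shlex


def build_pw_command(nprocs, input_file, omp_threads=1):
    """Return (argv, env) for a pw.x run with a fixed OpenMP thread count.

    Assumes an mpirun-style launcher; input_file is a placeholder name.
    """
    env = dict(os.environ)
    # Pin OpenMP threads per MPI rank so ranks do not compete for cores.
    env["OMP_NUM_THREADS"] = str(omp_threads)
    argv = ["mpirun", "-np", str(nprocs), "pw.x", "-in", input_file]
    return argv, env


def oversubscribed(nprocs, omp_threads, cores=None):
    """True if MPI ranks x OpenMP threads exceeds the available cores.

    By default uses os.cpu_count(), i.e. the cores of the machine running
    this script, which may differ from what the scheduler allocated.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return nprocs * omp_threads > cores


if __name__ == "__main__":
    argv, env = build_pw_command(6, "vc-relax.in")  # hypothetical input file
    print("OMP_NUM_THREADS=" + env["OMP_NUM_THREADS"], shlex.join(argv))
    # To launch for real: subprocess.run(argv, env=env, check=True)
```

Run as a script, it prints `OMP_NUM_THREADS=1 mpirun -np 6 pw.x -in vc-relax.in`. The environment is copied rather than modified in place, so the setting only affects the launched job.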
I hope it helps
Pietro
On 06/06/19 13:21, Julien Barbaud wrote:
>
> Dear users,
>
> I am still struggling to understand the parallel performance of QE on
> the cluster of my university. I have to say right off the bat that
> this problem might have more to do with the parallel scheduling in our
> cluster. However, after many discussions with the people responsible
> for the cluster, they don’t seem to see where the problem would be on
> their side. So I want to check whether this could be a more common
> problem and whether you have any suggestions about it.
>
> The problem in a nutshell: the performance of a pw.x run seems
> completely random on our cluster. Launching the same job on the same
> number of procs can result in calculation times differing by a factor
> of 5 or more. This is of course a huge issue in planning how many
> cores I want to use, or just trying to have a clue of what’s going on.
>
> When the speed is particularly low, it seems to show up as a WALL
> time much higher than the CPU time.
>
> To exemplify, here is the same job run on 3, 6 and 9 cores, with the
> corresponding CPU and WALL time:
>
> Procs   CPU time    WALL time
> -----   ---------   ---------
> 3       6m56.69s    28m33.48s   -> big difference: bad parallelization
> 6       4m09.56s    4m20.65s    -> good parallelization
> 9       5m42s       21m13.10s   -> bad parallelization
>
> The huge difference between CPU time and WALL time is an issue. But
> even looking at the CPU time alone, it doesn’t seem to scale well, as
> I would not expect 9 cores to be slower than 6 (but I lack
> experience in this).
>
> If I launch the job again on 6 cores right afterwards, it runs much
> slower. This pattern shows up for different inputs, so it does not
> seem to be directly related to the input. The example is from a
> vc-relax run stopped after 4 iterations.
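
One way to make the comparison in the table above quantitative is to convert the timings to seconds and look at two ratios: WALL/CPU and the CPU time of the 3-process run divided by that of each run. As a rough rule of thumb, WALL/CPU close to 1 means the root process was busy for most of the run, while a much larger value suggests it spent much of the time waiting (for example on I/O or on cores shared with other work). This is a minimal sketch: the helper names are hypothetical, and it assumes durations written in the `6m56.69s` style of the table, with `5min42s` and an optional hours field accepted as an assumption.

```python
import re

# Matches durations such as "6m56.69s", "4m 9.56s" or "5min42s".
# An hours field ("1h") is also accepted, as an assumption for long runs.
_DURATION = re.compile(
    r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m(?:in)?)?\s*(?:(\d+(?:\.\d+)?)\s*s)?\s*$"
)


def to_seconds(text):
    """Convert a duration string like '28m33.48s' to seconds (float)."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


# First table above: procs -> (CPU time, WALL time).
RUNS = {
    3: ("6m56.69s", "28m33.48s"),
    6: ("4m09.56s", "4m20.65s"),
    9: ("5m42s", "21m13.10s"),
}


def summarize(runs, reference=3):
    """Return (procs, WALL/CPU, reference CPU time / CPU time) rows."""
    ref_cpu = to_seconds(runs[reference][0])
    rows = []
    for procs in sorted(runs):
        cpu, wall = (to_seconds(t) for t in runs[procs])
        rows.append((procs, wall / cpu, ref_cpu / cpu))
    return rows


if __name__ == "__main__":
    for procs, wall_over_cpu, cpu_ratio in summarize(RUNS):
        print(f"{procs} procs: WALL/CPU = {wall_over_cpu:.2f}, "
              f"CPU-time ratio vs 3 procs = {cpu_ratio:.2f}")
```

With these numbers, WALL/CPU is about 4.11 and 3.72 for the 3- and 9-process runs versus about 1.04 for the 6-process run, and the CPU-time ratios relative to 3 processes are about 1.67 (6 procs) and 1.22 (9 procs). The large WALL/CPU values are consistent with the process waiting rather than computing, which is what competing load, a slow filesystem or oversubscribed threads could cause.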
>
> This all feels very random, but do you have an idea why this would
> happen? Am I doing something wrong?
>
> Another example, a run stopped after 3 iterations on 3, 6 and 9
> procs, repeated twice to show the “random” variations between 2 runs:
>
> Run 1:
>
> Procs   CPU time    WALL time
> -----   ---------   ---------
> 3       6m25.61s    16m17.82s
> 6       3m18.12s    7m16.88s
> 9       2m31.85s    6m32.46s    10s
>
> Run 2:
>
> Procs   CPU time    WALL time
> -----   ---------   ---------
> 3       7m17.83s    22m53.90s
> 6       3m42.18s    3m50.74s
> 9       5m38.31s    9m21.52s
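
The two repetitions can be compared in a similar way. This minimal sketch computes, for each process count, the ratio between the slower and the faster of the two runs, separately for CPU and WALL time (1.0 would mean perfectly reproducible). Times are entered as (minutes, seconds) pairs copied from the tables above; the name `spread` is just illustrative.

```python
# Timings from the two repetitions above, as (minutes, seconds) pairs:
# procs -> {"cpu": [run 1, run 2], "wall": [run 1, run 2]}.
TIMINGS = {
    3: {"cpu": [(6, 25.61), (7, 17.83)], "wall": [(16, 17.82), (22, 53.90)]},
    6: {"cpu": [(3, 18.12), (3, 42.18)], "wall": [(7, 16.88), (3, 50.74)]},
    9: {"cpu": [(2, 31.85), (5, 38.31)], "wall": [(6, 32.46), (9, 21.52)]},
}


def spread(samples):
    """Ratio of the slowest to the fastest sample (1.0 = reproducible)."""
    seconds = [minutes * 60 + secs for minutes, secs in samples]
    return max(seconds) / min(seconds)


if __name__ == "__main__":
    for procs in sorted(TIMINGS):
        t = TIMINGS[procs]
        print(f"{procs} procs: CPU spread = {spread(t['cpu']):.2f}, "
              f"WALL spread = {spread(t['wall']):.2f}")
```

On these numbers the 6-process WALL time differs by a factor of about 1.89 between the two runs, and the 9-process CPU time by about 2.23, so the run-to-run variation is not confined to WALL time.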
>
> Thanks in advance,
>
> Julien
>
>
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users