[Q-e-developers] Behavior of band-parallelization in QE 6.1

Wed Jul 12 22:09:24 CEST 2017

Hi Ryan,

   Thank you for sending me this information.  Here are a few suggestions /
observations:

   1. Even with the improved scaling of QE 6.1, your system is small for
running on 8192 processors.  To put this in perspective, I currently have
some production calculations involving roughly 200-300 atoms (many of which
are heavy metals), with ecutwfc = 80.0, and I find that running on 8192 KNL
cores is fairly reasonable for my systems.  I would be surprised if your
C60 calculation scales very well beyond ~1000 cores.

   2. From your script, you appear to be running in pure MPI mode, with no
OpenMP threading.  When trying to run QE at scale it is usually a good idea
to run with some threads.  I usually use 8 threads per MPI task, but I am
using some improvements to the OpenMP threading that have not yet been
incorporated into the public release, so you might be better off with only
2 or 4.  In addition to possibly improving your timings, this should also
help push the "No plane waves found" error to larger core counts.

   3. Even if you can't scale your runs to 8192 processors, you can
generally bundle jobs (i.e., include multiple run commands in the same
script that are executed simultaneously).  If you don't know how to do
this, you should be able to find some helpful resources online.

   4. Another way you might be able to improve your timings is by reducing
ecutfock.  By default, ecutfock is 4*ecutwfc, which tends to be larger than
it really needs to be.  I find that setting ecutfock to 1.5*ecutwfc often
leads to negligible loss of accuracy.  Run some tests with your system(s),
and you may find that this works for you as well.

   5. As I said before, it is worthwhile to test different values of -ntg
(the number of task groups) when using QE 6.1.  This will have a more
pronounced effect once you turn ACE (adaptively compressed exchange) on.
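
   The arithmetic behind points 2 and 4 can be sketched in a few lines of
Python.  This is only an illustration: the function names are made up for
this example, the 8192-core total is the job size discussed in this thread,
and ecutwfc = 80.0 is borrowed from the production runs in point 1 rather
than from the C60 input.  With more threads per task, fewer MPI tasks share
the plane waves, which is presumably why threading delays the "No plane
waves found" error.

```python
def mpi_tasks(total_cores, threads_per_task):
    """Number of MPI tasks when each task runs `threads_per_task` OpenMP threads."""
    if total_cores % threads_per_task != 0:
        raise ValueError("total_cores must be a multiple of threads_per_task")
    return total_cores // threads_per_task


def ecutfock_values(ecutwfc, factor=1.5):
    """Return (default, reduced) ecutfock in Ry.

    The default is taken as 4*ecutwfc, as stated in point 4; the reduced
    value uses the suggested factor of 1.5.
    """
    return 4.0 * ecutwfc, factor * ecutwfc


if __name__ == "__main__":
    total_cores = 8192  # job size discussed in this thread
    for threads in (2, 4, 8):
        print(f"{threads} threads/task -> {mpi_tasks(total_cores, threads)} MPI tasks")

    # ecutwfc = 80.0 is an illustrative value, not the C60 setting.
    default, reduced = ecutfock_values(80.0)
    print(f"ecutfock: default {default} Ry, reduced {reduced} Ry")
```

   The reduced ecutfock value is only a starting point for convergence
tests on the actual system.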

   If you do all of the above, I think it should be possible to run your
C60 calculation in well under an hour.
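
   For the job bundling mentioned in point 3, a minimal sketch of the
general idea (start several run commands at once, then wait for all of
them) could look like the following.  The helper name run_bundled is
hypothetical, and the placeholder commands stand in for the real MPI launch
lines, which depend on the machine's launcher and scheduler and are not
shown here.

```python
import subprocess
import sys


def run_bundled(commands):
    """Start every command at once, then wait for all of them.

    `commands` is a list of argument lists.  Returns the exit codes in the
    same order as the commands.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Placeholder commands: in a real job script each entry would be the
    # site-specific launch line for one calculation.
    placeholder = [sys.executable, "-c", "pass"]
    exit_codes = run_bundled([placeholder, placeholder])
    print("exit codes:", exit_codes)
```

   How the bundled runs are assigned to separate nodes is handled by the
launcher on each machine, so the local documentation should be checked.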

Best,
Taylor

On Wed, Jul 12, 2017 at 11:50 AM, Ryan McAvoy <mcavor11 at gmail.com> wrote:

> Thank you Taylor and Paulo for responding,
>
> I apologize for not seeing your responses earlier, but I am not yet a member
> of the mailing list, so I didn't see either of them until one of my
> colleagues forwarded them to me. Taylor, to answer your questions in order,
>
>
>    1. My production runs at the moment are reasonably small, but I expect
>    them to get bigger later. The system I was testing on is C60 (attached).
>    The reason I use 8000+ processors is that I plan on running jobs on Mira
>    eventually, which has a minimum job size of 8192 processors (512 nodes),
>    and I was testing the capabilities on Cetus, which is Mira's debug
>    cluster. Time on Cetus is free, but the jobs can't be longer than an hour.
>    2. I didn't run many tests on 6.0 because my 6.0 runs didn't finish
>    in the hour allotted with 32 bandgroups and I heard 6.1 scaled better.
>    3. I don't have exact numbers, but with 32 bandgroups on Espresso 6.0 I
>    got through a couple of hybrid outer-iterations with 8192 procs in one
>    hour, and about the same number of iterations took a few hours on 320
>    cores on a local Intel machine on 6.1.
>    4. I will try it with ACE turned on then.
>
> Best,
> Ryan