Hi Ryan,

Thank you for sending me this information. Here are a few suggestions/observations:

1. Even with the improved scaling of QE 6.1, your system is small for running on 8192 processors. To put this in perspective, I currently have some production calculations involving roughly 200-300 atoms (many of which are heavy metals), with ecutwfc = 80.0, and I find that running on 8192 KNL cores is fairly reasonable for my systems. I would be surprised if your C60 calculation scales very well beyond ~1000 cores.

2. From your script, you appear to be running in pure MPI mode, with no OpenMP threading. When trying to run QE at scale it is usually a good idea to run with some threads. I usually use 8 threads per MPI task, but I am using some improvements to the OpenMP threading that have not yet been incorporated into the public release, so you might be better off with only 2 or 4. In addition to possibly improving your timings, this should also help push the "No plane waves found" error to larger core counts. (There is a rough sketch of a hybrid launch below.)

3. Even if you can't scale your runs to 8192 processors, you can generally bundle jobs (i.e., include multiple run commands in the same script that are executed simultaneously). If you don't know how to do this, you should be able to find some helpful resources online. (A generic bundling sketch is also below.)

4. Another way you might be able to improve your timings is by reducing ecutfock. By default, ecutfock is 4*ecutwfc, which tends to be larger than it really needs to be. I find that setting ecutfock to 1.5*ecutwfc often leads to negligible loss of accuracy. Run some tests with your system(s), and you may find that this works for you as well. (See the input fragment below.)

5. Like I said before, it is worthwhile to test different values of -ntg when using QE 6.1. This will have a more pronounced effect after you turn ACE on. (A quick way to scan -ntg is sketched below.)

If you do all of the above, I think it should be possible to run your C60 calculation in well under an hour.
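Regarding point 2, here is a minimal sketch of what a hybrid MPI+OpenMP launch might look like. I am using a generic mpirun and made-up rank/thread counts and file names purely for illustration; on Cetus/Mira the Cobalt launcher (runjob) and its options are different, so check the ALCF documentation for the exact syntax:

    #!/bin/bash
    # Hybrid MPI+OpenMP run: fewer MPI ranks, a few threads per rank.
    # All counts and file names below are placeholders -- tune them for your machine.

    export OMP_NUM_THREADS=4                 # start with 2 or 4 threads per MPI task

    # 2048 MPI ranks x 4 threads = 8192 cores total (illustrative only);
    # -nb 32 keeps the 32 band groups you were already using.
    mpirun -np 2048 pw.x -nb 32 -i c60.scf.in > c60.scf.out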
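For point 3, bundling usually just means launching several independent runs from the same batch script and waiting for all of them to finish. A generic sketch follows, again with a generic mpirun and placeholder directory/input names; on Cetus/Mira the individual runs would normally be placed on separate sub-blocks, which the ALCF documentation covers:

    #!/bin/bash
    # Run several independent calculations simultaneously from one job,
    # each on a slice of the allocation. Directory and input names are placeholders.

    for d in run1 run2 run3 run4; do
        ( cd "$d" && mpirun -np 2048 pw.x -i pw.in > pw.out ) &
    done
    wait    # keep the batch job alive until every backgrounded run finishes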
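For point 4, the change is a single line in the &SYSTEM namelist of your input. The cutoff values below are only an example (I don't know what ecutwfc you are using for C60); the point is simply ecutfock of roughly 1.5*ecutwfc rather than the 4*ecutwfc default:

    &SYSTEM
      ...
      ecutwfc  = 60.0   ! whatever wavefunction cutoff you already use (example value)
      ecutfock = 90.0   ! roughly 1.5*ecutwfc instead of the default 4*ecutwfc
    /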
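And for point 5, the simplest way to test -ntg is a short scan: repeat the same short run with a few values and compare the wall times that pw.x prints at the end of each output. File names and counts are again placeholders:

    #!/bin/bash
    # Quick -ntg scan: rerun the same short calculation with different task-group
    # counts and compare the final wall-clock times reported by pw.x.

    export OMP_NUM_THREADS=4
    for ntg in 1 2 4 8; do
        mpirun -np 2048 pw.x -nb 32 -ntg "$ntg" -i c60.scf.in > c60.ntg${ntg}.out
    done
    grep -H "PWSCF.*WALL" c60.ntg*.out    # final timing line of each run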
Best,
Taylor

On Wed, Jul 12, 2017 at 11:50 AM, Ryan McAvoy <mcavor11@gmail.com> wrote:

> Thank you Taylor and Paulo for responding,
>
> I apologize for not seeing your responses earlier, but I am not part of the mailing list yet, so I didn't see either of them until one of my colleagues forwarded them to me. Taylor, to answer your questions in order:
>
> 1. My production runs at the moment are reasonably small, but I expect them to get bigger later. The system I was testing on is C60 (attached). The reason I use 8000+ processors is that I plan on running jobs on Mira eventually, which has a minimum job size of 8192 processors (512 nodes), and I was testing the capabilities on Cetus, Mira's debug cluster. Time on Cetus is free, but the jobs can't be longer than an hour.
> 2. I didn't run many tests on 6.0 because my 6.0 runs didn't finish in the hour allotted with 32 band groups, and I heard 6.1 scaled better.
> 3. I don't have exact numbers, but with 32 band groups on Espresso 6.0 I got through a couple of hybrid outer iterations on 8192 procs in one hour, and about the same number of iterations took a few hours on 320 cores of a local Intel machine with 6.1.
> 4. I will try it with ACE turned on then.
>
> Best,
> Ryan