[QE-users] MD runs out of memory with increasing number of cores

Wed Jun 23 13:56:00 CEST 2021

On Wed, Jun 23, 2021 at 1:31 PM Lenz Fiedler <fiedler.lenz at gmail.com> wrote:

> (and the increase in number of processes is most likely the reason for the
> error, if I understand you correctly?)
>

Not exactly. Too many processes may result in too much global memory usage,
because some arrays are replicated on each process. If you exceed the
globally available memory, the code will crash. BUT: it will do so during the
first MD step, not after 2000 MD steps. The memory usage should not
increase with the number of MD time steps. If it does, there is a memory
leak, either in the code or somewhere else (libraries etc.).
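
To make the two effects concrete, here is a minimal Python sketch (not part
of QE, and not how pw.x computes its estimates). The function names and all
sizes except the two figures quoted from the 8-node output further down the
thread are hypothetical. It shows that total memory grows with the number of
processes when some arrays are replicated, and that a leak shows up as a
clearly positive slope of memory versus MD step.

```python
# Illustrative model only: apart from the two numbers quoted from the
# thread, all sizes below are hypothetical, not measured QE values.


def total_memory_gb(n_procs, replicated_gb, distributed_gb):
    """Global memory when some arrays are replicated on every process.

    replicated_gb:  memory held in full by each process (hypothetical)
    distributed_gb: memory split across all processes (hypothetical)
    """
    return n_procs * replicated_gb + distributed_gb


def growth_per_step(steps, mem_mb):
    """Least-squares slope of memory (MB) versus MD step.

    A slope close to zero is the expected behavior; a clearly positive
    slope over many steps would point to a leak.
    """
    n = len(steps)
    mean_s = sum(steps) / n
    mean_m = sum(mem_mb) / n
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(steps, mem_mb))
    var = sum((s - mean_s) ** 2 for s in steps)
    return cov / var


# Replication: the total grows with the number of MPI processes.
for n in (48, 96, 144, 192):
    print(n, total_memory_gb(n, replicated_gb=0.1, distributed_gb=5.0))

# Consistency check on the figures quoted in the thread:
# 8 nodes x 24 cores = 192 processes at 140.54 MB each.
print(round(192 * 140.54 / 1024, 2))  # 26.35 (GB)

# Made-up per-process memory readings (MB) at a few MD steps.
steps = [0, 500, 1000, 1500, 2000]
flat = [140.0, 140.2, 139.9, 140.1, 140.0]
leaking = [140.0, 640.0, 1140.0, 1640.0, 2140.0]
print(growth_per_step(steps, flat))     # close to zero
print(growth_per_step(steps, leaking))  # about 1 MB per step
```

In practice the readings would come from whatever memory monitoring the
cluster provides; the point is only that a crash caused by replication
appears at the first step, while a leak grows steadily with the step count.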

Paolo

> this is not a problem. For my Beryllium calculation it is more problematic
> since the 144 processors case really gives the best performance (I have
> uploaded a file called performance_Be128.png to show my timing results),
> but I still run out of memory after 2700 time steps. Although this is also
> manageable, since I can always restart the calculation and perform another
> 2700 time steps. With this I was able to perform 10,000 time steps in just
> over a day. I am running more calculations on larger Be and Fe cells and I
> will investigate this behavior there.
>
> I have also used the "gamma" option for the K-points to use the
> performance benefits you outlined. For the Fe128 cell, I achieved optimal
> performance with 144 processors and using the "gamma" option (resulting in
> about 90s per SCF cycle). I am still not within my personal target of ~30s
> per SCF cycle but I will start looking into the choice of my PSP and cutoff
> (along with considering OpenMP and task group parallelization) rather than
> blindly throwing more and more processors at the problem.
>
> Kind regards
> Lenz
>
> PhD Student (HZDR / CASUS)
>
>
> On Sat, Jun 19, 2021 at 9:25 AM Paolo Giannozzi <
> p.giannozzi at gmail.com> wrote:
>
>> I tried your Fe job on a 36-core machine (with Gamma point to save time
>> and memory) and found no evidence of memory leaks after more than 100 steps.
>>
>>> The best performance I was able to achieve so far was with 144 cores
>>> defaulting to -nb 144, so am I correct to assume that I should try e.g. -nb
>>> 144 -ntg 2 for 288 cores?
>>>
>>
>> You should not use option -nb except in some rather special cases.
>>
>> Paolo
>>
>>
>>> PhD Student (HZDR / CASUS)
>>>
>>> On Wed, Jun 16, 2021 at 7:33 AM Paolo Giannozzi <
>>> p.giannozzi at gmail.com> wrote:
>>>
>>>> Hard to say without knowing exactly what goes out of which memory
>>>> limits. Note that not all arrays are distributed across processors, so a
>>>> considerable number of arrays are replicated on all processes. As a
>>>> consequence, the total amount of required memory will increase with the
>>>> number of MPI processes. Also note that a 128-atom cell is not "large" and
>>>> 144 cores are not "a small number of processors". You will not get any
>>>> advantage by just increasing the number of processors any more, quite the
>>>> opposite. If you have too many idle cores, you should consider
>>>> - "task group" parallelization (option -ntg)
>>>> - MPI+OpenMP parallelization (configure --enable-openmp)
>>>> Please also note that ecutwfc=80 Ry is a rather large cutoff for a USPP
>>>> (while ecutrho=320 is fine) and that running with K_POINTS Gamma instead of
>>>> 1 1 1 0 0 0 will be faster and take less memory.
>>>>
>>>> Paolo
>>>>
>>>> On Mon, Jun 14, 2021 at 4:22 PM Lenz Fiedler <fiedler.lenz at gmail.com>
>>>> wrote:
>>>>
>>>>> Dear users,
>>>>>
>>>>> I am trying to perform a MD simulation for a large cell (128 Fe atoms,
>>>>> gamma point) using pw.x and I get a strange scaling behavior. To test the
>>>>> performance I ran the same MD simulation with an increasing number of nodes
>>>>> (2, 4, 6, 8, etc.) using 24 cores per node. The simulation is successful
>>>>> when using 2, 4, and 6 nodes, i.e. 48, 96, and 144 cores, respectively
>>>>> (albeit slowly, which I expected for such a small number of processors).
>>>>> Going to 8 and more nodes, I run into an out-of-memory error after
>>>>> about two time steps.
>>>>> I am a little bit confused as to what could be the reason. Since a
>>>>> smaller number of cores works, I would expect a higher number of cores
>>>>> to run without an OOM error as well.
>>>>> The 8-node run explicitly outputs at the beginning:
>>>>> "     Estimated max dynamical RAM per process >     140.54 MB
>>>>>       Estimated total dynamical RAM >      26.35 GB
>>>>> "
>>>>>
>>>>> which is well within the 2.5 GB I have allocated for each core.
>>>>> I am obviously doing something wrong, could anyone point to what it is?
>>>>> The input files for a 6 and 8 node run can be found here:
>>>>> https://drive.google.com/drive/folders/1kro3ooa2OngvddB8RL-6Iyvdc07xADNJ?usp=sharing
>>>>> I am using QE6.6.
>>>>>
>>>>> Kind regards
>>>>> Lenz
>>>>>
>>>>> PhD Student (HZDR / CASUS)
>>>>> _______________________________________________
>>>>> Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
>>>>> users mailing list users at lists.quantum-espresso.org
>>>>> https://lists.quantum-espresso.org/mailman/listinfo/users
>>>>
>>>>
>>>>
>>>> --
>>>> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
>>>> Univ. Udine, via delle Scienze 206, 33100 Udine, Italy
>>>> Phone +39-0432-558216, fax +39-0432-558222
>>>>

-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222