[QE-users] MD runs out of memory with increasing number of cores
Paolo Giannozzi
p.giannozzi at gmail.com
Thu Jun 24 09:33:38 CEST 2021
It's not easy. There is a trick (see dev-tools/mem_counter) to track the
allocated memory, but it requires some recompilation (possibly after some
tweaking) and it reports only memory allocated with Fortran "allocate".
And of course, I have tried it and it does not show anything suspicious.
Otherwise: monitor memory usage with "memstat".
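As a complement to that, a minimal external check (a sketch, not a QE
tool; it assumes a Linux compute node on which pw.x is already running
and the standard procps "ps" command) is to sample the resident memory
of the pw.x tasks at regular intervals and watch whether it keeps
growing from one MD step to the next:

    # print a timestamp and the resident set size (kB) of every pw.x task
    while true; do
        date +%H:%M:%S
        ps -C pw.x -o pid=,rss=
        sleep 60
    done

A resident size that keeps climbing at fixed input suggests a leak; one
that stays flat but simply exceeds the node's limit points instead to
replicated arrays.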
Paolo
On Wed, Jun 23, 2021 at 9:23 PM Lenz Fiedler <fiedler.lenz at gmail.com> wrote:
> Dear Prof. Giannozzi,
>
> ah, I understand, that makes sense. Do you have any advice on how best
> to track down such a memory leak in this case? The behavior is
> reproducible with my setup.
>
> Kind regards
> Lenz
>
>
>
> Am Mi., 23. Juni 2021 um 14:02 Uhr schrieb Paolo Giannozzi <
> p.giannozzi at gmail.com>:
>
>> On Wed, Jun 23, 2021 at 1:31 PM Lenz Fiedler <fiedler.lenz at gmail.com>
>> wrote:
>>
>> (and the increase in number of processes is most likely the reason for
>>> the error, if I understand you correctly?)
>>>
>>
>> not exactly. Too many processes may result in too much global memory
>> usage, because some arrays are replicated on each process. If you exceed
>> the globally available memory, the code will crash. BUT: it will do so
>> during the first MD step, not after 2000 MD steps. The memory usage
>> should not increase with the number of MD time steps. If it does, there
>> is a memory leak, either in the code or somewhere else (libraries, etc.).
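As a rough way to see the first point (a sketch; the symbols are not
taken from the thread): if each of N MPI tasks holds r MB of replicated
arrays plus its 1/N share of D MB of distributed data, the whole job
needs about

    N * r + D   (MB)

so the total grows linearly with N even though each task's own
footprint, r + D/N, shrinks.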
>>
>> Paolo
>>
>>> this is not a problem. For my Beryllium calculation it is more
>>> problematic, since the 144-processor case really gives the best
>>> performance (I have uploaded a file called performance_Be128.png to show
>>> my timing results), but I still run out of memory after 2700 time steps.
>>> This is also manageable, though, since I can always restart the
>>> calculation and perform another 2700 time steps. In this way I was able
>>> to perform 10,000 time steps in just over a day. I am running more
>>> calculations on larger Be and Fe cells and will investigate this
>>> behavior there.
>>>
>>> I have also used the "gamma" option for the k-points to take advantage
>>> of the performance benefits you outlined. For the Fe128 cell, I achieved
>>> the best performance with 144 processors and the "gamma" option
>>> (resulting in about 90 s per SCF cycle). I am still not within my
>>> personal target of ~30 s per SCF cycle, but I will start looking into
>>> the choice of my PSP and cutoff (along with considering OpenMP and
>>> task-group parallelization) rather than blindly throwing more and more
>>> processors at the problem.
>>>
>>> Kind regards
>>> Lenz
>>>
>>> PhD Student (HZDR / CASUS)
>>>
>>>
>>> Am Sa., 19. Juni 2021 um 09:25 Uhr schrieb Paolo Giannozzi <
>>> p.giannozzi at gmail.com>:
>>>
>>>> I tried your Fe job on a 36-core machine (with Gamma point to save time
>>>> and memory) and found no evidence of memory leaks after more than 100 steps.
>>>>
>>>> The best performance I was able to achieve so far was with 144 cores
>>>>> defaulting to -nb 144, so am I correct to assume that I should try e.g. -nb
>>>>> 144 -ntg 2 for 288 cores?
>>>>>
>>>>
>>>> You should not use option -nb except in some rather special cases.
>>>>
>>>> Paolo
>>>>
>>>>
>>>> PhD Student (HZDR / CASUS)
>>>>>
>>>>> Am Mi., 16. Juni 2021 um 07:33 Uhr schrieb Paolo Giannozzi <
>>>>> p.giannozzi at gmail.com>:
>>>>>
>>>>>> Hard to say without knowing exactly what exceeds which memory limit.
>>>>>> Note that not all arrays are distributed across processors, so a
>>>>>> considerable number of arrays are replicated on all processes. As a
>>>>>> consequence, the total amount of required memory will increase with the
>>>>>> number of MPI processes. Also note that a 128-atom cell is not "large"
>>>>>> and 144 cores are not "a small number of processors". You will not gain
>>>>>> anything by just increasing the number of processors further, quite the
>>>>>> opposite. If you have too many idle cores, you should consider
>>>>>> - "task group" parallelization (option -ntg)
>>>>>> - MPI+OpenMP parallelization (configure --enable-openmp)
>>>>>> Please also note that ecutwfc=80 Ry is a rather large cutoff for a
>>>>>> USPP (while ecutrho=320 is fine) and that running with K_POINTS Gamma
>>>>>> instead of 1 1 1 0 0 0 will be faster and take less memory (both points
>>>>>> are sketched just below).
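A minimal sketch of those two suggestions, assuming a hypothetical input
file Fe128.in and the 144-task run discussed above: in the input, replace

    K_POINTS automatic
      1 1 1 0 0 0

with

    K_POINTS gamma

and, for a build configured with --enable-openmp, launch with task groups
and a few OpenMP threads per MPI task, e.g.

    export OMP_NUM_THREADS=2
    mpirun -np 144 pw.x -ntg 2 -inp Fe128.in > Fe128.out

The thread and task-group counts here are placeholders to be tuned on the
actual machine.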
>>>>>>
>>>>>> Paolo
>>>>>>
>>>>>> On Mon, Jun 14, 2021 at 4:22 PM Lenz Fiedler <fiedler.lenz at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear users,
>>>>>>>
>>>>>>> I am trying to perform an MD simulation for a large cell (128 Fe
>>>>>>> atoms, gamma point) using pw.x and I get a strange scaling behavior.
>>>>>>> To test the performance I ran the same MD simulation with an
>>>>>>> increasing number of nodes (2, 4, 6, 8, etc.) using 24 cores per node.
>>>>>>> The simulation is successful when using 2, 4, and 6 nodes, i.e. 48, 96
>>>>>>> and 144 cores respectively (albeit slow, which is within my
>>>>>>> expectations for such a small number of processors).
>>>>>>> Going to 8 or more nodes, I run into an out-of-memory error after
>>>>>>> about two time steps.
>>>>>>> I am a little bit confused as to what the reason could be. Since a
>>>>>>> smaller number of cores works, I would have expected a higher number
>>>>>>> of cores to run without an OOM error as well.
>>>>>>> The 8-node run explicitly prints at the beginning:
>>>>>>> " Estimated max dynamical RAM per process > 140.54 MB
>>>>>>> Estimated total dynamical RAM > 26.35 GB
>>>>>>> "
>>>>>>>
>>>>>>> which is well within the 2.5 GB I have allocated for each core.
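(As a quick cross-check of these numbers: 8 nodes x 24 cores give 192 MPI
tasks, and 192 x 140.54 MB is indeed the quoted 26.35 GB; per 24-core
node that is only about 3.3 GB of estimated RAM against 24 x 2.5 GB =
60 GB allocated, so whatever exceeds the limit must be memory that this
initial estimate does not capture.)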
>>>>>>> I am obviously doing something wrong; could anyone point out what
>>>>>>> it is?
>>>>>>> The input files for a 6 and 8 node run can be found here:
>>>>>>> https://drive.google.com/drive/folders/1kro3ooa2OngvddB8RL-6Iyvdc07xADNJ?usp=sharing
>>>>>>> I am using QE6.6.
>>>>>>>
>>>>>>> Kind regards
>>>>>>> Lenz
>>>>>>>
>>>>>>> PhD Student (HZDR / CASUS)
>>>>>>
>>>>
>>
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222