[QE-users] DFPT getting stuck [MPI_ERR_TRUNCATE]

Lorenzo Paulatto paulatz at gmail.com
Wed May 20 10:54:20 CEST 2020


You are right, in the sense that the code now just writes
"suboptimal parallelization: some nodes have no k-points".
I'm quite sure I remember the code stopping when it was run with more 
pools than k-points. Was this changed recently, Paolo?
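
As a rough illustration of the pools-versus-k-points check discussed in this thread, here is a minimal Python sketch that lists pool counts leaving no pool without k-points. It is not taken from the QE sources: the function names `pool_options` and `balanced_pools` are hypothetical, and it assumes the process count must be a multiple of the pool count and that the k-point count is the one actually used by the run (after symmetry reduction).

```python
def pool_options(n_procs, n_kpoints):
    """List (n_pools, procs_per_pool, max_kpoints_per_pool) with no empty pool.

    Assumption: every pool must hold the same number of processes, so only
    pool counts that divide n_procs are considered.
    """
    options = []
    # Never use more pools than k-points, otherwise some pools sit idle.
    for n_pools in range(1, min(n_procs, n_kpoints) + 1):
        if n_procs % n_pools:
            continue
        procs_per_pool = n_procs // n_pools
        max_k = -(-n_kpoints // n_pools)  # ceiling division
        options.append((n_pools, procs_per_pool, max_k))
    return options


def balanced_pools(n_procs, n_kpoints):
    """Largest pool count that divides both the process and k-point counts."""
    return max(n for n, _, _ in pool_options(n_procs, n_kpoints)
               if n_kpoints % n == 0)


if __name__ == "__main__":
    # The case from this thread: 20 k-points on 32 processors.
    for n_pools, procs, max_k in pool_options(32, 20):
        print(f"-nk {n_pools}: {procs} processes per pool, "
              f"at most {max_k} k-points per pool")
    print("balanced choice: -nk", balanced_pools(32, 20))
```

Picking the largest evenly dividing pool count is only one heuristic. Memory per pool and the cost of plane-wave parallelization inside each pool may favour a different choice.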


On 5/20/20 10:34 AM, M.J. Hutcheon wrote:
> Dear Lorenzo,
> 
>> I'm quite sure that the pw code stops if you try to run with more 
>> pools than k-points !
>>
> This doesn't seem to be the case? I ran a vc-relax and an scf (attached 
> output) with these (terrible) parallelism settings, and they ran just fine.
> 
> Best,
> 
> Michael
> 
> 
> On 2020-05-20 09:25, Lorenzo Paulatto wrote:
> 
>>> While I am quite sure that such a wasteful parallelization works 
>>> anyway for the self-consistent code,
>>
>> I'm quite sure that the pw code stops if you try to run with more 
>> pools than k-points !
>>
>>> I am not equally sure it will for the phonon code. 
>>
>> If the ph code does not stop in this case, I'm confident it will not 
>> work properly!
>>
>> cheers
>>
>>> It presumably isn't difficult to fix, but I would move to a more 
>>> sensible parallelization. For 20 k-points and 32 processors, I would 
>>> try 4 pools of 8 processors (mpirun -np 32 ph.x -nk 4 ...)
>>> Paolo
>>>
>>> On Tue, May 19, 2020 at 2:12 PM M.J. Hutcheon <mjh261 at cam.ac.uk> wrote:
>>>
>>>     Dear QE users/developers,
>>>
>>>     Following from the previous request, I've changed to a newer MPI
>>>     library which gives a little more error information, specifically it
>>>     does now crash with the following message:
>>>
>>>     An error occurred in MPI_Allreduce
>>>     reported by process [1564540929,0]
>>>     on communicator MPI COMMUNICATOR 6 SPLIT FROM 3
>>>     MPI_ERR_TRUNCATE: message truncated
>>>     MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>     and potentially your MPI job)
>>>
>>>     It appears that this is thrown at the end of a self-consistent DFPT
>>>     calculation (see the attached output file - it appears the final
>>>     iteration has converged). I'm using the development version of QE,
>>>     so I suspect that the error arises from somewhere inside
>>> https://gitlab.com/QEF/q-e/-/blob/develop/PHonon/PH/solve_linter.f90.
>>>
>>>     I don't really know how to debug/workaround this further, any
>>>     ideas/suggestions would be most welcome.
>>>
>>>     Best,
>>>
>>>     Michael Hutcheon
>>>
>>>     TCM group, University of Cambridge
>>>
>>>
>>>
>>>     On 2020-05-12 13:29, M.J. Hutcheon wrote:
>>>
>>>>     Dear QE users/developers,
>>>>
>>>>     I am running an electron-phonon coupling calculation at the gamma
>>>>     point for a large unit cell Calcium-Hydride (Output file
>>>>     attached). The calculation appears to get stuck during the DFPT
>>>>     stage. It does not crash, or produce any error files/output of any
>>>>     sort, or run out of walltime, but the calculation does not
>>>>     progress either. I have tried different parameter sets (k-point
>>>>     grids + cutoffs), which changes the representation where the
>>>>     calculation gets stuck, but it still gets stuck. I don't really
>>>>     know what to try next, short of compiling QE in debug mode and
>>>>     running under a debugger to see where it gets stuck. Any ideas
>>>>     before I head down this laborious route?
>>>>
>>>>     Many thanks,
>>>>
>>>>     Michael Hutcheon
>>>>
>>>>     TCM group, University of Cambridge
>>>>
>>>
>>>     _______________________________________________
>>>     Quantum ESPRESSO is supported by MaX
>>>     (www.max-centre.eu/quantum-espresso)
>>>     users mailing list users at lists.quantum-espresso.org
>>>     https://lists.quantum-espresso.org/mailman/listinfo/users
>>>
>>>
>>>
>>> -- Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
>>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>>> Phone +39-0432-558216, fax +39-0432-558222
>>>
>>>
> 

-- 
Lorenzo Paulatto - Paris

