[QE-users] DFPT getting stuck [MPI_ERR_TRUNCATE]

Wed May 20 10:02:56 CEST 2020

Dear Paolo, 

Thank you very much for your response, that was indeed what was causing
the error. 
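
For anyone finding this thread later, here is a rough sketch of the arithmetic behind the suggestion quoted below. It assumes the original run used one pool per process (-nk 32), that k-points are spread as evenly as possible over pools, and that the number of processes must be a multiple of the number of pools. The function names are hypothetical and this is not the distribution routine QE itself uses.

```python
def kpoints_per_pool(n_kpoints, n_pools):
    """Spread k-points as evenly as possible over pools (illustrative only).

    The first `extra` pools receive one k-point more than the rest.
    """
    base, extra = divmod(n_kpoints, n_pools)
    return [base + (1 if pool < extra else 0) for pool in range(n_pools)]


def summarize(n_procs, n_kpoints, n_pools):
    """Report processes per pool, k-points per pool, and pools left with no work."""
    if n_procs % n_pools != 0:
        # Assumption: processes are split into equally sized pools.
        raise ValueError("n_procs must be a multiple of n_pools")
    counts = kpoints_per_pool(n_kpoints, n_pools)
    return {
        "procs_per_pool": n_procs // n_pools,
        "kpoints_per_pool": counts,
        "idle_pools": counts.count(0),
    }


# Assumed original setup: 32 processes, 20 k-points, 32 pools.
original = summarize(n_procs=32, n_kpoints=20, n_pools=32)
# Suggested setup from the quoted reply: 4 pools of 8 processes.
suggested = summarize(n_procs=32, n_kpoints=20, n_pools=4)

print("original :", original["idle_pools"], "idle pools")
print("suggested:", suggested["kpoints_per_pool"], "k-points per pool,",
      suggested["procs_per_pool"], "processes per pool")
```

With 4 pools every pool has k-points to work on, so no process sits out the loops over k-points.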

During my investigation, I noticed that an error check in
PHonon/PH/solve_linter.f90 should have been firing, but wasn't; I've
submitted a fix on GitLab
https://gitlab.com/QEF/q-e/-/merge_requests/941/diffs. 

Best, 

Michael 

On 2020-05-19 17:44, Paolo Giannozzi wrote:

> If I understand correctly, you are parallelizing over k points with 32 processors, but you have just 20 k points. As a consequence, in all loops over k-points, 12 processors will do nothing. While I am quite sure that such a wasteful parallelization works anyway for the self-consistent code, I am not equally sure it will for the phonon code. It presumably isn't difficult to fix, but I would move to a more sensible parallelization. For 20 k points and 32 processors, I would try 4 pools of 8 processors (mpirun -np 32 ph.x -nk 4 ...) 
> Paolo 
> 
> On Tue, May 19, 2020 at 2:12 PM M.J. Hutcheon <mjh261 at cam.ac.uk> wrote: 
> 
> Dear QE users/developers, 
> 
> Following from the previous request, I've changed to a newer MPI library which gives a little more error information, specifically it does now crash with the following message: 
> 
> An error occurred in MPI_Allreduce
> reported by process [1564540929,0]
> on communicator MPI COMMUNICATOR 6 SPLIT FROM 3
> MPI_ERR_TRUNCATE: message truncated
> MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, and potentially your MPI job) 
> 
> It appears that this is thrown at the end of a self-consistent DFPT calculation (see the attached output file - it appears the final iteration has converged). I'm using the development version of QE, so I suspect that the error arises from somewhere inside https://gitlab.com/QEF/q-e/-/blob/develop/PHonon/PH/solve_linter.f90. 
> 
> I don't really know how to debug/workaround this further, any ideas/suggestions would be most welcome. 
> 
> Best, 
> 
> Michael Hutcheon 
> 
> TCM group, University of Cambridge 
> 
> On 2020-05-12 13:29, M.J. Hutcheon wrote: 
> 
> Dear QE users/developers, 
> 
> I am running an electron-phonon coupling calculation at the gamma point for a large unit cell Calcium-Hydride (Output file attached). The calculation appears to get stuck during the DFPT stage. It does not crash, or produce any error files/output of any sort, or run out of walltime, but the calculation does not progress either. I have tried different parameter sets (k-point grids + cutoffs), which changes the representation where the calculation gets stuck, but it still gets stuck. I don't really know what to try next, short of compiling QE in debug mode and running under a debugger to see where it gets stuck. Any ideas before I head down this laborious route? 
> 
> Many thanks, 
> 
> Michael Hutcheon 
> 
> TCM group, University of Cambridge 
> 
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso [1])
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users

-- 

Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222


Links:
------
[1] http://www.max-centre.eu/quantum-espresso