[QE-users] DFPT getting stuck [MPI_ERR_TRUNCATE]

Paolo Giannozzi p.giannozzi at gmail.com
Wed May 20 11:07:25 CEST 2020


Yes, I changed it recently:

https://gitlab.com/QEF/q-e/-/commit/334e70c7c6c61f5a16fc5d9027fed52bcf0ffdcf
I was fed up with automated tests crashing if k-point parallelization was
used.

Paolo

On Wed, May 20, 2020 at 10:54 AM Lorenzo Paulatto <paulatz at gmail.com> wrote:

> You are right, in the sense that now the code just writes
> "suboptimal parallelization: some nodes have no k-points".
> I'm quite sure I remember the code stopping when it was run with more
> pools than k-points; was this changed recently, Paolo?
>
>
> On 5/20/20 10:34 AM, M.J. Hutcheon wrote:
> > Dear Lorenzo,
> >
> >> I'm quite sure that the pw code stops if you try to run with more
> >> pools than k-points!
> >>
> > This doesn't seem to be the case? I ran a vc-relax and an scf (attached
> > output) with these (terrible) parallelism settings, and they ran just
> > fine.
> >
> > Best,
> >
> > Michael
> >
> >
> > On 2020-05-20 09:25, Lorenzo Paulatto wrote:
> >
> >>> While I am quite sure that such a wasteful parallelization works
> >>> anyway for the self-consistent code,
> >>
> >> I'm quite sure that the pw code stops if you try to run with more
> >> pools than k-points!
> >>
> >>> I am not equally sure it will for the phonon code.
> >>
> >> If the ph code does not stop in this case, I'm confident it will not
> >> work properly!
> >>
> >> cheers
> >>
> >>> It presumably isn't difficult to fix, but I would move to a more
> >>> sensible parallelization. For 20 k-points and 32 processors, I would
> >>> try 4 pools of 8 processors (mpirun -np 32 ph.x -nk 4 ...)
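> >>> For example, a minimal command along these lines (the input and output
> >>> file names here are only placeholders):
> >>>
> >>>     # 32 MPI tasks split into 4 k-point pools of 8 tasks each;
> >>>     # -nk should not exceed the number of k-points (20 here)
> >>>     mpirun -np 32 ph.x -nk 4 < ph.in > ph.out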
> >>> Paolo
> >>>
> >>> On Tue, May 19, 2020 at 2:12 PM M.J. Hutcheon <mjh261 at cam.ac.uk>
> >>> wrote:
> >>>
> >>>     Dear QE users/developers,
> >>>
> >>>     Following on from the previous request, I've changed to a newer MPI
> >>>     library, which gives a little more error information; specifically, it
> >>>     does now crash with the following message:
> >>>
> >>>     An error occurred in MPI_Allreduce
> >>>     reported by process [1564540929,0]
> >>>     on communicator MPI COMMUNICATOR 6 SPLIT FROM 3
> >>>     MPI_ERR_TRUNCATE: message truncated
> >>>     MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> >>>     and potentially your MPI job)
> >>>
> >>>     It appears that this error is thrown at the end of a self-consistent
> >>>     DFPT calculation (see the attached output file; the final iteration
> >>>     appears to have converged). I'm using the development version of QE,
> >>>     so I suspect that the error arises from somewhere inside
> >>> https://gitlab.com/QEF/q-e/-/blob/develop/PHonon/PH/solve_linter.f90.
> >>>
> >>>     I don't really know how to debug or work around this further; any
> >>>     ideas/suggestions would be most welcome.
> >>>
> >>>     Best,
> >>>
> >>>     Michael Hutcheon
> >>>
> >>>     TCM group, University of Cambridge
> >>>
> >>>
> >>>
> >>>     On 2020-05-12 13:29, M.J. Hutcheon wrote:
> >>>
> >>>>     Dear QE users/developers,
> >>>>
> >>>>     I am running an electron-phonon coupling calculation at the gamma
> >>>>     point for a large calcium hydride unit cell (output file
> >>>>     attached). The calculation appears to get stuck during the DFPT
> >>>>     stage. It does not crash, or produce any error files/output of any
> >>>>     sort, or run out of walltime, but the calculation does not
> >>>>     progress either. I have tried different parameter sets (k-point
> >>>>     grids + cutoffs), which changes the representation at which the
> >>>>     calculation gets stuck, but it still gets stuck. I don't really
> >>>>     know what to try next, short of compiling QE in debug mode and
> >>>>     running under a debugger to see where it gets stuck. Any ideas
> >>>>     before I head down this laborious route?
> >>>>
> >>>>     Many thanks,
> >>>>
> >>>>     Michael Hutcheon
> >>>>
> >>>>     TCM group, University of Cambridge
> >>>>
> >>>
> >>> -- Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> >>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> >>> Phone +39-0432-558216, fax +39-0432-558222
> >>>
> >
>
> --
> Lorenzo Paulatto - Paris
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users



-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222