[QE-users] MPI error in pw.x

Paolo Giannozzi p.giannozzi at gmail.com
Mon Dec 10 16:06:16 CET 2018


Hi Axel, such a bug was present some time ago, but it has been fixed since
(I think).

I cannot reproduce the problem with an old Intel compiler and OpenMPI.

Paolo

On Mon, Dec 10, 2018 at 1:16 PM Axel Kohlmeyer <akohlmey at gmail.com> wrote:

> A careful look at the error message reveals that you are running out
> of space for MPI communicators, for which a fixed maximum number (16k)
> seems to be allowed.
> This hints at a problem somewhere: communicators are generated
> with MPI_Comm_split() and are not properly freed afterwards.
>
> axel.
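[Editorial note: the leak Axel describes can be sketched with a small toy model. This is plain Python, not MPI code; `run_iterations` and the per-iteration leak rate are invented for illustration, and only the pool size 16384 comes from the error message. It shows why a leaky scheme dies after a handful of SCF iterations while freeing communicators keeps usage bounded:]

```python
# Toy model (NOT real MPI code) of the fixed context-id pool described in the
# error message: each communicator created by MPI_Comm_split()/MPI_Cart_sub()
# consumes one context id, and only MPI_Comm_free() returns it to the pool.
# The pool size is taken from "(0/16384 free on this process)"; the leak
# rate used in the assertions below is a made-up illustrative number.

MAX_CONTEXT_IDS = 16384

def run_iterations(n_iters, comms_per_iter, free_after_use):
    """Return the iteration at which the pool runs dry, or None if it never does."""
    in_use = 0
    for it in range(1, n_iters + 1):
        for _ in range(comms_per_iter):
            if in_use >= MAX_CONTEXT_IDS:
                return it  # this is where "Too many communicators" fires
            in_use += 1
        if free_after_use:
            in_use -= comms_per_iter  # the missing MPI_Comm_free() calls
    return None
```

With a leak of roughly 1700 communicators per iteration the pool runs dry around the 10th iteration, consistent with the reported crash "at about the 10th iteration"; with proper freeing it never does.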
>
>
> On Mon, Dec 10, 2018 at 6:53 AM Alex.Durie <alex.durie at open.ac.uk> wrote:
> >
> > The problem seems to crop up when a minimum of 8 processes is used. As
> a quick and easily accessible test, I tried it on example08 of the
> Wannier90 examples with the following command:
> >
> >
> > mpirun -np 8 pw.x -i iron.scf > scf.out
> >
> >
> > and the same problem occurred. I am using PWSCF v.6.3 with the Intel
> Parallel Studio 2016 suite. pw.x was built with the Intel compilers, Intel
> MPI and MKL.
> >
> >
> > Many thanks,
> >
> >
> > Alex
> >
> >
> > Date: Sun, 9 Dec 2018 21:26:31 +0100
> > From: Paolo Giannozzi <p.giannozzi at gmail.com>
> > To: Quantum Espresso users Forum <users at lists.quantum-espresso.org>
> > Subject: Re: [QE-users] MPI error in pw.x
> >
> > If it is not a problem of your compiler or MPI libraries, it can only be
> > the usual problem of irreproducibility of results on different
> processors.
> > In order to figure this out, one needs, as a strict minimum, some
> information
> > on which exact version exhibits the problem, under which exact
> > circumstances (e.g. mpirun -np ...), and an input that can be run in a
> > reasonable amount of time on a reasonably small machine.
> >
> > Paolo
> > -
> >
> > On Sat, Dec 8, 2018 at 9:55 PM Alex.Durie <alex.durie at open.ac.uk> wrote:
> >
> > > Dear experts,
> > >
> > > I have been running pw.x with multiple processes quite successfully;
> > > however, when the number of processes is high enough that the space
> > > group has more than 7 processes, so that the subspace diagonalization
> > > no longer uses a serial algorithm, the program crashes abruptly at
> about the
> > > 10th iteration with the following errors:
> > >
> > > Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
> > > PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,
> > > remain_dims=0x7ffe0b27a6e8, comm_new=0x7ffe0b27a640) failed
> > > PMPI_Cart_sub(178)...................:
> > > MPIR_Comm_split_impl(270)............:
> > > MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384
> > > free on this process; ignore_id=0)
> > > [the same PMPI_Cart_sub "Too many communicators (0/16384 free)" error
> > > is repeated by three further MPI ranks]
> > > forrtl: error (69): process interrupted (SIGINT)
> > > Image              PC                Routine            Line
> > > Source
> > > pw.x               0000000000EAAC45  Unknown               Unknown
> Unknown
> > > pw.x               0000000000EA8867  Unknown               Unknown
> Unknown
> > > pw.x               0000000000E3DC64  Unknown               Unknown
> Unknown
> > > pw.x               0000000000E3DA76  Unknown               Unknown
> Unknown
> > > pw.x               0000000000DC41B6  Unknown               Unknown
> Unknown
> > > pw.x               0000000000DCBB2E  Unknown               Unknown
> Unknown
> > > libpthread.so.0    00002BA339B746D0  Unknown               Unknown
> > > Unknown
> > > libmpi.so.12       00002BA3390A345F  Unknown               Unknown
> > > Unknown
> > > libmpi.so.12       00002BA3391AEE39  Unknown               Unknown
> > > Unknown
> > > libmpi.so.12       00002BA3391AEB32  Unknown               Unknown
> > > Unknown
> > > libmpi.so.12       00002BA3390882F9  Unknown               Unknown
> > > Unknown
> > > libmpi.so.12       00002BA339087D5D  Unknown               Unknown
> > > Unknown
> > > libmpi.so.12       00002BA339087BDC  Unknown               Unknown
> > > Unknown
> > > libmpi.so.12       00002BA339087B0C  Unknown               Unknown
> > > Unknown
> > > libmpi.so.12       00002BA339089932  Unknown               Unknown
> > > Unknown
> > > libmpifort.so.12   00002BA338C41B1C  Unknown               Unknown
> > > Unknown
> > > pw.x               0000000000BCEE47  bcast_real_                37
> > > mp_base.f90
> > > pw.x               0000000000BAF7E4  mp_mp_mp_bcast_rv         395
> mp.f90
> > > pw.x               0000000000B6E881  pcdiaghg_                 363
> > > cdiaghg.f90
> > > pw.x               0000000000AF7304  protate_wfc_k_            256
> > > rotate_wfc_k.f90
> > > pw.x               0000000000681E82  rotate_wfc_                64
> > > rotate_wfc.f90
> > > pw.x               000000000064F519  diag_bands_               423
> > > c_bands.f90
> > > pw.x               000000000064CAD4  c_bands_                   99
> > > c_bands.f90
> > > pw.x               000000000040C014  electrons_scf_            552
> > > electrons.f90
> > > pw.x               0000000000408DBD  electrons_                146
> > > electrons.f90
> > > pw.x               000000000057582B  run_pwscf_                132
> > > run_pwscf.f90
> > > pw.x               0000000000406AC5  MAIN__                     77
> > > pwscf.f90
> > > pw.x               000000000040695E  Unknown               Unknown
> Unknown
> > > libc.so.6          00002BA33A0A5445  Unknown               Unknown
> > > Unknown
> > > pw.x               0000000000406869  Unknown               Unknown
> Unknown
> > > [three further near-identical tracebacks from other MPI ranks omitted;
> > > each passes through the same routines: bcast_real_ (mp_base.f90),
> > > mp_bcast (mp.f90), pcdiaghg_ (cdiaghg.f90), protate_wfc_k_,
> > > rotate_wfc_, diag_bands_, c_bands_, electrons_scf_, run_pwscf_]
> > >
> > > Sample output below
> > >
> > > Parallel version (MPI), running on    16 processors
> > >
> > > MPI processes distributed on     1 nodes         R & G space division:
> > > proc/nbgrp/npool/nimage =      16
> > >
> > > Reading cobalt.scf
> > >
> > > Message from routine read_cards :
> > > DEPRECATED: no units specified in ATOMIC_POSITIONS card
> > >
> > > Message from routine read_cards :
> > > ATOMIC_POSITIONS: units set to alat
> > >
> > > Current dimensions of program PWSCF are:
> > >
> > > Max number of different atomic species (ntypx) = 10
> > >
> > > Max number of k-points (npk) =  40000
> > >
> > > Max angular momentum in pseudopotentials (lmaxx) =  3
> > >
> > > Presently no symmetry can be used with electric field
> > >
> > >
> > > file Co.pz-n-kjpaw_psl.1.0.0.UPF: wavefunction(s)  4S 3D renormalized
> > >
> > >
> > > Subspace diagonalization in iterative solution of the eigenvalue
> problem:
> > >
> > > one sub-group per band group will be used
> > >
> > > scalapack distributed-memory algorithm (size of sub-group:  2*  2
> procs)
> > >
> > >
> > > Parallelization info
> > >
> > > --------------------
> > >
> > > sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
> > >
> > > Min          13      13      4                 2449     2449     462
> > >
> > > Max          14      14      5                 2516     2516     527
> > >
> > > Sum         221     221     69                39945    39945    7777
> > >
> > > Many thanks,
> > >
> > > Alex Durie
> > > PhD student
> > > Open University
> > > United Kingdom
> > > _______________________________________________
> > > users mailing list
> > > users at lists.quantum-espresso.org
> > > https://lists.quantum-espresso.org/mailman/listinfo/users
> >
> >
> >
> > --
> > Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> > Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> > Phone +39-0432-558216, fax +39-0432-558222
>
>
>
> --
> Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
> College of Science & Technology, Temple University, Philadelphia PA, USA
> International Centre for Theoretical Physics, Trieste. Italy.
>


-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222

