<div dir="ltr"><div>HI Axel, such a bug was present some time ago but it has been fixed since (I think). <br></div><div><br></div><div>I cannot reproduce the problem with an old Intel compiler and OpenMPI.</div><div><br></div><div>Psolo<br></div></div><br><div class="gmail_quote"><div dir="ltr">On Mon, Dec 10, 2018 at 1:16 PM Axel Kohlmeyer <<a href="mailto:akohlmey@gmail.com">akohlmey@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">a careful look at the error message reveals, that you are running out<br>
of space for MPI communicators, for which a fixed maximum number (16k)<br>
seems to be allowed.<br>
This hints at a problem somewhere: communicators are created<br>
with MPI_Comm_split() but are not properly freed afterwards.<br>
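<br>
For illustration, here is a minimal stand-alone sketch of that leak<br>
pattern and its fix (not QE code; the program name, 2-d grid shape,<br>
and loop count are just assumptions for the demo):<br>
<br>
! comm_leak_demo.f90 -- each MPI_Cart_sub/MPI_Comm_split call consumes<br>
! a context ID; Intel MPI has a pool of only 16384 per process, so a<br>
! loop that never frees its sub-communicators eventually aborts with<br>
! the "Too many communicators" error reported below.<br>
program comm_leak_demo<br>
  use mpi<br>
  implicit none<br>
  integer :: ierr, nproc, cart_comm, sub_comm, i<br>
  integer :: dims(2)<br>
  logical :: periods(2), remain(2)<br>
<br>
  call MPI_Init(ierr)<br>
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)<br>
  dims = 0<br>
  call MPI_Dims_create(nproc, 2, dims, ierr)<br>
  periods = .false.<br>
  call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .false., &<br>
                       cart_comm, ierr)<br>
  remain(1) = .true.<br>
  remain(2) = .false.<br>
  do i = 1, 20000                 ! more than the 16384-entry pool<br>
     call MPI_Cart_sub(cart_comm, remain, sub_comm, ierr)<br>
     ! without the next call, every iteration leaks one communicator<br>
     call MPI_Comm_free(sub_comm, ierr)<br>
  end do<br>
  call MPI_Comm_free(cart_comm, ierr)<br>
  call MPI_Finalize(ierr)<br>
end program comm_leak_demo<br>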
<br>
axel.<br>
<br>
<br>
On Mon, Dec 10, 2018 at 6:53 AM Alex.Durie <<a href="mailto:alex.durie@open.ac.uk" target="_blank">alex.durie@open.ac.uk</a>> wrote:<br>
><br>
> The problem seems to crop up when at least 8 processors are used. As a quick and easily accessible test, I tried it on example08 of the Wannier90 examples with the following command:<br>
><br>
><br>
> mpirun -np 8 pw.x -i iron.scf > scf.out<br>
><br>
><br>
> and the same problem occurred. I am using PWSCF v.6.3 with the Intel Parallel Studio 2016 suite; pw.x was built entirely with the Intel compilers, Intel MPI, and MKL.<br>
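><br>
> For reference: since runs on up to 7 processes (i.e. with the serial<br>
> subspace diagonalization) work fine, I believe the serial path can also<br>
> be forced at higher process counts with pw.x's -ndiag flag, e.g.<br>
><br>
> mpirun -np 8 pw.x -ndiag 1 -i iron.scf > scf.out<br>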
><br>
><br>
> Many thanks,<br>
><br>
><br>
> Alex<br>
><br>
><br>
> On Sun, 9 Dec 2018 at 21:26, Paolo Giannozzi <<a href="mailto:p.giannozzi@gmail.com" target="_blank">p.giannozzi@gmail.com</a>> wrote:<br>
><br>
> If it is not a problem with your compiler or MPI libraries, it can only be<br>
> the usual problem of irreproducibility of results on different processors.<br>
> In order to figure this out, one needs, as a strict minimum, some information<br>
> on which exact version exhibits the problem, under which exact<br>
> circumstances (e.g. mpirun -np ...), and an input that can be run in a<br>
> reasonable amount of time on a reasonably small machine.<br>
><br>
> Paolo<br>
><br>
> On Sat, Dec 8, 2018 at 9:55 PM Alex.Durie <<a href="mailto:alex.durie@open.ac.uk" target="_blank">alex.durie@open.ac.uk</a>> wrote:<br>
><br>
> > Dear experts,<br>
> ><br>
> > I have been running pw.x with multiple processes quite successfully;<br>
> > however, when the number of processes is high enough that the<br>
> > diagonalization group has more than 7 processes, so that the subspace<br>
> > diagonalization no longer uses a serial algorithm, the program crashes<br>
> > abruptly at about the 10th iteration with the following errors:<br>
> ><br>
> > Fatal error in PMPI_Cart_sub: Other MPI error, error stack:<br>
> > PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,<br>
> > remain_dims=0x7ffe0b27a6e8, comm_new=0x7ffe0b27a640) failed<br>
> > PMPI_Cart_sub(178)...................:<br>
> > MPIR_Comm_split_impl(270)............:<br>
> > MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384<br>
> > free on this process; ignore_id=0)<br>
> > [the same "Fatal error in PMPI_Cart_sub ... Too many communicators" block is printed by three more ranks; only the addresses differ]<br>
> > forrtl: error (69): process interrupted (SIGINT)<br>
> > Image              PC                Routine            Line        Source<br>
> > pw.x               0000000000EAAC45  Unknown               Unknown  Unknown<br>
> > pw.x               0000000000EA8867  Unknown               Unknown  Unknown<br>
> > pw.x               0000000000E3DC64  Unknown               Unknown  Unknown<br>
> > pw.x               0000000000E3DA76  Unknown               Unknown  Unknown<br>
> > pw.x               0000000000DC41B6  Unknown               Unknown  Unknown<br>
> > pw.x               0000000000DCBB2E  Unknown               Unknown  Unknown<br>
> > libpthread.so.0    00002BA339B746D0  Unknown               Unknown  Unknown<br>
> > libmpi.so.12       00002BA3390A345F  Unknown               Unknown  Unknown<br>
> > libmpi.so.12       00002BA3391AEE39  Unknown               Unknown  Unknown<br>
> > libmpi.so.12       00002BA3391AEB32  Unknown               Unknown  Unknown<br>
> > libmpi.so.12       00002BA3390882F9  Unknown               Unknown  Unknown<br>
> > libmpi.so.12       00002BA339087D5D  Unknown               Unknown  Unknown<br>
> > libmpi.so.12       00002BA339087BDC  Unknown               Unknown  Unknown<br>
> > libmpi.so.12       00002BA339087B0C  Unknown               Unknown  Unknown<br>
> > libmpi.so.12       00002BA339089932  Unknown               Unknown  Unknown<br>
> > libmpifort.so.12   00002BA338C41B1C  Unknown               Unknown  Unknown<br>
> > pw.x               0000000000BCEE47  bcast_real_                37  mp_base.f90<br>
> > pw.x               0000000000BAF7E4  mp_mp_mp_bcast_rv         395  mp.f90<br>
> > pw.x               0000000000B6E881  pcdiaghg_                 363  cdiaghg.f90<br>
> > pw.x               0000000000AF7304  protate_wfc_k_            256  rotate_wfc_k.f90<br>
> > pw.x               0000000000681E82  rotate_wfc_                64  rotate_wfc.f90<br>
> > pw.x               000000000064F519  diag_bands_               423  c_bands.f90<br>
> > pw.x               000000000064CAD4  c_bands_                   99  c_bands.f90<br>
> > pw.x               000000000040C014  electrons_scf_            552  electrons.f90<br>
> > pw.x               0000000000408DBD  electrons_                146  electrons.f90<br>
> > pw.x               000000000057582B  run_pwscf_                132  run_pwscf.f90<br>
> > pw.x               0000000000406AC5  MAIN__                     77  pwscf.f90<br>
> > pw.x               000000000040695E  Unknown               Unknown  Unknown<br>
> > libc.so.6          00002BA33A0A5445  Unknown               Unknown  Unknown<br>
> > pw.x               0000000000406869  Unknown               Unknown  Unknown<br>
> > [... identical forrtl traceback repeated by three more MPI ranks ...]<br>
> ><br>
> > Sample output below<br>
> ><br>
> > Parallel version (MPI), running on    16 processors<br>
> ><br>
> > MPI processes distributed on     1 nodes<br>
> > R & G space division:  proc/nbgrp/npool/nimage =      16<br>
> ><br>
> > Reading cobalt.scf<br>
> ><br>
> > Message from routine read_cards :<br>
> > DEPRECATED: no units specified in ATOMIC_POSITIONS card<br>
> > Message from routine read_cards :<br>
> > ATOMIC_POSITIONS: units set to alat<br>
> ><br>
> > Current dimensions of program PWSCF are:<br>
> ><br>
> > Max number of different atomic species (ntypx) = 10<br>
> ><br>
> > Max number of k-points (npk) =  40000<br>
> ><br>
> > Max angular momentum in pseudopotentials (lmaxx) =  3<br>
> ><br>
> > Presently no symmetry can be used with electric field<br>
> ><br>
> ><br>
> > file Co.pz-n-kjpaw_psl.1.0.0.UPF: wavefunction(s)  4S 3D renormalized<br>
> ><br>
> ><br>
> > Subspace diagonalization in iterative solution of the eigenvalue problem:<br>
> ><br>
> > one sub-group per band group will be used<br>
> ><br>
> > scalapack distributed-memory algorithm (size of sub-group:  2*  2 procs)<br>
> ><br>
> ><br>
> > Parallelization info<br>
> ><br>
> > --------------------<br>
> ><br>
> > sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW<br>
> ><br>
> > Min          13      13      4                 2449     2449     462<br>
> ><br>
> > Max          14      14      5                 2516     2516     527<br>
> ><br>
> > Sum         221     221     69                39945    39945    7777<br>
> ><br>
> > Many thanks,<br>
> ><br>
> > Alex Durie<br>
> > PhD student<br>
> > Open University<br>
> > United Kingdom<br>
><br>
><br>
><br>
> --<br>
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,<br>
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy<br>
> Phone +39-0432-558216, fax +39-0432-558222<br>
<br>
<br>
<br>
-- <br>
Dr. Axel Kohlmeyer  <a href="mailto:akohlmey@gmail.com" target="_blank">akohlmey@gmail.com</a>  <a href="http://goo.gl/1wk0" rel="noreferrer" target="_blank">http://goo.gl/1wk0</a><br>
College of Science & Technology, Temple University, Philadelphia PA, USA<br>
International Centre for Theoretical Physics, Trieste. Italy.<br>
_______________________________________________<br>
users mailing list<br>
<a href="mailto:users@lists.quantum-espresso.org" target="_blank">users@lists.quantum-espresso.org</a><br>
<a href="https://lists.quantum-espresso.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.quantum-espresso.org/mailman/listinfo/users</a><br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div>Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,<br>Univ. Udine, via delle Scienze 208, 33100 Udine, Italy<br>Phone +39-0432-558216, fax +39-0432-558222<br><br></div></div></div></div></div>