<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" style="font-size: 12pt; color: rgb(0, 0, 0); font-family: Calibri, Helvetica, sans-serif, EmojiFont, 'Apple Color Emoji', 'Segoe UI Emoji', NotoColorEmoji, 'Segoe UI Symbol', 'Android Emoji', EmojiSymbols;" dir="ltr">
<p style="margin-top:0;margin-bottom:0">The problem seems to appear whenever 8 or more processes are used. As a quick and easily accessible test, I ran example08 of the Wannier90 examples with the following command:</p>
<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">mpirun -np 8 pw.x -i iron.scf > scf.out</p>
<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">and the same problem occurred. I am using PWSCF v.6.3 with the Intel Parallel Studio 2016 suite. PW was built entirely with the Intel compilers, Intel MPI and MKL.</p>
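<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">To make the threshold concrete, here is a minimal sketch of how the default diagonalization sub-group might be sized. The rule used (the largest square grid that fits in half of the processes) is only an assumption: it reproduces what is reported in this thread, namely the serial algorithm up to 7 processes and a 2*2 sub-group on 16, but it is not taken from the pw.x source. The function names default_diag_grid and describe are hypothetical.</p>
<pre>
```python
import math


def default_diag_grid(nproc):
    """Assumed rule: largest square grid that fits in half of the processes."""
    half = max(nproc // 2, 1)  # assumption: at most half of the processes are used
    side = math.isqrt(half)    # largest side such that side*side fits in half
    return side, side


def describe(nproc):
    """Return a one-line summary of the assumed sub-group for nproc processes."""
    rows, cols = default_diag_grid(nproc)
    kind = "serial" if rows * cols == 1 else "distributed"
    return "{} processes: {}*{} sub-group ({})".format(nproc, rows, cols, kind)


if __name__ == "__main__":
    # Process counts on either side of the threshold seen in this thread.
    for nproc in (4, 7, 8, 16):
        print(describe(nproc))
```
</pre>
<p style="margin-top:0;margin-bottom:0">Under this assumption, 8 is the first process count that leaves the serial path, which matches the point at which the crash starts to appear.</p>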
<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">Many thanks,</p>
<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">Alex</p>
<div style="color: rgb(0, 0, 0);">
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText"><br>
Date: Sun, 9 Dec 2018 21:26:31 +0100<br>
From: Paolo Giannozzi &lt;p.giannozzi@gmail.com&gt;<br>
To: Quantum Espresso users Forum &lt;users@lists.quantum-espresso.org&gt;<br>
Subject: Re: [QE-users] MPI error in pw.x<br>
<br>
If it is not a problem of your compiler or MPI libraries, it can only be<br>
the usual problem of irreproducibility of results on different processors.<br>
To figure this out, one needs, as a strict minimum, some information<br>
on which exact version exhibits the problem, under which exact<br>
circumstances (e.g. mpirun -np ...), and an input that can be run in a<br>
reasonable amount of time on a reasonably small machine.<br>
<br>
Paolo<br>
-<br>
<br>
On Sat, Dec 8, 2018 at 9:55 PM Alex.Durie &lt;alex.durie@open.ac.uk&gt; wrote:<br>
<br>
> Dear experts,<br>
><br>
> I have been running pw.x with multiple processes quite successfully.<br>
> However, when the number of processes is high enough that the space<br>
> group has more than 7 processes, so that the subspace diagonalization no<br>
> longer uses a serial algorithm, the program crashes abruptly at about the<br>
> 10th iteration with the following errors:<br>
><br>
> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:<br>
> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,<br>
> remain_dims=0x7ffe0b27a6e8, comm_new=0x7ffe0b27a640) failed<br>
> PMPI_Cart_sub(178)...................:<br>
> MPIR_Comm_split_impl(270)............:<br>
> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384<br>
> free on this process; ignore_id=0)<br>
> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:<br>
> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,<br>
> remain_dims=0x7ffefaee7ce8, comm_new=0x7ffefaee7c40) failed<br>
> PMPI_Cart_sub(178)...................:<br>
> MPIR_Comm_split_impl(270)............:<br>
> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384<br>
> free on this process; ignore_id=0)<br>
> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:<br>
> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,<br>
> remain_dims=0x7fffc482d168, comm_new=0x7fffc482d0c0) failed<br>
> PMPI_Cart_sub(178)...................:<br>
> MPIR_Comm_split_impl(270)............:<br>
> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384<br>
> free on this process; ignore_id=0)<br>
> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:<br>
> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,<br>
> remain_dims=0x7ffe23a022e8, comm_new=0x7ffe23a02240) failed<br>
> PMPI_Cart_sub(178)...................:<br>
> MPIR_Comm_split_impl(270)............:<br>
> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384<br>
> free on this process; ignore_id=0)<br>
> forrtl: error (69): process interrupted (SIGINT)<br>
> Image PC Routine Line<br>
> Source<br>
> pw.x 0000000000EAAC45 Unknown Unknown Unknown<br>
> pw.x 0000000000EA8867 Unknown Unknown Unknown<br>
> pw.x 0000000000E3DC64 Unknown Unknown Unknown<br>
> pw.x 0000000000E3DA76 Unknown Unknown Unknown<br>
> pw.x 0000000000DC41B6 Unknown Unknown Unknown<br>
> pw.x 0000000000DCBB2E Unknown Unknown Unknown<br>
> libpthread.so.0 00002BA339B746D0 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002BA3390A345F Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002BA3391AEE39 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002BA3391AEB32 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002BA3390882F9 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002BA339087D5D Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002BA339087BDC Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002BA339087B0C Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002BA339089932 Unknown Unknown<br>
> Unknown<br>
> libmpifort.so.12 00002BA338C41B1C Unknown Unknown<br>
> Unknown<br>
> pw.x 0000000000BCEE47 bcast_real_ 37<br>
> mp_base.f90<br>
> pw.x 0000000000BAF7E4 mp_mp_mp_bcast_rv 395 mp.f90<br>
> pw.x 0000000000B6E881 pcdiaghg_ 363<br>
> cdiaghg.f90<br>
> pw.x 0000000000AF7304 protate_wfc_k_ 256<br>
> rotate_wfc_k.f90<br>
> pw.x 0000000000681E82 rotate_wfc_ 64<br>
> rotate_wfc.f90<br>
> pw.x 000000000064F519 diag_bands_ 423<br>
> c_bands.f90<br>
> pw.x 000000000064CAD4 c_bands_ 99<br>
> c_bands.f90<br>
> pw.x 000000000040C014 electrons_scf_ 552<br>
> electrons.f90<br>
> pw.x 0000000000408DBD electrons_ 146<br>
> electrons.f90<br>
> pw.x 000000000057582B run_pwscf_ 132<br>
> run_pwscf.f90<br>
> pw.x 0000000000406AC5 MAIN__ 77<br>
> pwscf.f90<br>
> pw.x 000000000040695E Unknown Unknown Unknown<br>
> libc.so.6 00002BA33A0A5445 Unknown Unknown<br>
> Unknown<br>
> pw.x 0000000000406869 Unknown Unknown Unknown<br>
> forrtl: error (69): process interrupted (SIGINT)<br>
> Image PC Routine Line<br>
> Source<br>
> pw.x 0000000000EAAC45 Unknown Unknown Unknown<br>
> pw.x 0000000000EA8867 Unknown Unknown Unknown<br>
> pw.x 0000000000E3DC64 Unknown Unknown Unknown<br>
> pw.x 0000000000E3DA76 Unknown Unknown Unknown<br>
> pw.x 0000000000DC41B6 Unknown Unknown Unknown<br>
> pw.x 0000000000DCBB2E Unknown Unknown Unknown<br>
> libpthread.so.0 00002B8E527936D0 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002B8E51CC276E Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002B8E51DCDE39 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002B8E51DCDB32 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002B8E51CA72F9 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002B8E51CA6D5D Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002B8E51CA6BDC Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002B8E51CA6B0C Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002B8E51CA8932 Unknown Unknown<br>
> Unknown<br>
> libmpifort.so.12 00002B8E51860B1C Unknown Unknown<br>
> Unknown<br>
> pw.x 0000000000BCEE47 bcast_real_ 37<br>
> mp_base.f90<br>
> pw.x 0000000000BAF7E4 mp_mp_mp_bcast_rv 395 mp.f90<br>
> pw.x 0000000000B6E881 pcdiaghg_ 363<br>
> cdiaghg.f90<br>
> pw.x 0000000000AF7304 protate_wfc_k_ 256<br>
> rotate_wfc_k.f90<br>
> pw.x 0000000000681E82 rotate_wfc_ 64<br>
> rotate_wfc.f90<br>
> pw.x 000000000064F519 diag_bands_ 423<br>
> c_bands.f90<br>
> pw.x 000000000064CAD4 c_bands_ 99<br>
> c_bands.f90<br>
> pw.x 000000000040C014 electrons_scf_ 552<br>
> electrons.f90<br>
> pw.x 0000000000408DBD electrons_ 146<br>
> electrons.f90<br>
> pw.x 000000000057582B run_pwscf_ 132<br>
> run_pwscf.f90<br>
> pw.x 0000000000406AC5 MAIN__ 77<br>
> pwscf.f90<br>
> pw.x 000000000040695E Unknown Unknown Unknown<br>
> libc.so.6 00002B8E52CC4445 Unknown Unknown<br>
> Unknown<br>
> pw.x 0000000000406869 Unknown Unknown Unknown<br>
> forrtl: error (69): process interrupted (SIGINT)<br>
> Image PC Routine Line<br>
> Source<br>
> pw.x 0000000000EAAC45 Unknown Unknown Unknown<br>
> pw.x 0000000000EA8867 Unknown Unknown Unknown<br>
> pw.x 0000000000E3DC64 Unknown Unknown Unknown<br>
> pw.x 0000000000E3DA76 Unknown Unknown Unknown<br>
> pw.x 0000000000DC41B6 Unknown Unknown Unknown<br>
> pw.x 0000000000DCBB2E Unknown Unknown Unknown<br>
> libpthread.so.0 00002ABAB008D6D0 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ABAAF5BC45C Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ABAAF6C7E39 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ABAAF6C7B32 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ABAAF5A12F9 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ABAAF5A0D5D Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ABAAF5A0BDC Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ABAAF5A0B0C Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ABAAF5A2932 Unknown Unknown<br>
> Unknown<br>
> libmpifort.so.12 00002ABAAF15AB1C Unknown Unknown<br>
> Unknown<br>
> pw.x 0000000000BCEE47 bcast_real_ 37<br>
> mp_base.f90<br>
> pw.x 0000000000BAF7E4 mp_mp_mp_bcast_rv 395 mp.f90<br>
> pw.x 0000000000B6E881 pcdiaghg_ 363<br>
> cdiaghg.f90<br>
> pw.x 0000000000AF7304 protate_wfc_k_ 256<br>
> rotate_wfc_k.f90<br>
> pw.x 0000000000681E82 rotate_wfc_ 64<br>
> rotate_wfc.f90<br>
> pw.x 000000000064F519 diag_bands_ 423<br>
> c_bands.f90<br>
> pw.x 000000000064CAD4 c_bands_ 99<br>
> c_bands.f90<br>
> pw.x 000000000040C014 electrons_scf_ 552<br>
> electrons.f90<br>
> pw.x 0000000000408DBD electrons_ 146<br>
> electrons.f90<br>
> pw.x 000000000057582B run_pwscf_ 132<br>
> run_pwscf.f90<br>
> pw.x 0000000000406AC5 MAIN__ 77<br>
> pwscf.f90<br>
> pw.x 000000000040695E Unknown Unknown Unknown<br>
> libc.so.6 00002ABAB05BE445 Unknown Unknown<br>
> Unknown<br>
> pw.x 0000000000406869 Unknown Unknown Unknown<br>
> forrtl: error (69): process interrupted (SIGINT)<br>
> Image PC Routine Line<br>
> Source<br>
> pw.x 0000000000EAAC45 Unknown Unknown Unknown<br>
> pw.x 0000000000EA8867 Unknown Unknown Unknown<br>
> pw.x 0000000000E3DC64 Unknown Unknown Unknown<br>
> pw.x 0000000000E3DA76 Unknown Unknown Unknown<br>
> pw.x 0000000000DC41B6 Unknown Unknown Unknown<br>
> pw.x 0000000000DCBB2E Unknown Unknown Unknown<br>
> libpthread.so.0 00002ACB4BF866D0 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ACB4B4B5775 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ACB4B5C0E39 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ACB4B5C0B32 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ACB4B49A2F9 Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ACB4B499D5D Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ACB4B499BDC Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ACB4B499B0C Unknown Unknown<br>
> Unknown<br>
> libmpi.so.12 00002ACB4B49B932 Unknown Unknown<br>
> Unknown<br>
> libmpifort.so.12 00002ACB4B053B1C Unknown Unknown<br>
> Unknown<br>
> pw.x 0000000000BCEE47 bcast_real_ 37<br>
> mp_base.f90<br>
> pw.x 0000000000BAF7E4 mp_mp_mp_bcast_rv 395 mp.f90<br>
> pw.x 0000000000B6E881 pcdiaghg_ 363<br>
> cdiaghg.f90<br>
> pw.x 0000000000AF7304 protate_wfc_k_ 256<br>
> rotate_wfc_k.f90<br>
> pw.x 0000000000681E82 rotate_wfc_ 64<br>
> rotate_wfc.f90<br>
> pw.x 000000000064F519 diag_bands_ 423<br>
> c_bands.f90<br>
> pw.x 000000000064CAD4 c_bands_ 99<br>
> c_bands.f90<br>
> pw.x 000000000040C014 electrons_scf_ 552<br>
> electrons.f90<br>
> pw.x 0000000000408DBD electrons_ 146<br>
> electrons.f90<br>
> pw.x 000000000057582B run_pwscf_ 132<br>
> run_pwscf.f90<br>
> pw.x 0000000000406AC5 MAIN__ 77<br>
> pwscf.f90<br>
> pw.x 000000000040695E Unknown Unknown Unknown<br>
> libc.so.6 00002ACB4C4B7445 Unknown Unknown<br>
> Unknown<br>
> pw.x 0000000000406869 Unknown Unknown Unknown<br>
><br>
> Sample output below<br>
><br>
> Parallel version (MPI), running on 16 processors<br>
><br>
> MPI processes distributed on 1 nodes<br>
> R &amp; G space division: proc/nbgrp/npool/nimage = 16<br>
><br>
> Reading cobalt.scf<br>
> Message from routine read_cards :<br>
> DEPRECATED: no units specified in ATOMIC_POSITIONS card<br>
> Message from routine read_cards :<br>
> ATOMIC_POSITIONS: units set to alat<br>
><br>
> Current dimensions of program PWSCF are:<br>
><br>
> Max number of different atomic species (ntypx) = 10<br>
><br>
> Max number of k-points (npk) = 40000<br>
><br>
> Max angular momentum in pseudopotentials (lmaxx) = 3<br>
><br>
> Presently no symmetry can be used with electric field<br>
><br>
><br>
> file Co.pz-n-kjpaw_psl.1.0.0.UPF: wavefunction(s) 4S 3D renormalized<br>
><br>
><br>
> Subspace diagonalization in iterative solution of the eigenvalue problem:<br>
><br>
> one sub-group per band group will be used<br>
><br>
> scalapack distributed-memory algorithm (size of sub-group: 2* 2 procs)<br>
><br>
><br>
> Parallelization info<br>
><br>
> --------------------<br>
><br>
> sticks: dense smooth PW G-vecs: dense smooth PW<br>
><br>
> Min 13 13 4 2449 2449 462<br>
><br>
> Max 14 14 5 2516 2516 527<br>
><br>
> Sum 221 221 69 39945 39945 7777<br>
><br>
> Many thanks,<br>
><br>
> Alex Durie<br>
> PhD student<br>
> Open University<br>
> United Kingdom<br>
<br>
-- <br>
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,<br>
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy<br>
Phone +39-0432-558216, fax +39-0432-558222<br>
</div>
</span></font></div>
</div>
</div>
</body>
</html>