[QE-users] MPI error in pw.x

Alex.Durie alex.durie at open.ac.uk
Sat Dec 8 21:55:25 CET 2018


Dear experts,

I have been running pw.x with multiple processes quite successfully. However, when the number of processes is high enough that the space group has more than 7 processes, so that the subspace diagonalization no longer uses a serial algorithm, the program crashes abruptly at about the 10th iteration with the following errors:

Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffe0b27a6e8, comm_new=0x7ffe0b27a640) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffefaee7ce8, comm_new=0x7ffefaee7c40) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7fffc482d168, comm_new=0x7fffc482d0c0) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffe23a022e8, comm_new=0x7ffe23a02240) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source
pw.x               0000000000EAAC45  Unknown               Unknown  Unknown
pw.x               0000000000EA8867  Unknown               Unknown  Unknown
pw.x               0000000000E3DC64  Unknown               Unknown  Unknown
pw.x               0000000000E3DA76  Unknown               Unknown  Unknown
pw.x               0000000000DC41B6  Unknown               Unknown  Unknown
pw.x               0000000000DCBB2E  Unknown               Unknown  Unknown
libpthread.so.0    00002BA339B746D0  Unknown               Unknown  Unknown
libmpi.so.12       00002BA3390A345F  Unknown               Unknown  Unknown
libmpi.so.12       00002BA3391AEE39  Unknown               Unknown  Unknown
libmpi.so.12       00002BA3391AEB32  Unknown               Unknown  Unknown
libmpi.so.12       00002BA3390882F9  Unknown               Unknown  Unknown
libmpi.so.12       00002BA339087D5D  Unknown               Unknown  Unknown
libmpi.so.12       00002BA339087BDC  Unknown               Unknown  Unknown
libmpi.so.12       00002BA339087B0C  Unknown               Unknown  Unknown
libmpi.so.12       00002BA339089932  Unknown               Unknown  Unknown
libmpifort.so.12   00002BA338C41B1C  Unknown               Unknown  Unknown
pw.x               0000000000BCEE47  bcast_real_                37  mp_base.f90
pw.x               0000000000BAF7E4  mp_mp_mp_bcast_rv         395  mp.f90
pw.x               0000000000B6E881  pcdiaghg_                 363  cdiaghg.f90
pw.x               0000000000AF7304  protate_wfc_k_            256  rotate_wfc_k.f90
pw.x               0000000000681E82  rotate_wfc_                64  rotate_wfc.f90
pw.x               000000000064F519  diag_bands_               423  c_bands.f90
pw.x               000000000064CAD4  c_bands_                   99  c_bands.f90
pw.x               000000000040C014  electrons_scf_            552  electrons.f90
pw.x               0000000000408DBD  electrons_                146  electrons.f90
pw.x               000000000057582B  run_pwscf_                132  run_pwscf.f90
pw.x               0000000000406AC5  MAIN__                     77  pwscf.f90
pw.x               000000000040695E  Unknown               Unknown  Unknown
libc.so.6          00002BA33A0A5445  Unknown               Unknown  Unknown
pw.x               0000000000406869  Unknown               Unknown  Unknown
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source
pw.x               0000000000EAAC45  Unknown               Unknown  Unknown
pw.x               0000000000EA8867  Unknown               Unknown  Unknown
pw.x               0000000000E3DC64  Unknown               Unknown  Unknown
pw.x               0000000000E3DA76  Unknown               Unknown  Unknown
pw.x               0000000000DC41B6  Unknown               Unknown  Unknown
pw.x               0000000000DCBB2E  Unknown               Unknown  Unknown
libpthread.so.0    00002B8E527936D0  Unknown               Unknown  Unknown
libmpi.so.12       00002B8E51CC276E  Unknown               Unknown  Unknown
libmpi.so.12       00002B8E51DCDE39  Unknown               Unknown  Unknown
libmpi.so.12       00002B8E51DCDB32  Unknown               Unknown  Unknown
libmpi.so.12       00002B8E51CA72F9  Unknown               Unknown  Unknown
libmpi.so.12       00002B8E51CA6D5D  Unknown               Unknown  Unknown
libmpi.so.12       00002B8E51CA6BDC  Unknown               Unknown  Unknown
libmpi.so.12       00002B8E51CA6B0C  Unknown               Unknown  Unknown
libmpi.so.12       00002B8E51CA8932  Unknown               Unknown  Unknown
libmpifort.so.12   00002B8E51860B1C  Unknown               Unknown  Unknown
pw.x               0000000000BCEE47  bcast_real_                37  mp_base.f90
pw.x               0000000000BAF7E4  mp_mp_mp_bcast_rv         395  mp.f90
pw.x               0000000000B6E881  pcdiaghg_                 363  cdiaghg.f90
pw.x               0000000000AF7304  protate_wfc_k_            256  rotate_wfc_k.f90
pw.x               0000000000681E82  rotate_wfc_                64  rotate_wfc.f90
pw.x               000000000064F519  diag_bands_               423  c_bands.f90
pw.x               000000000064CAD4  c_bands_                   99  c_bands.f90
pw.x               000000000040C014  electrons_scf_            552  electrons.f90
pw.x               0000000000408DBD  electrons_                146  electrons.f90
pw.x               000000000057582B  run_pwscf_                132  run_pwscf.f90
pw.x               0000000000406AC5  MAIN__                     77  pwscf.f90
pw.x               000000000040695E  Unknown               Unknown  Unknown
libc.so.6          00002B8E52CC4445  Unknown               Unknown  Unknown
pw.x               0000000000406869  Unknown               Unknown  Unknown
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source
pw.x               0000000000EAAC45  Unknown               Unknown  Unknown
pw.x               0000000000EA8867  Unknown               Unknown  Unknown
pw.x               0000000000E3DC64  Unknown               Unknown  Unknown
pw.x               0000000000E3DA76  Unknown               Unknown  Unknown
pw.x               0000000000DC41B6  Unknown               Unknown  Unknown
pw.x               0000000000DCBB2E  Unknown               Unknown  Unknown
libpthread.so.0    00002ABAB008D6D0  Unknown               Unknown  Unknown
libmpi.so.12       00002ABAAF5BC45C  Unknown               Unknown  Unknown
libmpi.so.12       00002ABAAF6C7E39  Unknown               Unknown  Unknown
libmpi.so.12       00002ABAAF6C7B32  Unknown               Unknown  Unknown
libmpi.so.12       00002ABAAF5A12F9  Unknown               Unknown  Unknown
libmpi.so.12       00002ABAAF5A0D5D  Unknown               Unknown  Unknown
libmpi.so.12       00002ABAAF5A0BDC  Unknown               Unknown  Unknown
libmpi.so.12       00002ABAAF5A0B0C  Unknown               Unknown  Unknown
libmpi.so.12       00002ABAAF5A2932  Unknown               Unknown  Unknown
libmpifort.so.12   00002ABAAF15AB1C  Unknown               Unknown  Unknown
pw.x               0000000000BCEE47  bcast_real_                37  mp_base.f90
pw.x               0000000000BAF7E4  mp_mp_mp_bcast_rv         395  mp.f90
pw.x               0000000000B6E881  pcdiaghg_                 363  cdiaghg.f90
pw.x               0000000000AF7304  protate_wfc_k_            256  rotate_wfc_k.f90
pw.x               0000000000681E82  rotate_wfc_                64  rotate_wfc.f90
pw.x               000000000064F519  diag_bands_               423  c_bands.f90
pw.x               000000000064CAD4  c_bands_                   99  c_bands.f90
pw.x               000000000040C014  electrons_scf_            552  electrons.f90
pw.x               0000000000408DBD  electrons_                146  electrons.f90
pw.x               000000000057582B  run_pwscf_                132  run_pwscf.f90
pw.x               0000000000406AC5  MAIN__                     77  pwscf.f90
pw.x               000000000040695E  Unknown               Unknown  Unknown
libc.so.6          00002ABAB05BE445  Unknown               Unknown  Unknown
pw.x               0000000000406869  Unknown               Unknown  Unknown
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source
pw.x               0000000000EAAC45  Unknown               Unknown  Unknown
pw.x               0000000000EA8867  Unknown               Unknown  Unknown
pw.x               0000000000E3DC64  Unknown               Unknown  Unknown
pw.x               0000000000E3DA76  Unknown               Unknown  Unknown
pw.x               0000000000DC41B6  Unknown               Unknown  Unknown
pw.x               0000000000DCBB2E  Unknown               Unknown  Unknown
libpthread.so.0    00002ACB4BF866D0  Unknown               Unknown  Unknown
libmpi.so.12       00002ACB4B4B5775  Unknown               Unknown  Unknown
libmpi.so.12       00002ACB4B5C0E39  Unknown               Unknown  Unknown
libmpi.so.12       00002ACB4B5C0B32  Unknown               Unknown  Unknown
libmpi.so.12       00002ACB4B49A2F9  Unknown               Unknown  Unknown
libmpi.so.12       00002ACB4B499D5D  Unknown               Unknown  Unknown
libmpi.so.12       00002ACB4B499BDC  Unknown               Unknown  Unknown
libmpi.so.12       00002ACB4B499B0C  Unknown               Unknown  Unknown
libmpi.so.12       00002ACB4B49B932  Unknown               Unknown  Unknown
libmpifort.so.12   00002ACB4B053B1C  Unknown               Unknown  Unknown
pw.x               0000000000BCEE47  bcast_real_                37  mp_base.f90
pw.x               0000000000BAF7E4  mp_mp_mp_bcast_rv         395  mp.f90
pw.x               0000000000B6E881  pcdiaghg_                 363  cdiaghg.f90
pw.x               0000000000AF7304  protate_wfc_k_            256  rotate_wfc_k.f90
pw.x               0000000000681E82  rotate_wfc_                64  rotate_wfc.f90
pw.x               000000000064F519  diag_bands_               423  c_bands.f90
pw.x               000000000064CAD4  c_bands_                   99  c_bands.f90
pw.x               000000000040C014  electrons_scf_            552  electrons.f90
pw.x               0000000000408DBD  electrons_                146  electrons.f90
pw.x               000000000057582B  run_pwscf_                132  run_pwscf.f90
pw.x               0000000000406AC5  MAIN__                     77  pwscf.f90
pw.x               000000000040695E  Unknown               Unknown  Unknown
libc.so.6          00002ACB4C4B7445  Unknown               Unknown  Unknown
pw.x               0000000000406869  Unknown               Unknown  Unknown
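
The message "Too many communicators (0/16384 free on this process)" in the error stacks above says that the MPI library had no communicator context IDs left when MPI_Cart_sub was called. One way to reach that state is for a code path that runs on every iteration to create a new communicator and never release it with MPI_Comm_free. Whether that is what happens here is an assumption; the log alone does not prove it. The toy Python model below is neither MPI nor Quantum ESPRESSO code: ContextIdPool and calls_until_exhausted are hypothetical names, and the pool size is simply the 16384 reported in the log. It only sketches why a leak fails after a fixed number of calls while a create-then-free pattern does not.

```python
class ContextIdPool:
    """Toy stand-in for a per-process pool of communicator context IDs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0

    def acquire(self):
        # Models creating a communicator (e.g. via MPI_Cart_sub).
        if self.in_use >= self.capacity:
            raise RuntimeError(
                f"Too many communicators (0/{self.capacity} free on this process)"
            )
        self.in_use += 1

    def release(self):
        # Models releasing a communicator (e.g. via MPI_Comm_free).
        if self.in_use == 0:
            raise RuntimeError("release without a matching acquire")
        self.in_use -= 1


def calls_until_exhausted(capacity, max_calls, free_after_use):
    """Return the 1-based call number that fails, or None if all calls succeed."""
    pool = ContextIdPool(capacity)
    for call in range(1, max_calls + 1):
        try:
            pool.acquire()
        except RuntimeError:
            return call
        if free_after_use:
            pool.release()
    return None


POOL_SIZE = 16384  # value taken from the error message; an assumption for this model

leaky = calls_until_exhausted(POOL_SIZE, 100_000, free_after_use=False)
clean = calls_until_exhausted(POOL_SIZE, 100_000, free_after_use=True)

print("leaking every call: fails on call", leaky)
print("freeing every call: fails on call", clean)
```

In this model the leaking variant fails on call 16385 and the freeing variant never fails. In a real run some IDs are already taken by the library's predefined communicators, so the failing call would come somewhat earlier.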

Sample output is below.

Parallel version (MPI), running on    16 processors

MPI processes distributed on     1 nodes
R & G space division:  proc/nbgrp/npool/nimage =      16

Reading cobalt.scf
Message from routine read_cards :
DEPRECATED: no units specified in ATOMIC_POSITIONS card
Message from routine read_cards :
ATOMIC_POSITIONS: units set to alat

Current dimensions of program PWSCF are:
Max number of different atomic species (ntypx) = 10
Max number of k-points (npk) =  40000
Max angular momentum in pseudopotentials (lmaxx) =  3
Presently no symmetry can be used with electric field

file Co.pz-n-kjpaw_psl.1.0.0.UPF: wavefunction(s)  4S 3D renormalized

Subspace diagonalization in iterative solution of the eigenvalue problem:
one sub-group per band group will be used
scalapack distributed-memory algorithm (size of sub-group:  2*  2 procs)

Parallelization info
--------------------
sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
Min          13      13      4                 2449     2449     462
Max          14      14      5                 2516     2516     527
Sum         221     221     69                39945    39945    7777

Many thanks,

Alex Durie
PhD student
Open University
United Kingdom