[QE-users] MPI error in pw.x
Alex.Durie
alex.durie at open.ac.uk
Sat Dec 8 21:55:25 CET 2018
Dear experts,
I have been running pw.x with multiple processes quite successfully. However, when the number of processes is high enough that the space group has more than 7 processes, so that the subspace diagonalization no longer uses a serial algorithm, the program crashes abruptly at about the 10th iteration with the following errors:
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffe0b27a6e8, comm_new=0x7ffe0b27a640) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffefaee7ce8, comm_new=0x7ffefaee7c40) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7fffc482d168, comm_new=0x7fffc482d0c0) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffe23a022e8, comm_new=0x7ffe23a02240) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
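The "Too many communicators (0/16384 free on this process)" line reports that this process has no communicator context ids left out of a pool of 16384. As a rough illustration only, the Python toy model below shows how a routine that creates sub-communicators on every call and never frees them eventually exhausts a fixed pool, while the same routine with a matching free does not. The names `ContextIdPool` and `calls_until_exhausted` are hypothetical; this is not MPICH's allocator or Quantum ESPRESSO code, and it is not a diagnosis of this particular crash.

```python
class ContextIdPool:
    """Toy model of a fixed-size pool of communicator context ids.

    Hypothetical illustration: the real MPI library also reserves ids for
    its predefined communicators, which this model ignores.
    """

    def __init__(self, capacity=16384):
        self.capacity = capacity
        self.in_use = 0

    def allocate(self):
        """Take one id from the pool, or fail if none are free."""
        if self.in_use >= self.capacity:
            free = self.capacity - self.in_use
            raise RuntimeError(
                f"Too many communicators ({free}/{self.capacity} free on this process)"
            )
        self.in_use += 1

    def release(self):
        """Return one id to the pool (the analogue of freeing a communicator)."""
        if self.in_use == 0:
            raise RuntimeError("release() called with no ids in use")
        self.in_use -= 1


def calls_until_exhausted(comms_per_call, free_after_use,
                          capacity=16384, max_calls=100_000):
    """Return the 1-based call number at which allocation first fails.

    Each call creates `comms_per_call` communicators. If `free_after_use`
    is True they are released before the next call. Returns None when the
    pool survives `max_calls` calls.
    """
    pool = ContextIdPool(capacity)
    for call in range(1, max_calls + 1):
        try:
            for _ in range(comms_per_call):
                pool.allocate()
        except RuntimeError:
            return call
        if free_after_use:
            for _ in range(comms_per_call):
                pool.release()
    return None


if __name__ == "__main__":
    # Two sub-communicators per call (for example one row and one column
    # communicator), assumed here purely for illustration.
    print("without free:", calls_until_exhausted(2, free_after_use=False))
    print("with free:   ", calls_until_exhausted(2, free_after_use=True))
```

In this model the number of calls that succeed before failure scales as the pool size divided by the communicators created per call, so a crash after a roughly fixed amount of work is the kind of symptom an unreleased communicator would produce.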
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
pw.x 0000000000EAAC45 Unknown Unknown Unknown
pw.x 0000000000EA8867 Unknown Unknown Unknown
pw.x 0000000000E3DC64 Unknown Unknown Unknown
pw.x 0000000000E3DA76 Unknown Unknown Unknown
pw.x 0000000000DC41B6 Unknown Unknown Unknown
pw.x 0000000000DCBB2E Unknown Unknown Unknown
libpthread.so.0 00002BA339B746D0 Unknown Unknown Unknown
libmpi.so.12 00002BA3390A345F Unknown Unknown Unknown
libmpi.so.12 00002BA3391AEE39 Unknown Unknown Unknown
libmpi.so.12 00002BA3391AEB32 Unknown Unknown Unknown
libmpi.so.12 00002BA3390882F9 Unknown Unknown Unknown
libmpi.so.12 00002BA339087D5D Unknown Unknown Unknown
libmpi.so.12 00002BA339087BDC Unknown Unknown Unknown
libmpi.so.12 00002BA339087B0C Unknown Unknown Unknown
libmpi.so.12 00002BA339089932 Unknown Unknown Unknown
libmpifort.so.12 00002BA338C41B1C Unknown Unknown Unknown
pw.x 0000000000BCEE47 bcast_real_ 37 mp_base.f90
pw.x 0000000000BAF7E4 mp_mp_mp_bcast_rv 395 mp.f90
pw.x 0000000000B6E881 pcdiaghg_ 363 cdiaghg.f90
pw.x 0000000000AF7304 protate_wfc_k_ 256 rotate_wfc_k.f90
pw.x 0000000000681E82 rotate_wfc_ 64 rotate_wfc.f90
pw.x 000000000064F519 diag_bands_ 423 c_bands.f90
pw.x 000000000064CAD4 c_bands_ 99 c_bands.f90
pw.x 000000000040C014 electrons_scf_ 552 electrons.f90
pw.x 0000000000408DBD electrons_ 146 electrons.f90
pw.x 000000000057582B run_pwscf_ 132 run_pwscf.f90
pw.x 0000000000406AC5 MAIN__ 77 pwscf.f90
pw.x 000000000040695E Unknown Unknown Unknown
libc.so.6 00002BA33A0A5445 Unknown Unknown Unknown
pw.x 0000000000406869 Unknown Unknown Unknown
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
pw.x 0000000000EAAC45 Unknown Unknown Unknown
pw.x 0000000000EA8867 Unknown Unknown Unknown
pw.x 0000000000E3DC64 Unknown Unknown Unknown
pw.x 0000000000E3DA76 Unknown Unknown Unknown
pw.x 0000000000DC41B6 Unknown Unknown Unknown
pw.x 0000000000DCBB2E Unknown Unknown Unknown
libpthread.so.0 00002B8E527936D0 Unknown Unknown Unknown
libmpi.so.12 00002B8E51CC276E Unknown Unknown Unknown
libmpi.so.12 00002B8E51DCDE39 Unknown Unknown Unknown
libmpi.so.12 00002B8E51DCDB32 Unknown Unknown Unknown
libmpi.so.12 00002B8E51CA72F9 Unknown Unknown Unknown
libmpi.so.12 00002B8E51CA6D5D Unknown Unknown Unknown
libmpi.so.12 00002B8E51CA6BDC Unknown Unknown Unknown
libmpi.so.12 00002B8E51CA6B0C Unknown Unknown Unknown
libmpi.so.12 00002B8E51CA8932 Unknown Unknown Unknown
libmpifort.so.12 00002B8E51860B1C Unknown Unknown Unknown
pw.x 0000000000BCEE47 bcast_real_ 37 mp_base.f90
pw.x 0000000000BAF7E4 mp_mp_mp_bcast_rv 395 mp.f90
pw.x 0000000000B6E881 pcdiaghg_ 363 cdiaghg.f90
pw.x 0000000000AF7304 protate_wfc_k_ 256 rotate_wfc_k.f90
pw.x 0000000000681E82 rotate_wfc_ 64 rotate_wfc.f90
pw.x 000000000064F519 diag_bands_ 423 c_bands.f90
pw.x 000000000064CAD4 c_bands_ 99 c_bands.f90
pw.x 000000000040C014 electrons_scf_ 552 electrons.f90
pw.x 0000000000408DBD electrons_ 146 electrons.f90
pw.x 000000000057582B run_pwscf_ 132 run_pwscf.f90
pw.x 0000000000406AC5 MAIN__ 77 pwscf.f90
pw.x 000000000040695E Unknown Unknown Unknown
libc.so.6 00002B8E52CC4445 Unknown Unknown Unknown
pw.x 0000000000406869 Unknown Unknown Unknown
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
pw.x 0000000000EAAC45 Unknown Unknown Unknown
pw.x 0000000000EA8867 Unknown Unknown Unknown
pw.x 0000000000E3DC64 Unknown Unknown Unknown
pw.x 0000000000E3DA76 Unknown Unknown Unknown
pw.x 0000000000DC41B6 Unknown Unknown Unknown
pw.x 0000000000DCBB2E Unknown Unknown Unknown
libpthread.so.0 00002ABAB008D6D0 Unknown Unknown Unknown
libmpi.so.12 00002ABAAF5BC45C Unknown Unknown Unknown
libmpi.so.12 00002ABAAF6C7E39 Unknown Unknown Unknown
libmpi.so.12 00002ABAAF6C7B32 Unknown Unknown Unknown
libmpi.so.12 00002ABAAF5A12F9 Unknown Unknown Unknown
libmpi.so.12 00002ABAAF5A0D5D Unknown Unknown Unknown
libmpi.so.12 00002ABAAF5A0BDC Unknown Unknown Unknown
libmpi.so.12 00002ABAAF5A0B0C Unknown Unknown Unknown
libmpi.so.12 00002ABAAF5A2932 Unknown Unknown Unknown
libmpifort.so.12 00002ABAAF15AB1C Unknown Unknown Unknown
pw.x 0000000000BCEE47 bcast_real_ 37 mp_base.f90
pw.x 0000000000BAF7E4 mp_mp_mp_bcast_rv 395 mp.f90
pw.x 0000000000B6E881 pcdiaghg_ 363 cdiaghg.f90
pw.x 0000000000AF7304 protate_wfc_k_ 256 rotate_wfc_k.f90
pw.x 0000000000681E82 rotate_wfc_ 64 rotate_wfc.f90
pw.x 000000000064F519 diag_bands_ 423 c_bands.f90
pw.x 000000000064CAD4 c_bands_ 99 c_bands.f90
pw.x 000000000040C014 electrons_scf_ 552 electrons.f90
pw.x 0000000000408DBD electrons_ 146 electrons.f90
pw.x 000000000057582B run_pwscf_ 132 run_pwscf.f90
pw.x 0000000000406AC5 MAIN__ 77 pwscf.f90
pw.x 000000000040695E Unknown Unknown Unknown
libc.so.6 00002ABAB05BE445 Unknown Unknown Unknown
pw.x 0000000000406869 Unknown Unknown Unknown
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
pw.x 0000000000EAAC45 Unknown Unknown Unknown
pw.x 0000000000EA8867 Unknown Unknown Unknown
pw.x 0000000000E3DC64 Unknown Unknown Unknown
pw.x 0000000000E3DA76 Unknown Unknown Unknown
pw.x 0000000000DC41B6 Unknown Unknown Unknown
pw.x 0000000000DCBB2E Unknown Unknown Unknown
libpthread.so.0 00002ACB4BF866D0 Unknown Unknown Unknown
libmpi.so.12 00002ACB4B4B5775 Unknown Unknown Unknown
libmpi.so.12 00002ACB4B5C0E39 Unknown Unknown Unknown
libmpi.so.12 00002ACB4B5C0B32 Unknown Unknown Unknown
libmpi.so.12 00002ACB4B49A2F9 Unknown Unknown Unknown
libmpi.so.12 00002ACB4B499D5D Unknown Unknown Unknown
libmpi.so.12 00002ACB4B499BDC Unknown Unknown Unknown
libmpi.so.12 00002ACB4B499B0C Unknown Unknown Unknown
libmpi.so.12 00002ACB4B49B932 Unknown Unknown Unknown
libmpifort.so.12 00002ACB4B053B1C Unknown Unknown Unknown
pw.x 0000000000BCEE47 bcast_real_ 37 mp_base.f90
pw.x 0000000000BAF7E4 mp_mp_mp_bcast_rv 395 mp.f90
pw.x 0000000000B6E881 pcdiaghg_ 363 cdiaghg.f90
pw.x 0000000000AF7304 protate_wfc_k_ 256 rotate_wfc_k.f90
pw.x 0000000000681E82 rotate_wfc_ 64 rotate_wfc.f90
pw.x 000000000064F519 diag_bands_ 423 c_bands.f90
pw.x 000000000064CAD4 c_bands_ 99 c_bands.f90
pw.x 000000000040C014 electrons_scf_ 552 electrons.f90
pw.x 0000000000408DBD electrons_ 146 electrons.f90
pw.x 000000000057582B run_pwscf_ 132 run_pwscf.f90
pw.x 0000000000406AC5 MAIN__ 77 pwscf.f90
pw.x 000000000040695E Unknown Unknown Unknown
libc.so.6 00002ACB4C4B7445 Unknown Unknown Unknown
pw.x 0000000000406869 Unknown Unknown Unknown
Sample output below:
Parallel version (MPI), running on 16 processors
MPI processes distributed on 1 nodes
R & G space division: proc/nbgrp/npool/nimage = 16
Reading cobalt.scf
Message from routine read_cards :
DEPRECATED: no units specified in ATOMIC_POSITIONS card
Message from routine read_cards :
ATOMIC_POSITIONS: units set to alat
Current dimensions of program PWSCF are:
Max number of different atomic species (ntypx) = 10
Max number of k-points (npk) = 40000
Max angular momentum in pseudopotentials (lmaxx) = 3
Presently no symmetry can be used with electric field
file Co.pz-n-kjpaw_psl.1.0.0.UPF: wavefunction(s) 4S 3D renormalized
Subspace diagonalization in iterative solution of the eigenvalue problem:
one sub-group per band group will be used
scalapack distributed-memory algorithm (size of sub-group: 2* 2 procs)
Parallelization info
--------------------
sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
Min          13      13      4                 2449     2449     462
Max          14      14      5                 2516     2516     527
Sum         221     221     69                39945    39945    7777
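The output above reports a 2*2 diagonalization sub-group on 16 processes, and the serial algorithm is used with up to 7 processes. Both observations are consistent with a rule of the form "largest square grid that fits in half of the available processes". The Python sketch below encodes that rule purely as an assumption chosen to match these two data points; it is not taken from the Quantum ESPRESSO source, and the function names are hypothetical.

```python
import math


def default_diag_grid(nproc):
    """Return (rows, cols) of the ASSUMED default diagonalization grid.

    Assumption (not confirmed against the QE source): the grid is the
    largest square n x n with n*n <= max(nproc // 2, 1).
    """
    if nproc < 1:
        raise ValueError("nproc must be a positive integer")
    budget = max(nproc // 2, 1)   # at most half of the processes
    side = math.isqrt(budget)     # largest n with n*n <= budget
    return side, side


def uses_distributed_diag(nproc):
    """True when the assumed grid has more than one process."""
    rows, cols = default_diag_grid(nproc)
    return rows * cols > 1


if __name__ == "__main__":
    for n in (4, 7, 8, 16, 32):
        rows, cols = default_diag_grid(n)
        kind = "distributed" if uses_distributed_diag(n) else "serial"
        print(f"{n:3d} processes -> {rows}*{cols} grid ({kind})")
```

Under this assumed rule the switch from the serial to the distributed algorithm happens between 7 and 8 processes, which matches the point at which the crash starts to appear.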
Many thanks,
Alex Durie
PhD student
Open University
United Kingdom