[QE-users] error in QE6.7: too many communicators

jqhuang16b at imr.ac.cn
Fri Jan 8 13:21:27 CET 2021


Dear QE users and developers,

I have been using QE6.4.1 without any problems. A few days ago I tried to compile the latest QE6.7 with the same Intel compiler. The compilation went smoothly, but I ran into two issues during the calculations.


First, the calculations in QE6.7 seem much slower than in QE6.4.1.


Below is the output for the second q point of a phonon calculation on graphene. As you can see, every single step is roughly ten times slower (a per-iteration estimate follows the logs). The input files I used are attached.

################################

QE6.4.1

     Representation #  1 mode #   1

     Self-consistent Calculation

      iter #   1 total cpu time :   236.4 secs   av.it.:   6.5
      thresh= 1.000E-02 alpha_mix =  0.300 |ddv_scf|^2 =  2.305E-05

      iter #   2 total cpu time :   246.7 secs   av.it.:   8.7
      thresh= 4.801E-04 alpha_mix =  0.300 |ddv_scf|^2 =  1.071E-05

      iter #   3 total cpu time :   256.4 secs   av.it.:   8.0
      thresh= 3.273E-04 alpha_mix =  0.300 |ddv_scf|^2 =  2.762E-09

      iter #   4 total cpu time :   268.9 secs   av.it.:  11.6
      thresh= 5.256E-06 alpha_mix =  0.300 |ddv_scf|^2 =  4.235E-10

      iter #   5 total cpu time :   281.8 secs   av.it.:  12.0
      thresh= 2.058E-06 alpha_mix =  0.300 |ddv_scf|^2 =  7.233E-12

      iter #   6 total cpu time :   294.1 secs   av.it.:  11.3
      thresh= 2.689E-07 alpha_mix =  0.300 |ddv_scf|^2 =  1.057E-12

      iter #   7 total cpu time :   306.7 secs   av.it.:  11.6
      thresh= 1.028E-07 alpha_mix =  0.300 |ddv_scf|^2 =  1.011E-13

      iter #   8 total cpu time :   318.5 secs   av.it.:  10.5
      thresh= 3.179E-08 alpha_mix =  0.300 |ddv_scf|^2 =  2.282E-14

      iter #   9 total cpu time :   330.9 secs   av.it.:  11.4
      thresh= 1.511E-08 alpha_mix =  0.300 |ddv_scf|^2 =  8.720E-15



QE6.7




     Representation #   1 mode #   1

     Self-consistent Calculation

      iter #   1 total cpu time :   585.9 secs   av.it.:   6.8
      thresh= 1.000E-02 alpha_mix =  0.300 |ddv_scf|^2 =  7.246E-06

      iter #   2 total cpu time :   684.9 secs   av.it.:   9.5
      thresh= 2.692E-04 alpha_mix =  0.300 |ddv_scf|^2 =  1.917E-06

      iter #   3 total cpu time :   777.8 secs   av.it.:   9.0
      thresh= 1.385E-04 alpha_mix =  0.300 |ddv_scf|^2 =  2.398E-09

      iter #   4 total cpu time :   890.5 secs   av.it.:  11.4
      thresh= 4.897E-06 alpha_mix =  0.300 |ddv_scf|^2 =  1.623E-10

      iter #   5 total cpu time :  1006.1 secs   av.it.:  11.6
      thresh= 1.274E-06 alpha_mix =  0.300 |ddv_scf|^2 =  7.910E-12

      iter #   6 total cpu time :  1116.9 secs   av.it.:  11.4
      thresh= 2.812E-07 alpha_mix =  0.300 |ddv_scf|^2 =  1.176E-12

      iter #   7 total cpu time :  1231.4 secs   av.it.:  11.5
      thresh= 1.084E-07 alpha_mix =  0.300 |ddv_scf|^2 =  4.460E-14

      iter #   8 total cpu time :  1341.1 secs   av.it.:  11.2
      thresh= 2.112E-08 alpha_mix =  0.300 |ddv_scf|^2 =  1.019E-15

      iter #   9 total cpu time :  1452.8 secs   av.it.:  11.3
      thresh= 3.193E-09 alpha_mix =  0.300 |ddv_scf|^2 =  1.677E-16

###################################################################
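Reading the per-iteration cost off the totals above: QE6.4.1 goes from 236.4 s at iteration 1 to 330.9 s at iteration 9, i.e. (330.9 - 236.4)/8 ≈ 12 s per iteration, while QE6.7 goes from 585.9 s to 1452.8 s, i.e. (1452.8 - 585.9)/8 ≈ 108 s per iteration, so roughly a factor of 9 slower.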




Second, in QE6.7 the phonon run crashed after the calculation of the second q point finished, while in QE6.4.1 it ran to the end successfully.

If I restart the phonon calculation with recover=.true., it continues, but it crashes again after the third q point is finished.
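
For reference, the restart simply means adding recover=.true. to the &inputph namelist, along the lines of the sketch below (all values other than recover are placeholders here; my actual ph.in is attached):

    &inputph
       prefix  = 'graphene'       ! placeholder
       outdir  = './tmp'          ! placeholder
       fildyn  = 'graphene.dyn'   ! placeholder
       ldisp   = .true.
       nq1 = 4, nq2 = 4, nq3 = 1  ! placeholder q-point grid
       recover = .true.           ! resume from the last completed step
    /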

The error message is as follows:

Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc400000d, color=2, key=2, new_comm=0x7ffe99ed9478) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(672): Too many communicators (16357/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc400000d, color=3, key=1, new_comm=0x7ffe5fe7bc78) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(672): Too many communicators (16357/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc400000d, color=3, key=3, new_comm=0x7ffc20a56278) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(672): Too many communicators (16357/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc400000d, color=4, key=0, new_comm=0x7ffea22b5478) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(672): Too many communicators (16357/16384 free on this process; ignore_id=0)
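
As far as I understand, this MPICH-family error means that communicators created by MPI_Comm_split accumulate without ever being freed, until the per-process limit of 16384 context ids is reached. Just to illustrate the mechanism (a generic sketch, not QE code, and it says nothing about where the leak actually happens in QE or the MPI library), a loop like the following reproduces the same failure:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Split the world communicator over and over without freeing the
         * result: each call consumes one context id, and MPICH-based MPI
         * libraries abort with "Too many communicators" after ~16k splits. */
        for (int i = 0; i < 20000; i++) {
            MPI_Comm newcomm;
            MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);
            /* MPI_Comm_free(&newcomm);  <- freeing here would avoid the error */
        }

        MPI_Finalize();
        return 0;
    }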








Both QE6.4.1 and QE6.7 were compiled with Intel 2018 as follows (the make.inc is also attached):

source /THL7/software/intel2018.4/compilers_and_libraries_2018.5.274/linux/bin/compilervars.sh intel64

./configure --prefix=/THL7/home/soft/QuantumEspresso/qe-6.7 \
  --with-scalapack=intel CC="icc" FC="ifort" F77="ifort" MPICC="mpiicc" MPIF90="mpiifort" \
  DFLAGS="-D__DFTI -D__MPI -D__SCALAPACK -D__FFTW" \
  LDFLAGS=-shared-intel \
  FFT_LIBS="-L/THL7/software/intel2018.4/compilers_and_libraries_2018.5.274/linux/mkl/interfaces/fftw3xf -lfftw3xf_intel"




After configuration, it shows:

The following libraries have been found:
  BLAS_LIBS=  -lmkl_intel_lp64  -lmkl_sequential -lmkl_core
  LAPACK_LIBS=
  SCALAPACK_LIBS=-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
  FFT_LIBS=-L/THL7/software/intel2018.4/compilers_and_libraries_2018.5.274/linux/mkl/interfaces/fftw3xf -lfftw3xf_intel
  

It seems strange to me that the same compilation settings work well for QE6.4.1 but fail for QE6.7. I suspect some of these settings may no longer be suitable for QE6.7.


Regarding the first issue, I have also tested QE6.5 and QE6.6; the calculations there are as slow as in QE6.7. Can someone tell me what causes the slowdown after version 6.4.1?
Regarding the second issue, I have checked the Intel community forum: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/MPI-error-while-running-SIESTA-code/td-p/1073134 There, Intel staff said the same problem had been fixed in MKL 2018u3. However, on the one hand I am using Intel 2018u4, and on the other hand the same installation works fine with QE6.4.1. Can someone give me some advice?




Sorry for such a long post and for my poor English.




Best regards,

Jian-qi Huang









Jian-qi Huang

Magnetism and Magnetic Materials Division
Institute of Metal Research
Chinese Academy of Sciences
72 Wenhua Road, Shenyang 110016, China

email:jqhuang16b at imr.ac.cn


[Attachments: scf.in (2148 bytes), ph.in (358 bytes), make.inc (6272 bytes)]

