[QE-users] error in QE6.7: too many communicators
jqhuang16b at imr.ac.cn
Fri Jan 8 13:21:27 CET 2021
Dear QE users and developers,
I have been using QE 6.4.1 quite normally. A few days ago, I tried to compile the latest QE 6.7 with the same Intel compiler. The compilation went smoothly, but two issues appeared in the calculations.
First, QE 6.7 seems much slower than QE 6.4.1.
Below is the second q point of a phonon calculation on graphene. As you can see, each iteration is about 10 times slower (the input files I used are in the attachment).
################################
QE6.4.1
Representation # 1 mode # 1
Self-consistent Calculation
iter # 1 total cpu time : 236.4 secs av.it.: 6.5
thresh= 1.000E-02 alpha_mix = 0.300 |ddv_scf|^2 = 2.305E-05
iter # 2 total cpu time : 246.7 secs av.it.: 8.7
thresh= 4.801E-04 alpha_mix = 0.300 |ddv_scf|^2 = 1.071E-05
iter # 3 total cpu time : 256.4 secs av.it.: 8.0
thresh= 3.273E-04 alpha_mix = 0.300 |ddv_scf|^2 = 2.762E-09
iter # 4 total cpu time : 268.9 secs av.it.: 11.6
thresh= 5.256E-06 alpha_mix = 0.300 |ddv_scf|^2 = 4.235E-10
iter # 5 total cpu time : 281.8 secs av.it.: 12.0
thresh= 2.058E-06 alpha_mix = 0.300 |ddv_scf|^2 = 7.233E-12
iter # 6 total cpu time : 294.1 secs av.it.: 11.3
thresh= 2.689E-07 alpha_mix = 0.300 |ddv_scf|^2 = 1.057E-12
iter # 7 total cpu time : 306.7 secs av.it.: 11.6
thresh= 1.028E-07 alpha_mix = 0.300 |ddv_scf|^2 = 1.011E-13
iter # 8 total cpu time : 318.5 secs av.it.: 10.5
thresh= 3.179E-08 alpha_mix = 0.300 |ddv_scf|^2 = 2.282E-14
iter # 9 total cpu time : 330.9 secs av.it.: 11.4
thresh= 1.511E-08 alpha_mix = 0.300 |ddv_scf|^2 = 8.720E-15
QE6.7
Representation # 1 mode # 1
Self-consistent Calculation
iter # 1 total cpu time : 585.9 secs av.it.: 6.8
thresh= 1.000E-02 alpha_mix = 0.300 |ddv_scf|^2 = 7.246E-06
iter # 2 total cpu time : 684.9 secs av.it.: 9.5
thresh= 2.692E-04 alpha_mix = 0.300 |ddv_scf|^2 = 1.917E-06
iter # 3 total cpu time : 777.8 secs av.it.: 9.0
thresh= 1.385E-04 alpha_mix = 0.300 |ddv_scf|^2 = 2.398E-09
iter # 4 total cpu time : 890.5 secs av.it.: 11.4
thresh= 4.897E-06 alpha_mix = 0.300 |ddv_scf|^2 = 1.623E-10
iter # 5 total cpu time : 1006.1 secs av.it.: 11.6
thresh= 1.274E-06 alpha_mix = 0.300 |ddv_scf|^2 = 7.910E-12
iter # 6 total cpu time : 1116.9 secs av.it.: 11.4
thresh= 2.812E-07 alpha_mix = 0.300 |ddv_scf|^2 = 1.176E-12
iter # 7 total cpu time : 1231.4 secs av.it.: 11.5
thresh= 1.084E-07 alpha_mix = 0.300 |ddv_scf|^2 = 4.460E-14
iter # 8 total cpu time : 1341.1 secs av.it.: 11.2
thresh= 2.112E-08 alpha_mix = 0.300 |ddv_scf|^2 = 1.019E-15
iter # 9 total cpu time : 1452.8 secs av.it.: 11.3
thresh= 3.193E-09 alpha_mix = 0.300 |ddv_scf|^2 = 1.677E-16
###################################################################
Second, in QE 6.7 the phonon run crashed after the second q point finished, while QE 6.4.1 ran to the end successfully.
If I restart the phonon calculation with recover=.true., it continues running, but crashes again after the third q point finishes.
The error message is as follows:
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc400000d, color=2, key=2, new_comm=0x7ffe99ed9478) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(672): Too many communicators (16357/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc400000d, color=3, key=1, new_comm=0x7ffe5fe7bc78) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(672): Too many communicators (16357/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc400000d, color=3, key=3, new_comm=0x7ffc20a56278) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(672): Too many communicators (16357/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc400000d, color=4, key=0, new_comm=0x7ffea22b5478) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(672): Too many communicators (16357/16384 free on this process; ignore_id=0)
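If I read the error correctly, each MPI process in MPICH-based implementations has a fixed pool of 16384 communicator context IDs; every MPI_Comm_split consumes one, and only MPI_Comm_free returns it. The crash would then mean communicators are being created faster than they are freed. I am only guessing at the mechanism, but here is a toy model (plain Python, purely illustrative, not real MPI) of that kind of exhaustion:

```python
# Toy model of MPICH context-ID exhaustion (illustrative only; not real MPI).
# Each process owns a fixed pool of context IDs: MPI_Comm_split consumes one
# per new communicator, and MPI_Comm_free returns it to the pool.

POOL_SIZE = 16384  # context IDs per process in MPICH-based implementations


class ContextIdPool:
    def __init__(self, size=POOL_SIZE):
        self.free = size

    def comm_split(self):
        """Create a new communicator: consumes one context ID."""
        if self.free == 0:
            raise RuntimeError(
                "Fatal error in PMPI_Comm_split: Too many communicators")
        self.free -= 1

    def comm_free(self):
        """Destroy a communicator: returns its context ID to the pool."""
        self.free += 1


# A code path that splits a communicator on every iteration (e.g. once per
# q point or per SCF step) without a matching free eventually exhausts the
# pool, no matter how large it is:
pool = ContextIdPool()
for _ in range(POOL_SIZE):
    pool.comm_split()      # leaked: no matching comm_free()
try:
    pool.comm_split()      # the next split fails
except RuntimeError as e:
    print(e)               # the familiar "Too many communicators" error
```

If this is what is happening, it would point to a communicator leak introduced somewhere between QE 6.4.1 and 6.7 rather than to the compiler or MKL, which might explain why the same build settings work for 6.4.1.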
Both QE 6.4.1 and 6.7 were compiled with Intel 2018 as follows (the make.inc is also in the attachment):
source /THL7/software/intel2018.4/compilers_and_libraries_2018.5.274/linux/bin/compilervars.sh intel64
./configure --prefix=/THL7/home/soft/QuantumEspresso/qe-6.7
--with-scalapack=intel CC="icc" FC="ifort" F77="ifort" MPICC="mpiicc" MPIF90="mpiifort"
DFLAGS="-D__DFTI -D__MPI -D__SCALAPACK -D__FFTW"
LDFLAGS=-shared-intel
FFT_LIBS="-L/THL7/software/intel2018.4/compilers_and_libraries_2018.5.274/linux/mkl/interfaces/fftw3xf -lfftw3xf_intel"
After configuration, it shows:
The following libraries have been found:
BLAS_LIBS= -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
LAPACK_LIBS=
SCALAPACK_LIBS=-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
FFT_LIBS=-L/THL7/software/intel2018.4/compilers_and_libraries_2018.5.274/linux/mkl/interfaces/fftw3xf -lfftw3xf_intel
It is strange to me that the same compilation settings work well for QE 6.4.1 but fail for QE 6.7; I guess some of the settings are no longer suitable for QE 6.7.
Regarding the first issue, I have also tested QE 6.5 and QE 6.6, and the calculations are as slow as in QE 6.7. Can someone tell me what caused the slowdown after version 6.4.1?
Regarding the second issue, I checked the Intel community: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/MPI-error-while-running-SIESTA-code/td-p/1073134 The staff there said the same problem was fixed in MKL 2018u3. However, on the one hand I am using Intel 2018u4, and on the other hand the same setup works well for QE 6.4.1. Can someone give me some advice?
Sorry for such a long post and my poor English.
Best regards,
Jian-qi Huang
Magnetism and Magnetic Materials Division
Institute of Metal Research
Chinese Academy of Sciences
72 Wenhua Road, Shenyang 110016, China
email:jqhuang16b at imr.ac.cn
-------------- attachments --------------
scf.in (2148 bytes): <http://lists.quantum-espresso.org/pipermail/users/attachments/20210108/1efcbfb4/attachment.obj>
ph.in (358 bytes): <http://lists.quantum-espresso.org/pipermail/users/attachments/20210108/1efcbfb4/attachment-0001.obj>
make.inc (6272 bytes): <http://lists.quantum-espresso.org/pipermail/users/attachments/20210108/1efcbfb4/attachment-0002.obj>