[Pw_forum] mpi error using pw.x
Paolo Giannozzi
p.giannozzi at gmail.com
Sun May 15 21:10:53 CEST 2016
Your make.sys shows clear signs of a mix-up between ifort and gfortran. Please
verify that mpif90 calls ifort and not gfortran (or vice versa); configure
issues a warning when this happens.
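A quick way to check which back-end compiler the MPI wrapper invokes (a minimal sketch, assuming Intel MPI's MPICH-style wrappers are on your PATH; with Intel MPI, mpiifort is the ifort wrapper, while mpif90 usually wraps gfortran):

    # print the full compile command the wrapper would run
    mpif90 -show
    mpiifort -show
    # or ask the underlying compiler for its version string
    mpif90 --version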
I have successfully run your test on a machine with a recent Intel compiler
and Intel MPI. The second output (run as mpirun -np 18 pw.x -nk
18....) is an example of what I mean by "type of parallelization": there
are many different parallelization levels in QE. This one is over k-points (and
in this case it runs faster on fewer processors than parallelization over plane
waves).
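For reference, a k-point-parallel run along those lines would look like this (a sketch only; the input file name is taken from the message quoted below, and -nk distributes the k-points over 18 pools):

    mpirun -np 18 pw.x -nk 18 -in BTO.scf.in > BTO.scf.out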
Paolo
On Sun, May 15, 2016 at 6:01 PM, Chong Wang <ch-wang at outlook.com> wrote:
> Hi,
>
>
> I have done more tests:
>
> 1. Intel MPI 2015 yields a segmentation fault.
>
> 2. Intel MPI 2013 yields the same error reported here.
>
> Did I do something wrong with the compilation? Here's my make.sys:
>
>
> # make.sys. Generated from make.sys.in by configure.
>
> # compilation rules
>
> .SUFFIXES :
> .SUFFIXES : .o .c .f .f90
>
> # most fortran compilers can directly preprocess c-like directives: use
> # $(MPIF90) $(F90FLAGS) -c $<
> # if explicit preprocessing by the C preprocessor is needed, use:
> # $(CPP) $(CPPFLAGS) $< -o $*.F90
> # $(MPIF90) $(F90FLAGS) -c $*.F90 -o $*.o
> # remember the tabulator in the first column !!!
>
> .f90.o:
>         $(MPIF90) $(F90FLAGS) -c $<
>
> # .f.o and .c.o: do not modify
>
> .f.o:
>         $(F77) $(FFLAGS) -c $<
>
> .c.o:
>         $(CC) $(CFLAGS) -c $<
>
> # Top QE directory, not used in QE but useful for linking QE libs with plugins
> # The following syntax should always point to TOPDIR:
> # $(dir $(abspath $(filter %make.sys,$(MAKEFILE_LIST))))
>
> TOPDIR = /home/wangc/temp/espresso-5.4.0
>
> # DFLAGS = precompilation options (possible arguments to -D and -U)
> # used by the C compiler and preprocessor
> # FDFLAGS = as DFLAGS, for the f90 compiler
> # See include/defs.h.README for a list of options and their meaning
> # With the exception of IBM xlf, FDFLAGS = $(DFLAGS)
> # For IBM xlf, FDFLAGS is the same as DFLAGS with separating commas
>
> # MANUAL_DFLAGS = additional precompilation option(s), if desired
> # BEWARE: it does not work for IBM xlf! Manually edit FDFLAGS
> MANUAL_DFLAGS =
> DFLAGS = -D__GFORTRAN -D__STD_F95 -D__DFTI -D__MPI -D__PARA -D__SCALAPACK
> FDFLAGS = $(DFLAGS) $(MANUAL_DFLAGS)
>
> # IFLAGS = how to locate directories with *.h or *.f90 file to be included
> # typically -I../include -I/some/other/directory/
> # the latter contains .e.g. files needed by FFT libraries
>
> IFLAGS = -I../include -I/opt/intel/compilers_and_libraries_2016.3.210/linux/mkl/include
>
> # MOD_FLAGS = flag used by f90 compiler to locate modules
> # Each Makefile defines the list of needed modules in MODFLAGS
>
> MOD_FLAG = -I
>
> # Compilers: fortran-90, fortran-77, C
> # If a parallel compilation is desired, MPIF90 should be a fortran-90
> # compiler that produces executables for parallel execution using MPI
> # (such as for instance mpif90, mpf90, mpxlf90,...);
> # otherwise, an ordinary fortran-90 compiler (f90, g95, xlf90, ifort,...)
> # If you have a parallel machine but no suitable candidate for MPIF90,
> # try to specify the directory containing "mpif.h" in IFLAGS
> # and to specify the location of MPI libraries in MPI_LIBS
>
> MPIF90 = mpif90
> #F90 = gfortran
> CC = cc
> F77 = gfortran
>
> # C preprocessor and preprocessing flags - for explicit preprocessing,
> # if needed (see the compilation rules above)
> # preprocessing flags must include DFLAGS and IFLAGS
>
> CPP = cpp
> CPPFLAGS = -P -C -traditional $(DFLAGS) $(IFLAGS)
>
> # compiler flags: C, F90, F77
> # C flags must include DFLAGS and IFLAGS
> # F90 flags must include MODFLAGS, IFLAGS, and FDFLAGS with appropriate syntax
>
> CFLAGS = -O3 $(DFLAGS) $(IFLAGS)
> F90FLAGS = $(FFLAGS) -x f95-cpp-input $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
> FFLAGS = -O3 -g
>
> # compiler flags without optimization for fortran-77
> # the latter is NEEDED to properly compile dlamch.f, used by lapack
>
> FFLAGS_NOOPT = -O0 -g
>
> # compiler flag needed by some compilers when the main program is not fortran
> # Currently used for Yambo
>
> FFLAGS_NOMAIN =
>
> # Linker, linker-specific flags (if any)
> # Typically LD coincides with F90 or MPIF90, LD_LIBS is empty
>
> LD = mpif90
> LDFLAGS = -g -pthread
> LD_LIBS =
>
> # External Libraries (if any) : blas, lapack, fft, MPI
>
> # If you have nothing better, use the local copy :
> # BLAS_LIBS = /your/path/to/espresso/BLAS/blas.a
> # BLAS_LIBS_SWITCH = internal
>
> BLAS_LIBS = -lmkl_gf_lp64 -lmkl_sequential -lmkl_core
> BLAS_LIBS_SWITCH = external
>
> # If you have nothing better, use the local copy :
> # LAPACK_LIBS = /your/path/to/espresso/lapack-3.2/lapack.a
> # LAPACK_LIBS_SWITCH = internal
> # For IBM machines with essl (-D__ESSL): load essl BEFORE lapack !
> # remember that LAPACK_LIBS precedes BLAS_LIBS in loading order
>
> LAPACK_LIBS =
> LAPACK_LIBS_SWITCH = external
>
> ELPA_LIBS_SWITCH = disabled
> SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
>
> # nothing needed here if the the internal copy of FFTW is compiled
> # (needs -D__FFTW in DFLAGS)
>
> FFT_LIBS =
>
> # For parallel execution, the correct path to MPI libraries must
> # be specified in MPI_LIBS (except for IBM if you use mpxlf)
>
> MPI_LIBS =
>
> # IBM-specific: MASS libraries, if available and if -D__MASS is defined in FDFLAGS
>
> MASS_LIBS =
>
> # ar command and flags - for most architectures: AR = ar, ARFLAGS = ruv
>
> AR = ar
> ARFLAGS = ruv
>
> # ranlib command. If ranlib is not needed (it isn't in most cases) use
> # RANLIB = echo
>
> RANLIB = ranlib
>
> # all internal and external libraries - do not modify
>
> FLIB_TARGETS = all
>
> LIBOBJS = ../clib/clib.a ../iotk/src/libiotk.a
> LIBS = $(SCALAPACK_LIBS) $(LAPACK_LIBS) $(FFT_LIBS) $(BLAS_LIBS) $(MPI_LIBS) $(MASS_LIBS) $(LD_LIBS)
>
> # wget or curl - useful to download from network
> WGET = wget -O
>
> # Install directory - not currently used
> PREFIX = /usr/local
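As a reference point for the mix-up Paolo describes above: this make.sys combines Intel MPI's BLACS/ScaLAPACK libraries with gfortran and MKL's gfortran interface (-lmkl_gf_lp64). A consistent Intel build is normally obtained by reconfiguring with the Intel wrappers rather than by hand-editing make.sys; a minimal sketch, assuming mpiifort/ifort/icc are installed and on the PATH (the exact variables accepted by configure may differ between QE versions):

    cd /home/wangc/temp/espresso-5.4.0
    make veryclean
    ./configure MPIF90=mpiifort F90=ifort F77=ifort CC=icc
    # the regenerated make.sys should now contain -D__INTEL rather than -D__GFORTRAN
    # and link MKL through -lmkl_intel_lp64 rather than -lmkl_gf_lp64
    make pw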
>
> Cheers!
>
>
> Chong Wang
> ------------------------------
> *From:* pw_forum-bounces at pwscf.org <pw_forum-bounces at pwscf.org> on behalf
> of Paolo Giannozzi <p.giannozzi at gmail.com>
> *Sent:* Sunday, May 15, 2016 8:28:26 PM
>
> *To:* PWSCF Forum
> *Subject:* Re: [Pw_forum] mpi error using pw.x
>
> Unless you find evidence that the problem is reproducible with other
> compiler/MPI versions, it looks like a compiler/MPI bug: there is nothing
> special in your input or in your execution.
>
> Paolo
>
> On Sun, May 15, 2016 at 10:11 AM, Chong Wang <ch-wang at outlook.com> wrote:
>
>> Hi,
>>
>>
>> Thank you for replying.
>>
>>
>> More details:
>>
>>
>> 1. input data:
>>
>> &control
>> calculation='scf'
>> restart_mode='from_scratch',
>> pseudo_dir = '../pot/',
>> outdir='./out/'
>> prefix='BaTiO3'
>> /
>> &system
>> nbnd = 48
>> ibrav = 0, nat = 5, ntyp = 3
>> ecutwfc = 50
>> occupations='smearing', smearing='gaussian', degauss=0.02
>> /
>> &electrons
>> conv_thr = 1.0e-8
>> /
>> ATOMIC_SPECIES
>> Ba 137.327 Ba.pbe-mt_fhi.UPF
>> Ti 204.380 Ti.pbe-mt_fhi.UPF
>> O 15.999 O.pbe-mt_fhi.UPF
>> ATOMIC_POSITIONS
>> Ba 0.0000000000000000 0.0000000000000000 0.0000000000000000
>> Ti 0.5000000000000000 0.5000000000000000 0.4819999933242795
>> O 0.5000000000000000 0.5000000000000000 0.0160000007599592
>> O 0.5000000000000000 -0.0000000000000000 0.5149999856948849
>> O 0.0000000000000000 0.5000000000000000 0.5149999856948849
>> K_POINTS (automatic)
>> 11 11 11 0 0 0
>> CELL_PARAMETERS {angstrom}
>> 3.999800000000001 0.000000000000000 0.000000000000000
>> 0.000000000000000 3.999800000000001 0.000000000000000
>> 0.000000000000000 0.000000000000000 4.018000000000000
>>
>> 2. number of processors:
>> I tested 24 cores and 8 cores, and both yield the same result.
>>
>> 3. type of parallelization:
>> I don't know what you mean by this. I run pw.x with:
>> mpirun -np 24 pw.x < BTO.scf.in >> output
>>
>> 'which mpirun' output:
>> /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/bin/mpirun
>>
>> 4. when the error occurs:
>> in the middle of the run. The last few lines of the output are:
>> total cpu time spent up to now is 32.9 secs
>>
>> total energy = -105.97885119 Ry
>> Harris-Foulkes estimate = -105.99394457 Ry
>> estimated scf accuracy < 0.03479229 Ry
>>
>> iteration # 7 ecut= 50.00 Ry beta=0.70
>> Davidson diagonalization with overlap
>> ethr = 1.45E-04, avg # of iterations = 2.7
>>
>> total cpu time spent up to now is 37.3 secs
>>
>> total energy = -105.99039982 Ry
>> Harris-Foulkes estimate = -105.99025175 Ry
>> estimated scf accuracy < 0.00927902 Ry
>>
>> iteration # 8 ecut= 50.00 Ry beta=0.70
>> Davidson diagonalization with overlap
>>
>> 5. Error message:
>> Something like:
>> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
>> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,
>> remain_dims=0x7ffc03ae5f38, comm_new=0x7ffc03ae5e90) failed
>> PMPI_Cart_sub(178)...................:
>> MPIR_Comm_split_impl(270)............:
>> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384
>> free on this process; ignore_id=0)
>> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
>> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,
>> remain_dims=0x7ffd10080408, comm_new=0x7ffd10080360) failed
>> PMPI_Cart_sub(178)...................:
>>
>> Cheers!
>>
>> Chong
>> ------------------------------
>> *From:* pw_forum-bounces at pwscf.org <pw_forum-bounces at pwscf.org> on
>> behalf of Paolo Giannozzi <p.giannozzi at gmail.com>
>> *Sent:* Sunday, May 15, 2016 3:43 PM
>> *To:* PWSCF Forum
>> *Subject:* Re: [Pw_forum] mpi error using pw.x
>>
>> Please tell us what is wrong and we will fix it.
>>
>> Seriously: nobody can answer your question unless you specify, as a
>> strict minimum, input data, number of processors and type of
>> parallelization that trigger the error, and where the error occurs (at
>> startup, later, in the middle of the run, ...).
>>
>> Paolo
>>
>> On Sun, May 15, 2016 at 7:50 AM, Chong Wang <ch-wang at outlook.com> wrote:
>>
>>> I compiled Quantum ESPRESSO 5.4 with Intel MPI and MKL 2016 Update 3.
>>>
>>> However, when I ran pw.x, the following errors were reported:
>>>
>>> ...
>>> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384
>>> free on this process; ignore_id=0)
>>> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
>>> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,
>>> remain_dims=0x7ffde1391dd8, comm_new=0x7ffde1391d30) failed
>>> PMPI_Cart_sub(178)...................:
>>> MPIR_Comm_split_impl(270)............:
>>> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384
>>> free on this process; ignore_id=0)
>>> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
>>> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,
>>> remain_dims=0x7ffc02ad7eb8, comm_new=0x7ffc02ad7e10) failed
>>> PMPI_Cart_sub(178)...................:
>>> MPIR_Comm_split_impl(270)............:
>>> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384
>>> free on this process; ignore_id=0)
>>> Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
>>> PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,
>>> remain_dims=0x7fffb24e60f8, comm_new=0x7fffb24e6050) failed
>>> PMPI_Cart_sub(178)...................:
>>> MPIR_Comm_split_impl(270)............:
>>> MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384
>>> free on this process; ignore_id=0)
>>>
>>> I googled and found that this might be caused by hitting the OS limit on the
>>> number of open files. However, after I increased the open-file limit per
>>> process from 1024 to 40960, the error persists.
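For reference, the per-process limit in question is the one reported and raised by the shell's ulimit built-in (a sketch; how the change is made persistent depends on the system):

    ulimit -n          # show the current open-file limit
    ulimit -n 40960    # raise it for the current shell and its child processes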
>>>
>>>
>>> What's wrong here?
>>>
>>>
>>> Chong Wang
>>>
>>> Ph. D. candidate
>>>
>>> Institute for Advanced Study, Tsinghua University, Beijing, 100084
>>>
>>>
>>
>>
>>
>> --
>> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>> Phone +39-0432-558216, fax +39-0432-558222
>>
>>
>>
>
>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>
>
>
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222
Attachments: prova2.24p (13404 bytes) and prova2.18k (13155 bytes), the two pw.x outputs referred to above.
URL: <http://lists.quantum-espresso.org/pipermail/users/attachments/20160515/5bf94f18/attachment.obj>
URL: <http://lists.quantum-espresso.org/pipermail/users/attachments/20160515/5bf94f18/attachment-0001.obj>