[Pw_forum] mpi error using pw.x

Chong Wang ch-wang at outlook.com
Mon May 16 04:11:42 CEST 2016


Hi, Paolo


I have checked that my mpif90 calls gfortran, so there is no mix-up. Could you kindly share your make.sys with me? Thanks in advance!


Best!


Chong Wang

________________________________
From: pw_forum-bounces at pwscf.org <pw_forum-bounces at pwscf.org> on behalf of Paolo Giannozzi <p.giannozzi at gmail.com>
Sent: Monday, May 16, 2016 3:10 AM
To: PWSCF Forum
Subject: Re: [Pw_forum] mpi error using pw.x

Your make.sys shows clear signs of a mix-up between ifort and gfortran. Please verify that mpif90 calls ifort and not gfortran (or vice versa). Configure issues a warning if this happens.
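
With Intel MPI, the mpif90 wrapper normally drives gfortran, while mpiifort drives ifort, so it is worth checking what a given wrapper actually invokes. A minimal Python sketch of such a check follows. It assumes that the wrapper forwards --version to the underlying compiler (common, but not guaranteed for every MPI distribution); classify_compiler and wrapped_compiler are hypothetical helper names, not part of QE.

```python
import shutil
import subprocess


def classify_compiler(version_banner: str) -> str:
    """Guess the Fortran compiler family from a `--version` banner."""
    text = version_banner.lower()
    if "gnu fortran" in text:
        return "gfortran"
    if "ifort" in text or "intel" in text:
        return "ifort"
    return "unknown"


def wrapped_compiler(wrapper: str = "mpif90") -> str:
    """Run `<wrapper> --version` and classify the compiler behind it.

    Assumption: the MPI wrapper passes --version through to the compiler.
    """
    if shutil.which(wrapper) is None:
        return "not found"
    result = subprocess.run(
        [wrapper, "--version"], capture_output=True, text=True
    )
    # Some compilers print the banner on stderr, so inspect both streams.
    return classify_compiler(result.stdout + result.stderr)
```

Calling wrapped_compiler("mpif90") and wrapped_compiler("mpiifort") on the build machine and comparing the answers with the compiler named in make.sys is one way to spot a mismatch.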

I have successfully run your test on a machine with a recent Intel compiler and Intel MPI. The second output (run as mpirun -np 18 pw.x -nk 18....) is an example of what I mean by "type of parallelization": there are many different parallelization levels in QE. This one is over k-points (and in this case it runs faster on fewer processors than parallelization over plane waves).
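
As a rough illustration of k-point (pool) parallelization: -nk sets the number of pools, and the number of MPI processes must be a multiple of it. The Python sketch below computes the resulting layout. The even split of k-points, with the remainder going to the first pools, is an assumption made for illustration (QE's actual distribution may differ), and pool_layout as well as the k-point count used in the example are hypothetical.

```python
def pool_layout(nproc: int, npool: int, nks: int):
    """Return (processes per pool, list of k-points handled by each pool).

    nproc: total MPI processes (mpirun -np)
    npool: number of k-point pools (pw.x -nk)
    nks:   number of k-points to distribute (hypothetical value here)
    """
    if npool < 1 or nproc % npool != 0:
        raise ValueError(
            "number of processes must be a multiple of the number of pools"
        )
    procs_per_pool = nproc // npool
    # Assumed distribution: even split, remainder goes to the first pools.
    base, extra = divmod(nks, npool)
    kpoints_per_pool = [base + (1 if i < extra else 0) for i in range(npool)]
    return procs_per_pool, kpoints_per_pool
```

For example, pool_layout(18, 18, 40) describes 18 pools of one process each, whereas pool_layout(24, 1, 40) is the default single pool, in which all 24 processes share the plane-wave work for every k-point.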

Paolo

On Sun, May 15, 2016 at 6:01 PM, Chong Wang <ch-wang at outlook.com> wrote:

Hi,


I have done more tests:

1. Intel MPI 2015 yields a segmentation fault

2. Intel MPI 2013 yields the same error here

Did I do something wrong when compiling? Here's my make.sys:


# make.sys.  Generated from make.sys.in by configure.

# compilation rules

.SUFFIXES :
.SUFFIXES : .o .c .f .f90

# most fortran compilers can directly preprocess c-like directives: use
# $(MPIF90) $(F90FLAGS) -c $<
# if explicit preprocessing by the C preprocessor is needed, use:
# $(CPP) $(CPPFLAGS) $< -o $*.F90
# $(MPIF90) $(F90FLAGS) -c $*.F90 -o $*.o
# remember the tabulator in the first column !!!

.f90.o:
	$(MPIF90) $(F90FLAGS) -c $<

# .f.o and .c.o: do not modify

.f.o:
	$(F77) $(FFLAGS) -c $<

.c.o:
	$(CC) $(CFLAGS)  -c $<


# Top QE directory, not used in QE but useful for linking QE libs with plugins
# The following syntax should always point to TOPDIR:
#   $(dir $(abspath $(filter %make.sys,$(MAKEFILE_LIST))))

TOPDIR = /home/wangc/temp/espresso-5.4.0

# DFLAGS  = precompilation options (possible arguments to -D and -U)
#           used by the C compiler and preprocessor
# FDFLAGS = as DFLAGS, for the f90 compiler
# See include/defs.h.README for a list of options and their meaning
# With the exception of IBM xlf, FDFLAGS = $(DFLAGS)
# For IBM xlf, FDFLAGS is the same as DFLAGS with separating commas

# MANUAL_DFLAGS  = additional precompilation option(s), if desired
#                  BEWARE: it does not work for IBM xlf! Manually edit FDFLAGS
MANUAL_DFLAGS  =
DFLAGS         =  -D__GFORTRAN -D__STD_F95 -D__DFTI -D__MPI -D__PARA -D__SCALAPACK
FDFLAGS        = $(DFLAGS) $(MANUAL_DFLAGS)

# IFLAGS = how to locate directories with *.h or *.f90 file to be included
#          typically -I../include -I/some/other/directory/
#          the latter contains .e.g. files needed by FFT libraries

IFLAGS         = -I../include -I/opt/intel/compilers_and_libraries_2016.3.210/linux/mkl/include

# MOD_FLAGS = flag used by f90 compiler to locate modules
# Each Makefile defines the list of needed modules in MODFLAGS

MOD_FLAG      = -I

# Compilers: fortran-90, fortran-77, C
# If a parallel compilation is desired, MPIF90 should be a fortran-90
# compiler that produces executables for parallel execution using MPI
# (such as for instance mpif90, mpf90, mpxlf90,...);
# otherwise, an ordinary fortran-90 compiler (f90, g95, xlf90, ifort,...)
# If you have a parallel machine but no suitable candidate for MPIF90,
# try to specify the directory containing "mpif.h" in IFLAGS
# and to specify the location of MPI libraries in MPI_LIBS

MPIF90         = mpif90
#F90           = gfortran
CC             = cc
F77            = gfortran

# C preprocessor and preprocessing flags - for explicit preprocessing,
# if needed (see the compilation rules above)
# preprocessing flags must include DFLAGS and IFLAGS

CPP            = cpp
CPPFLAGS       = -P -C -traditional $(DFLAGS) $(IFLAGS)

# compiler flags: C, F90, F77
# C flags must include DFLAGS and IFLAGS
# F90 flags must include MODFLAGS, IFLAGS, and FDFLAGS with appropriate syntax

CFLAGS         = -O3 $(DFLAGS) $(IFLAGS)
F90FLAGS       = $(FFLAGS) -x f95-cpp-input $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
FFLAGS         = -O3 -g

# compiler flags without optimization for fortran-77
# the latter is NEEDED to properly compile dlamch.f, used by lapack

FFLAGS_NOOPT   = -O0 -g

# compiler flag needed by some compilers when the main program is not fortran
# Currently used for Yambo

FFLAGS_NOMAIN   =

# Linker, linker-specific flags (if any)
# Typically LD coincides with F90 or MPIF90, LD_LIBS is empty

LD             = mpif90
LDFLAGS        =  -g -pthread
LD_LIBS        =

# External Libraries (if any) : blas, lapack, fft, MPI

# If you have nothing better, use the local copy :
# BLAS_LIBS = /your/path/to/espresso/BLAS/blas.a
# BLAS_LIBS_SWITCH = internal

BLAS_LIBS      =   -lmkl_gf_lp64  -lmkl_sequential -lmkl_core
BLAS_LIBS_SWITCH = external

# If you have nothing better, use the local copy :
# LAPACK_LIBS = /your/path/to/espresso/lapack-3.2/lapack.a
# LAPACK_LIBS_SWITCH = internal
# For IBM machines with essl (-D__ESSL): load essl BEFORE lapack !
# remember that LAPACK_LIBS precedes BLAS_LIBS in loading order

LAPACK_LIBS    =
LAPACK_LIBS_SWITCH = external

ELPA_LIBS_SWITCH = disabled
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64

# nothing needed here if the the internal copy of FFTW is compiled
# (needs -D__FFTW in DFLAGS)

FFT_LIBS       =

# For parallel execution, the correct path to MPI libraries must
# be specified in MPI_LIBS (except for IBM if you use mpxlf)

MPI_LIBS       =

# IBM-specific: MASS libraries, if available and if -D__MASS is defined in FDFLAGS

MASS_LIBS      =

# ar command and flags - for most architectures: AR = ar, ARFLAGS = ruv

AR             = ar
ARFLAGS        = ruv

# ranlib command. If ranlib is not needed (it isn't in most cases) use
# RANLIB = echo

RANLIB         = ranlib

# all internal and external libraries - do not modify

FLIB_TARGETS   = all

LIBOBJS        = ../clib/clib.a ../iotk/src/libiotk.a
LIBS           = $(SCALAPACK_LIBS) $(LAPACK_LIBS) $(FFT_LIBS) $(BLAS_LIBS) $(MPI_LIBS) $(MASS_LIBS) $(LD_LIBS)

# wget or curl - useful to download from network
WGET = wget -O

# Install directory - not currently used
PREFIX = /usr/local
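
When comparing two builds, it can help to pull the compiler and library settings out of a make.sys like the one above and look at them side by side. The following Python sketch does that under simplifying assumptions: it only understands plain `NAME = value` lines (no line continuations, no `:=` or `+=`, no make functions), and parse_make_vars is a hypothetical helper, not a QE tool.

```python
import re

# Matches simple assignments such as "MPIF90         = mpif90".
ASSIGN = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$")


def parse_make_vars(text: str) -> dict:
    """Collect NAME = value assignments, skipping comment lines."""
    variables = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out settings such as "#F90 = gfortran"
        match = ASSIGN.match(line)
        if match:
            variables[match.group(1)] = match.group(2)
    return variables


def summarize(variables: dict, names=("MPIF90", "F77", "CC", "DFLAGS",
                                      "BLAS_LIBS", "SCALAPACK_LIBS")) -> str:
    """Format the settings most relevant to a compiler/library check."""
    return "\n".join(f"{n:15s} {variables.get(n, '<unset>')}" for n in names)
```

Printing summarize(parse_make_vars(open("make.sys").read())) for each build gives a short listing that can be compared by eye or with diff.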


Cheers!


Chong Wang

________________________________
From: pw_forum-bounces at pwscf.org <pw_forum-bounces at pwscf.org> on behalf of Paolo Giannozzi <p.giannozzi at gmail.com>
Sent: Sunday, May 15, 2016 8:28:26 PM
To: PWSCF Forum
Subject: Re: [Pw_forum] mpi error using pw.x

It looks like a compiler/mpi bug, since there is nothing special in your input and in your execution, unless you find evidence that the problem is reproducible on other compiler/mpi versions.

Paolo

On Sun, May 15, 2016 at 10:11 AM, Chong Wang <ch-wang at outlook.com> wrote:

Hi,


Thank you for replying.


More details:


1. input data:

&control
    calculation='scf'
    restart_mode='from_scratch',
    pseudo_dir = '../pot/',
    outdir='./out/'
    prefix='BaTiO3'
/
&system
    nbnd = 48
    ibrav = 0, nat = 5, ntyp = 3
    ecutwfc = 50
    occupations='smearing', smearing='gaussian', degauss=0.02
/
&electrons
    conv_thr = 1.0e-8
/
ATOMIC_SPECIES
 Ba 137.327 Ba.pbe-mt_fhi.UPF
 Ti 204.380 Ti.pbe-mt_fhi.UPF
 O  15.999  O.pbe-mt_fhi.UPF
ATOMIC_POSITIONS
 Ba 0.0000000000000000   0.0000000000000000   0.0000000000000000
 Ti 0.5000000000000000   0.5000000000000000   0.4819999933242795
 O  0.5000000000000000   0.5000000000000000   0.0160000007599592
 O  0.5000000000000000  -0.0000000000000000   0.5149999856948849
 O  0.0000000000000000   0.5000000000000000   0.5149999856948849
K_POINTS (automatic)
11 11 11 0 0 0
CELL_PARAMETERS {angstrom}
3.999800000000001       0.000000000000000       0.000000000000000
0.000000000000000       3.999800000000001       0.000000000000000
0.000000000000000       0.000000000000000       4.018000000000000


2. number of processors:
I tested with 24 cores and with 8 cores; both yield the same result.

3. type of parallelization:
I am not sure what you mean. I run pw.x with:
mpirun  -np 24 pw.x < BTO.scf.in >> output

'which mpirun' output:
/opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/bin/mpirun

4. when the error occurs:
In the middle of the run. The last few lines of the output are:
     total cpu time spent up to now is       32.9 secs

     total energy              =    -105.97885119 Ry
     Harris-Foulkes estimate   =    -105.99394457 Ry
     estimated scf accuracy    <       0.03479229 Ry

     iteration #  7     ecut=    50.00 Ry     beta=0.70
     Davidson diagonalization with overlap
     ethr =  1.45E-04,  avg # of iterations =  2.7

     total cpu time spent up to now is       37.3 secs

     total energy              =    -105.99039982 Ry
     Harris-Foulkes estimate   =    -105.99025175 Ry
     estimated scf accuracy    <       0.00927902 Ry

     iteration #  8     ecut=    50.00 Ry     beta=0.70
     Davidson diagonalization with overlap

5. Error message:
Something like:
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffc03ae5f38, comm_new=0x7ffc03ae5e90) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffd10080408, comm_new=0x7ffd10080360) failed
PMPI_Cart_sub(178)...................:
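
The message above says that the process has no communicator context ids left (0 of 16384 free) at the moment MPI_Cart_sub tries to create a new communicator. The Python toy model below is not MPI code and makes no claim about where such a leak would come from in this run. It only illustrates how a fixed pool is exhausted when communicators are created repeatedly without being released (in MPI, with MPI_Comm_free). ContextIdPool and calls_until_failure are hypothetical names, and a real MPI library also reserves some ids for its own use.

```python
class ContextIdPool:
    """Toy model of a per-process pool of communicator context ids."""

    def __init__(self, capacity: int = 16384):
        self.capacity = capacity
        self.in_use = 0

    def create(self) -> None:
        """Model creating a communicator (e.g. via MPI_Cart_sub)."""
        if self.in_use >= self.capacity:
            raise RuntimeError(
                f"Too many communicators (0/{self.capacity} free)"
            )
        self.in_use += 1

    def free(self) -> None:
        """Model releasing a communicator (e.g. via MPI_Comm_free)."""
        if self.in_use == 0:
            raise RuntimeError("no communicator to free")
        self.in_use -= 1


def calls_until_failure(pool: ContextIdPool, calls: int, free_after_use: bool):
    """Return the 1-based index of the first failing create, or None."""
    for i in range(1, calls + 1):
        try:
            pool.create()
        except RuntimeError:
            return i
        if free_after_use:
            pool.free()
    return None
```

In this model, creating without freeing fails on call 16385, while creating and freeing in pairs never fails, which is why an error that appears only after several SCF iterations is suggestive of communicators accumulating over time.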

Cheers!

Chong
________________________________
From: pw_forum-bounces at pwscf.org <pw_forum-bounces at pwscf.org> on behalf of Paolo Giannozzi <p.giannozzi at gmail.com>
Sent: Sunday, May 15, 2016 3:43 PM
To: PWSCF Forum
Subject: Re: [Pw_forum] mpi error using pw.x

Please tell us what is wrong and we will fix it.

Seriously: nobody can answer your question unless you specify, as a strict minimum, input data, number of processors and type of parallelization that trigger the error, and where the error occurs (at startup, later, in the middle of the run, ...).

Paolo

On Sun, May 15, 2016 at 7:50 AM, Chong Wang <ch-wang at outlook.com> wrote:

I compiled Quantum ESPRESSO 5.4 with Intel MPI and MKL 2016 Update 3.

However, when I ran pw.x the following errors were reported:

...
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffde1391dd8, comm_new=0x7ffde1391d30) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffc02ad7eb8, comm_new=0x7ffc02ad7e10) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7fffb24e60f8, comm_new=0x7fffb24e6050) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)


I searched online and found that this might be caused by hitting the OS limit on the number of open files. However, after I increased the limit on open files per process from 1024 to 40960, the error persists.
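
For reference, the per-process limit on open files can be read from Python's standard resource module (Unix only), as in the sketch below; open_file_limits and describe are hypothetical helper names. Note that the error text itself counts communicators rather than file descriptors, which is consistent with the observation that raising the file limit did not help.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open files for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def describe(limit: int) -> str:
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("open files: soft =", describe(soft), "hard =", describe(hard))
```

The soft limit is the value that `ulimit -n` reports in the shell that launches mpirun; processes started on other nodes may see different limits.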


What's wrong here?


Chong Wang

Ph. D. candidate

Institute for Advanced Study, Tsinghua University, Beijing, 100084

_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum



--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222

