[Pw_forum] Fatal error in PMPI_Group_incl, possibly related to ScaLAPACK libraries

Ryan Herchig rch at mail.usf.edu
Tue Dec 6 19:02:48 CET 2016


Paolo,

  Thank you for the response.  I am running the code using

module purge
export OMP_NUM_THREADS=1
export PSM_RANKS_PER_CONTEXT=4
module add compilers/intel/2015_cluster_xe

#mpirun -ppn ${SLURM_NTASKS_PER_NODE} -n ${SLURM_NTASKS} /home/r/rch/espresso/openMP_EXE/pw.x -npools 4 -ntg 2 -in file.in

with

#SBATCH --ntasks-per-node=24
#SBATCH -N 2

If I remove the -npools 4 and -ntg 2 flags and rerun, I receive the same
error, though it is printed only twice rather than the 48 times I would
expect.  If I take your suggestion and change the mpirun line to

mpirun -ppn ${SLURM_NTASKS_PER_NODE} -n ${SLURM_NTASKS} /home/r/rch/espresso/openMP_EXE/pw.x -nd 1 -in file.in

it runs fine with 48 processors.  Combining all these flags with

mpirun -ppn ${SLURM_NTASKS_PER_NODE} -n ${SLURM_NTASKS} /home/r/rch/espresso/openMP_EXE/pw.x -nd 1 -npools 4 -ntg 2 -in file.in

also works well.  Including the -nd 1 flag seems to have fixed the problem;
the code also runs well with 96 processors and with different k-point and
task-group parallelizations.  Thank you for telling me about that flag, which
I had never heard of.  If you would like me to provide any additional
information about the way I have run the calculations for the PW_forum
record, please let me know.
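
For completeness, the full submission script that now works, assembled from
the pieces above, looks roughly like this; the shebang line is assumed, and
file.in stands in for the actual input file:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=24

module purge
# one OpenMP thread per MPI rank
export OMP_NUM_THREADS=1
export PSM_RANKS_PER_CONTEXT=4
module add compilers/intel/2015_cluster_xe

# -nd 1 turns off the ScaLAPACK dense-matrix diagonalization group,
# which appears to have been what triggered the PMPI_Group_incl error here
mpirun -ppn ${SLURM_NTASKS_PER_NODE} -n ${SLURM_NTASKS} \
    /home/r/rch/espresso/openMP_EXE/pw.x -nd 1 -npools 4 -ntg 2 -in file.in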

       Thank you, Ryan Herchig

        University of South Florida, Tampa FL, Department of Physics


On Tue, Dec 6, 2016 at 11:19 AM, Paolo Giannozzi <p.giannozzi at gmail.com>
wrote:

> I am not convinced that the problem you mention is the same as yours. In
> order to figure out if the problem arises from Scalapack, you should remove
> __SCALAPACK from DFLAGS and recompile: the code will use (much slower)
> internal routines for parallel dense-matrix diagonalization. You may also
> try to run with no dense-matrix diagonalization (-nd 1, not sure it is
> honored though).
>
> You should also report how you are running your code and, if using exotic
> parallelizations like "band groups" (-nb N), check if the problem you have
> is related to its usage.
>
> Paolo
>
>
> On Thu, Dec 1, 2016 at 11:37 PM, Ryan Herchig <rch at mail.usf.edu> wrote:
>
>> Hello all,
>>
>>     I am running pw.x in Quantum ESPRESSO version 5.4.0; however, if I try
>> to run the job using more than 2 nodes with 8 cores each, I receive the
>> following error:
>>
>> Fatal error in PMPI_Group_incl: Invalid rank, error stack:
>> PMPI_Group_incl(185).............: MPI_Group_incl(group=0x88000004, n=4,
>> ranks=0x2852700, new_group=0x7fff57564668) failed
>> MPIR_Group_check_valid_ranks(253): Invalid rank in rank array at index
>> 3; value is 33 but must be in the range 0 to 31
>>
>> I am building and running on a local cluster maintained by the university I
>> attend.  Each node has 2 x Intel Xeon E5-2670 (eight-core) processors, 32 GB
>> of memory, and QDR InfiniBand.  I found a previous thread,
>>
>> https://www.mail-archive.com/pw_forum@pwscf.org/msg27702.html
>>
>> involving espresso-5.3.0, in which another user seemed to be experiencing
>> the same issue; there it was determined that "The problem is related to the
>> obscure hacks needed to convince Scalapack to work in a subgroup of
>> processors."  The suggestion in that post was to change a line in
>> Modules/mp_global.f90 and recompile.  However, I am running spin-collinear
>> vdW-DF calculations, which I believe require at least version 5.4.0, and
>> the lines in the relevant subroutine of mp_global.f90 have changed;
>> furthermore, following that suggestion does not fix the issue.  It instead
>> produces the following compilation error:
>>
>> mp_global.f90(97): error #6631: A non-optional actual argument must be
>> present when invoking a procedure with an explicit interface.
>> [NPARENT_COMM]
>>     CALL mp_start_diag  ( ndiag_, intra_BGRP_comm )
>> ---------^
>> mp_global.f90(97): error #6631: A non-optional actual argument must be
>> present when invoking a procedure with an explicit interface.
>> [MY_PARENT_ID]
>>     CALL mp_start_diag  ( ndiag_, intra_BGRP_comm )
>> ---------^
>> compilation aborted for mp_global.f90 (code 1)
>>
>>
>> Does this problem with the ScaLAPACK libraries persist in newer versions,
>> or could these errors have a separate origin, possibly something I am doing
>> wrong during the build?  I have included below the make.sys that I am using
>> for "make pw".  If the error is due to the ScaLAPACK libraries, is there a
>> workaround that would allow the use of additional processors when running
>> calculations?  Thank you in advance.
>>
>>                            Thank you, Ryan Herchig
>>
>>                            University of South Florida, Department of Physics
>>
>>
>> .SUFFIXES :
>> .SUFFIXES : .o .c .f .f90
>>
>> .f90.o:
>>     $(MPIF90) $(F90FLAGS) -c $<
>>
>> # .f.o and .c.o: do not modify
>>
>> .f.o:
>>     $(F77) $(FFLAGS) -c $<
>>
>> .c.o:
>>     $(CC) $(CFLAGS)  -c $<
>>
>> TOPDIR = /work/r/rch/espresso-5.4.0
>>
>> MANUAL_DFLAGS  =
>> DFLAGS         =  -D__INTEL -D__FFTW3 -D__MPI -D__PARA -D__SCALAPACK
>> FDFLAGS        = $(DFLAGS) $(MANUAL_DFLAGS)
>>
>> IFLAGS         = -I../include -I/apps/intel/2015/composer_xe_2015.3.187/mkl/include:/apps/intel/2015/composer_xe_2015.3.187/tbb/include
>>
>> MOD_FLAG      = -I
>>
>> MPIF90         = mpif90
>> #F90           = ifort
>> CC             = icc
>> F77            = ifort
>>
>> CPP            = cpp
>> CPPFLAGS       = -P -C -traditional $(DFLAGS) $(IFLAGS)
>>
>> CFLAGS         = -O3 $(DFLAGS) $(IFLAGS)
>> F90FLAGS       = $(FFLAGS) -nomodule -fpp $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
>> FFLAGS         = -O2 -assume byterecl -g -traceback
>>
>> FFLAGS_NOOPT   = -O0 -assume byterecl -g -traceback
>>
>> FFLAGS_NOMAIN   = -nofor_main
>>
>> LD             = mpif90
>> LDFLAGS        =
>> LD_LIBS        =
>>
>> BLAS_LIBS      = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
>> BLAS_LIBS_SWITCH = external
>>
>> LAPACK_LIBS    = -L/apps/intel/2015/composer_xe_2015.3.187/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
>> LAPACK_LIBS_SWITCH = external
>>
>> ELPA_LIBS_SWITCH = disabled
>> SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_ilp64
>>
>> FFT_LIBS       = -L/apps/intel/2015/composer_xe_2015.3.187/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
>>
>> MPI_LIBS       =
>>
>> MASS_LIBS      =
>>
>> AR             = ar
>> ARFLAGS        = ruv
>>
>> RANLIB         = ranlib
>>
>> FLIB_TARGETS   = all
>>
>> LIBOBJS        = ../clib/clib.a ../iotk/src/libiotk.a
>> LIBS           = $(SCALAPACK_LIBS) $(LAPACK_LIBS) $(FFT_LIBS) $(BLAS_LIBS) $(MPI_LIBS) $(MASS_LIBS) $(LD_LIBS)
>>
>> WGET = wget -O
>>
>> PREFIX = /work/r/rch/espresso-5.4.0/EXE
>>
>
>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>
>

