[Pw_forum] "MPI_COMM_RANK : Null communicator..." error through Platform LSF system

wangxinquan wangxinquan at tju.edu.cn
Thu Apr 10 04:47:07 CEST 2008


Dear all,
    To solve "MPI_COMM_RANK..." problem, I have modified make.sys.
    IFLAGS=-I../include -I/usr/local/mpich/1.2.6..13a/gm-2.1.3aa2nks3/smp/intel32/ssh/include
    MPI_LIBS=/usr/local/mpich/1.2.6..13a/gm-2.1.3aa2nks3/smp/intel32/ssh/lib/libmpichf90.a
    For IFLAGS parameter, the path of mpif.h was included.
    Finally, the "MPI_COMM_RANK..." problem has been solved.
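
    (In case it helps, this is how I am double-checking that the compiler
    wrappers and the launcher point at the same MPICH-GM installation; only a
    rough sketch, and the long paths are just the ones on this cluster:)

    # Show which include/lib paths the Fortran wrapper actually uses
    # (MPICH-style wrappers accept -show; adjust if your wrapper differs).
    mpif90 -show

    # Check which launchers are found at run time.
    which mpirun.lsf mpirun

    # Confirm mpif.h really is in the directory added to IFLAGS.
    ls /usr/local/mpich/1.2.6..13a/gm-2.1.3aa2nks3/smp/intel32/ssh/include/mpif.h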

    Unfortunately, a new error appeared.
    The output (if any) follows:
    test icymoon
    [0] MPI Abort by user Aborting program!
    [0] Aborting program!
    test icymoon end

    CRASH file:   
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task #         0
     from  read_namelists  : error #         1
      reading namelist control 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task #         1
     from  read_namelists  : error #         1
      reading namelist control 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    The input file is as follows:
    ------------------------------------------------------------------
    <cu.scf.in>
    &control
       calculation='scf'
       restart_mode='from_scratch',
       pseudo_dir='/nfs/s04r2p1/wangxq_tj/espresso-3.2.3/pseudo/',
       outdir='/nfs/s04r2p1/wangxq_tj/',
       prefix='cu'
    /
    &system
       ... ...
    ------------------------------------------------------------------
    I have checked the "control" section but cannot find any error in it.
    What confuses me is whether this is an input-file error or an mpich error;
    one check I intend to try is sketched below.
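
    (The check I mean, only a rough sketch, assuming mpirun.lsf passes
    redirected stdin on to the tasks the same way plain mpirun does, which I
    am not sure about:)

    # Run inside the same LSF job script, before pw.x. If cat prints the
    # file, the redirected input reaches at least one task, so the namelist
    # itself is the more likely suspect; if nothing is printed, the launcher
    # is not forwarding stdin and pw.x never sees &control at all.
    mpirun.lsf /bin/cat < /nfs/s04r2p1/wangxq_tj/cu.scf.in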
    
    Any help will be deeply appreciated!

Best regards, XQ Wang

------------------------------

Message: 3
Date: Wed, 09 Apr 2008 11:50:07 +0800
From: ""  <wangxinquan at tju.edu.cn >
Subject: [Pw_forum] "MPI_COMM_RANK : Null communicator..." error
through Platform LSF system
To: pw_forum at pwscf.org
Message-ID: <407713007.16944 at tju.edu.cn>
Content-Type: text/plain

Dear users and developers,

     Recently I ran a test on the Nankai Stars HPC. The error message
"MPI_COMM_RANK : Null communicator... Aborting program!" appeared when I ran
an scf calculation on 2 CPUs (2 nodes).

     To solve this problem, I found some hints via Google, such as "please
make sure that you used the same version of MPI for compiling and running, and
included the corresponding header file mpi.h in your code."
(http://www.ncsa.edu/UserInfo/Resources/Hardware/XeonCluster/FAQ/XeonJobs.html)

     According to the pwscf mailing list, "dynamic port number used in mpi
intercommunication is not working. This is most probably an installation issue
regarding LSF." may be one possible cause.
(http://www.democritos.it/pipermail/pw_forum/2007-June/006689.html)

     According to the pwscf manual, "Your machine might be configured so as to
disallow interactive execution" may be another possible cause.

     My question is: to solve the "MPI_COMM_RANK..." problem, do I need to
modify the pwscf code, the mpich_gm installation, or the LSF system?

Calculation Details are as follows:
---------------------------------------------------------------------------------
HPC background:
Nankai Stars (http://202.113.29.200/introduce.htm)
800 Xeon 3.06 GHz CPUs (400 nodes)
800 GB Memory
53 TB High-Speed Storage
Myrinet
Parallel jobs are run and debugged through the Platform LSF system.
Mpich_gm driver: 1.2.6..13a
Espresso-3.2.3
---------------------------------------------------------------------------------

---------------------------------------------------------------------------------
Installation:
./configure CC=mpicc F77=mpif77 F90=mpif90
make all
---------------------------------------------------------------------------------
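
(After configure, a sanity check I find useful, just a sketch; the exact flag
names may differ between espresso versions:)

# The DFLAGS line should contain the parallel preprocessor flags (named
# __MPI and __PARA in espresso 3.2.x, as far as I can tell), and
# IFLAGS/MPI_LIBS should point at the MPICH-GM installation used at run time.
grep -E 'DFLAGS|IFLAGS|MPI_LIBS' make.sys

# The wrappers given to configure should resolve to that same installation.
which mpicc mpif77 mpif90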

---------------------------------------------------------------------------------
Submit script :
#!/bin/bash
#BSUB -q normal
#BSUB -J test.icymoon
#BSUB -c 3:00
#BSUB -a "mpich_gm"
#BSUB -o %J.log
#BSUB -n 2 

cd /nfs/s04r2p1/wangxq_tj
echo "test icymoon"

mpirun.lsf /nfs/s04r2p1/wangxq_tj/espresso-3.2.3/bin/pw.x  <
/nfs/s04r2p1/wangxq_tj/cu.scf.in  > cu.scf.out

echo "test icymoon end"
---------------------------------------------------------------------------------
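
(Separately, a stripped-down job of the same shape, to see whether the
mpich_gm/LSF integration starts both tasks at all, independent of pw.x; only
a sketch reusing the options above:)

#!/bin/bash
#BSUB -q normal
#BSUB -J test.mpi
#BSUB -a "mpich_gm"
#BSUB -o %J.log
#BSUB -n 2

cd /nfs/s04r2p1/wangxq_tj

# hostname needs no input redirection, so if both nodes report here while
# pw.x still aborts, the problem is more likely stdin/input handling than
# the MPI start-up itself.
mpirun.lsf /bin/hostname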

---------------------------------------------------------------------------------
Output file (%J.log):

… …
The output (if any) follows:

test icymoon
0 - MPI_COMM_RANK : Null communicator
[0]  Aborting program !
[0] Aborting program!
test icymoon end
---------------------------------------------------------------------------------

---------------------------------------------------------------------------------
<cu.scf.in >
&control

    calculation='scf'
    restart_mode='from_scratch',
    pseudo_dir = '/nfs/s04r2p1/wangxq_tj/espresso-3.2.3/pseudo/',
    outdir='/nfs/s04r2p1/wangxq_tj/',
    prefix='cu'
 /

 &system

    ibrav = 2, celldm(1) =6.73, nat= 1, ntyp= 1,
    ecutwfc = 25.0, ecutrho = 300.0
    occupations='smearing', smearing='methfessel-paxton', degauss=0.02
    noncolin = .true.
    starting_magnetization(1) = 0.5
    angle1(1) = 90.0
    angle2(1) =  0.0
 /

 &electrons

    conv_thr = 1.0e-8
    mixing_beta = 0.7 
 /

ATOMIC_SPECIES
 Cu 63.55 Cu.pz-d-rrkjus.UPF
ATOMIC_POSITIONS
 Cu 0.0 0.0 0.0
K_POINTS (automatic)
 8 8 8 0 0 0
--------------------------------------------------------------------------------

---------------------------------------------------------------------------------
cu.scf.out

1 - MPI_COMM_RANK : Null communicator
[1]  Aborting program !
[1] Aborting program!

TID  HOST_NAME    COMMAND_LINE            STATUS            TERMINATION_TIME
==== ========== ================  =======================  ===================
0001 node333                      Exit (255)               04/08/2008 19:36:59
0002 node284                      Exit (255)               04/08/2008 19:36:59

---------------------------------------------------------------------------------

Any help will be deeply appreciated!

Best regards,

=====================================
X.Q. Wang
wangxinquan at tju.edu.cn
School of Chemical Engineering and Technology
Tianjin University
92 Weijin Road, Tianjin, P. R. China
tel: 86-22-27890268, fax: 86-22-27892301
=====================================





=====================================
X.Q. Wang
wangxinquan at tju.edu.cn
School of Chemical Engineering and Technology
Tianjin University
92 Weijin Road, Tianjin, P. R. China
tel: 86-22-27890268, fax: 86-22-27892301
=====================================

