[Pw_forum] pw.x crash on LSF/mvapich

Kiss, Ioan kissi at uni-mainz.de
Tue Oct 12 20:02:00 CEST 2010


Dear PWSCF users and developers,

I have a problem running pw.x at our computer center.
The MPI environment is mvapich_1.1, the queuing system is LSF, and I have
compiled PWSCF with the Intel compiler suite together with the MKL libraries.
MKL threading is turned off by exporting OMP_NUM_THREADS=1.
The machines are 8-core Xeons with QDR InfiniBand and 48 GB of ECC memory
per node.
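
For completeness, the jobs are submitted through LSF with the site's
mvapich_wrapper script, roughly along these lines (a simplified sketch, not
the literal script; the wrapper call and the variables are the ones from the
job output further below):

    #!/bin/bash
    #BSUB -n 48                    # 24 cores work, 48/72/144 crash
    #BSUB -o pwscf.%J.out

    # keep MKL single-threaded, one MPI rank per core
    export OMP_NUM_THREADS=1

    mvapich_wrapper MV2_CPU_MAPPING=0:1:2:3:4:5:6:7 ./pwTest.x -in INP-PWSCF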

I would like to perform some geometry optimizations on Cd-doped
CuInSe2 with PWSCF version 4.1.2.
The FFT grid for the respective slab is 150 x 150 x 144, and the job runs
fine on 24 CPUs (i.e. 3 nodes with 8 cores each).
However, with the same binary and input file on 48, 72 or 144 CPU cores,
the job crashes right after the wavefunction initialization (even at 144
cores each MPI task would still own at least one of the 144 FFT planes
along z, so as far as I understand the grid itself should not be the
limiting factor):

     Self-consistent Calculation

     iteration #  1     ecut=    25.00 Ry     beta=0.70
     Davidson diagonalization with overlap
Signal 15 received.
.
.
.
Signal 15 received.
Job  /usr/local/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/mvapich_wrapper VIADEV_USE_SHMEM_ALLREDUCE=0
 VIADEV_USE_SHMEM_REDUCE=0 VIADEV_USE_SHMEM_BARRIER=0 DISABLE_RDMA_ALLTOALL=1
DISABLE_RDMA_ALLGATHER=1 DISABLE_RDMA_BARRIER=1 MV2_CPU_MAPPING=0:1:2:3:4:5:6:7 ./pwTest.x -in INP-PWSCF

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00000 moment1    /usr/local/lsf/l  Exit (1)                 10/12/2010 19:20:36
.
.
.
00001 moment1    /usr/local/lsf/l  Exit (174)              10/12/2010 19:20:36

As you can see above, I have already tried to deactivate the shared-memory
and RDMA collective optimizations in mvapich (the VIADEV_* and
DISABLE_RDMA_* variables in the command line), but that did not help.
As far as I can tell, the "Signal 15 received" messages are just the
remaining ranks being killed with SIGTERM after one task has already died
(the Exit 174 above).
Strangely, on the same machine I can run CPMD without any issues, so I am
really wondering what I am doing wrong, or what I should change to fix this
problem. I have tried several different MKL versions and so on, but to be
honest it seems that I just cannot fix it.
Also, with the same input file and 48-72 CPUs the job finishes nicely at the
Juelich supercomputer center and also on our department's tiny local cluster
running OpenMPI.
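On the local cluster the same run is started with a plain OpenMPI launcher,
something like:

    mpirun -np 48 ./pwTest.x -in INP-PWSCF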

Do you have any ideas why the machine under LSF/mvapich stops cooperating
with PWSCF above 24 CPU cores, or what could be done to remedy this issue?
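
For instance, would it make sense to play with the pw.x parallelization
levels? The command line above uses only the defaults, and I could imagine
splitting the ranks into pools so that each plane-wave/FFT group spans
fewer tasks; something like this (the -npool value below is just a guess
on my side):

    # hypothetical: 48 ranks split into 4 pools of 12 tasks each
    mvapich_wrapper MV2_CPU_MAPPING=0:1:2:3:4:5:6:7 ./pwTest.x -npool 4 -in INP-PWSCF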


Thanks in advance for any helpful comment,

Janos.


==========================================
  Dr. Janos Kiss      e-mail: kissi at uni-mainz.de
  Johannes Gutenberg-Universitaet
  Institut f. Anorg. u. Analyt. Chemie
  AK Prof. Dr. Claudia Felser
  Staudinger Weg 9 / Raum 01-230
  55128 Mainz/ Germany
  Phone: +49-(0)6131-39-22703
  Fax:     +49-(0)6131-39-26267
  Web:     http://www.superconductivity.de/
==========================================


