[Pw_forum] Code starting trouble on the PC Cluster

Shujun Hu hushujun at mail.sdu.edu.cn
Fri Aug 25 07:45:15 CEST 2006


Hi Axel,
It is my pleasure to receive your advice, and I am sorry for the ambiguous
description of the problem.
  Neither the input file nor the configuration of the code is suspect. My
trouble is starting the program, not running it. When I try to start the program
on 4 nodes of the cluster (each node has 2 cores, as in a Pentium-D, so 8
processes will be started) and type the command:
# mpiexec -n 8 pw.x -npool 4<job.inp
nothing appears on the screen, even after waiting overnight! Then I found that
the command has to be typed again and again:

# mpiexec -n 8 pw.x -npool 4<job.inp
Nothing appears on the screen. Then I type it again:

# mpiexec -n 8 pw.x -npool 4<job.inp
Nothing appears on the screen. Then I type it again:

......

# mpiexec -n 8 pw.x -npool 4<job.inp

     Program PWSCF     v.3.0    starts ...
     Today is 14May2006 at 13: 0:51 
     Parallel version (MPI)
     Number of processors in use:       8
......

Surprisingly, it works!!! From then on the run proceeds normally and reaches
convergence. I checked the process list and found that lots of processes
named pw.x under python were sleeping. I do not know exactly what python is here.
Maybe it is part of the MPICH program. So I guess the problem comes from the
connection between the nodes.
  By the way, another problem also puzzles me. There are 6 nodes in my cluster,
so the most efficient way should be to run 12 processes. However, it seems that
the command:

# mpiexec -n 12 pw.x -npool 6<job.inp              #command-1

will sleep forever. But the following one:

# mpiexec -n 16 pw.x -npool 8<job.inp              #command-2

runs well. The FFT grid is the default setting of 75*75*50 and the k-point grid
is set to 4*4*6. Taking symmetry into account, the number of k-points is 60
(given in the output file). In the user's guide, the "Parallelization issues"
part says the number of processors is Np=Npk*Npr. The number of pools (Npk, 6 or
8 in the commands above) should be a divisor of the number of k-points. Npr,
here equal to 2, certainly meets the requirement that it should be a divisor of
the FFT grid dimension along the z axis. But Npk is strange: 6 divides 60 while
8 does not, so command-1 should be more correct, or at least better, than
command-2. However, the opposite is what happens. I do not know why.
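
The arithmetic behind this comparison can be written out in a few lines of
Python. This is only a sketch of the two divisor rules as quoted from the
user's guide, not anything taken from the PWscf code: the function name
check_layout and its arguments are made up for illustration, and the numbers
(60 k-points, third FFT dimension 50) are the ones reported above.

```python
def check_layout(n_procs, n_pools, n_kpoints, nr3):
    """Check Np = Npk * Npr against the two divisor rules.

    n_procs   -- Np, the value given to "mpiexec -n"
    n_pools   -- Npk, the value given to "-npool"
    n_kpoints -- number of k-points after symmetry reduction
    nr3       -- third (z) dimension of the FFT grid
    All names here are hypothetical helpers, not PWscf identifiers.
    """
    if n_procs % n_pools != 0:
        raise ValueError("Np must be a multiple of Npk")
    n_per_pool = n_procs // n_pools  # Npr, processes inside each pool
    return {
        "Npr": n_per_pool,
        "pools_divide_kpoints": n_kpoints % n_pools == 0,
        "npr_divides_nr3": nr3 % n_per_pool == 0,
    }


# command-1: mpiexec -n 12 pw.x -npool 6
cmd1 = check_layout(n_procs=12, n_pools=6, n_kpoints=60, nr3=50)
# command-2: mpiexec -n 16 pw.x -npool 8
cmd2 = check_layout(n_procs=16, n_pools=8, n_kpoints=60, nr3=50)

print("command-1:", cmd1)
print("command-2:", cmd2)
```

With these numbers both commands give Npr = 2, which divides 50, but only
command-1 has a pool count that divides 60. So by the rules as quoted,
command-1 is the one that should be preferred, which is why the observed
behaviour is puzzling.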
  Any suggestions will be appreciated. Thanks.

                                       Shujun




