[Pw_forum] problem with neb calculations / openmpi

Axel Kohlmeyer akohlmey at gmail.com
Thu Mar 15 21:00:36 CET 2012


On Thu, Mar 15, 2012 at 3:48 PM, Torstein Fjermestad
<torstein.fjermestad at kjemi.uio.no> wrote:
> Dear Dr. Kohlmeyer,
>
> Thank you for your suggestion. Since yesterday I have made some progress,
> and I now think I see more systematic behavior. What I did first was to
> recompile pw.x and neb.x with a newer version of Open MPI
> (openmpi/1.4.3.gnu). Apparently this made the Open MPI error message
> disappear.
> pw.x now works without problem, but neb.x only works when one node (with 8
> processors) is requested. I have run two tests requesting two nodes (16
> processors) and in both cases I see the same erroneous behavior:
>
> The output file is only 13 lines long and the last three lines are as
> follows:
>
>     Parallel version (MPI), running on     2 processors
>     path-images division:  nimage    =    8
>     R & G space division:  proc/pool =    2
>
> In the working directory of the calculations the files of type out.5_0
> contain several million repetitions of the error message
>
>
>      Message from routine  read_line :
>      read error
>
> I think this behavior is general, because it is rather unlikely that both
> calculations were accidentally submitted to a defective node. To me it seems
> like there is some kind of failure in the communication between the nodes.
>
> I should certainly contact the sysadmin of the machine, but in order to make
> their work easier, I would like to determine whether the erroneous behavior
> is caused by the machine or by the compilation/installation of Quantum
> ESPRESSO.

more likely in this case is that you specify, create,
and/or try to access a temporary directory that is only available
on the first node of a group of nodes. this can also happen if the
various nodes in the cluster don't share your home directory, but
that has become very unlikely with the way most clusters
are set up these days.
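[Editor's note: a quick way to test this hypothesis is to have every node of the allocation report whether it can see the scratch directory. This is only a sketch, assuming a SLURM cluster as in this thread; OUTDIR is a placeholder for whatever outdir is set in the QE input, not a path from the original posts.]

```shell
#!/bin/sh
# Sketch: does every node in the SLURM allocation see the same scratch
# directory? OUTDIR is a hypothetical placeholder; set it to the outdir
# from your &CONTROL namelist.
OUTDIR=${OUTDIR:-/scratch/$USER/neb_run}

if command -v srun >/dev/null 2>&1; then
    # launch one task per node; each prints its hostname and whether
    # the directory is visible there
    srun --ntasks-per-node=1 sh -c \
        "if [ -d '$OUTDIR' ]; then echo \"\$(hostname): ok\"; else echo \"\$(hostname): MISSING\"; fi"
else
    echo "srun not found; run this from inside a SLURM batch job"
fi
```

If any node reports MISSING, the "read error" loop from read_line on the non-head nodes would be explained.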

whether any of this applies can be seen from your input,
your job submission script, and information about how the
cluster itself is set up.
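[Editor's note: the quoted question below, how to check which Open MPI version QE was compiled against, can be answered for dynamically linked binaries with ldd. A sketch; the ./bin paths are examples, not the poster's actual install locations.]

```shell
#!/bin/sh
# Example paths only; point these at your actual QE executables.
for exe in ./bin/pw.x ./bin/neb.x; do
    if [ -x "$exe" ]; then
        echo "== $exe"
        # shows the full path of the MPI shared library being resolved
        ldd "$exe" | grep -i mpi
    else
        echo "not found: $exe"
    fi
done

# the Open MPI version currently loaded in the environment
# (first lines of ompi_info include the package and version):
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | head -n 3
fi
```

Comparing the library path that ldd reports with the path of the currently loaded openmpi module would show whether the binaries were built against a different Open MPI than the one used at run time.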

cheers,
    axel.

>
> If anyone has had similar experiences before, it would be nice if you could
> share ideas on possible causes.
>
> Thanks in advance.
>
>
> Yours sincerely,
>
> Torstein Fjermestad
> University of Oslo,
> Norway
>
>
> On Wed, 14 Mar 2012 16:36:15 -0400, Axel Kohlmeyer <akohlmey at gmail.com>
> wrote:
>>
>> On Wed, Mar 14, 2012 at 4:18 PM, Torstein Fjermestad
>> <torstein.fjermestad at kjemi.uio.no> wrote:
>>>
>>>  Dear all,
>>>
>>>  I recently installed quantum espresso v4.3.2 in my home directory at an
>>>  external supercomputer cluster.
>>>  The way I did this was to execute the following commands:
>>>
>>>  ./configure
>>>  make all
>>>
>>>  after first having loaded the MPI environment and the Fortran and C
>>>  compilers with the following commands:
>>>  module load openmpi
>>>  module load g95/093
>>>  module load gcc
>>>
>>>  ./configure was successful and make seemed to finish normally (at least
>>>  I did not get any error message).
>>>
>>>  So far I have only been using the pw.x and neb.x executables.
>>>  In a file named "slurm-jobID.out" that is generated by the queuing
>>>  system, I get the following message when running both pw.x and neb.x:
>>>
>>>  mca: base: component_find: unable to open
>>>  /site/VERSIONS/openmpi-1.3.3.gnu/lib/openmpi/mca_mtl_psm: perhaps a
>>>  missing symbol, or compiled for a different version of Open MPI?
>>>  (ignored)
>>>
>>>  This message seems rather clear, but I am not sure how relevant it is
>>>  because pw.x runs without problem on 64 processors (I have compared the
>>>  output with that generated on another machine). neb.x on the other hand
>>>  works when running on a single processor, but fails when running in
>>>  parallel (yes, I have used the -inp option).
>>>
>>>  The output of the neb calculation is only 13 lines and the last three
>>>  lines are
>>>
>>>      Parallel version (MPI), running on    16 processors
>>>      path-images division:  nimage    =   10
>>>      R & G space division:  proc/pool =   16
>>>
>>>
>>>  In the output files out.n_0, where n = 1, ..., 9, the error message
>>>
>>>      Message from routine  read_line :
>>>      read error
>>>
>>>  is repeated several thousand times.
>>>
>>>
>>>  I have a feeling that there is something I have got wrong with the
>>>  parallel environment. If I (accidentally) compiled QE against a different
>>>  Open MPI version than 1.3.3.gnu, it would be interesting to know which
>>>  one. Does anyone have an idea of how I can check this?
>>>
>>>  In case the cause of the problem is a different one, it would be nice
>>>  if someone had any suggestions on how to solve it.
>>
>>
>> this sounds a lot like one of the nodes that you are using
>> has a network problem and you are trying to read from an
>> NFS-exported directory but get only i/o errors. the OpenMPI
>> error message supports this. at least, i have only seen this
>> kind of error when one of the nodes in a parallel job had
>> to be rebooted hard because of an obscure and rarely
>> triggered bug in the ethernet driver.
>>
>> you should check whether this happens always, or only when
>> one specific node is assigned to your job.
>> i would also talk to the sysadmin of the machine.
>>
>> HTH,
>>    axel.
>>
>>
>>>
>>>  Thank you very much in advance.
>>>
>>>  Yours sincerely,
>>>
>>>  Torstein Fjermestad
>>>  University of Oslo,
>>>  Norway
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Pw_forum mailing list
>>> Pw_forum at pwscf.org
>>> http://www.democritos.it/mailman/listinfo/pw_forum
>
>



-- 
Dr. Axel Kohlmeyer
akohlmey at gmail.com  http://goo.gl/1wk0

College of Science and Technology
Temple University, Philadelphia PA, USA.


