[Pw_forum] problem with neb calculations / openmpi
Torstein Fjermestad
torstein.fjermestad at kjemi.uio.no
Thu Mar 15 20:48:18 CET 2012
Dear Dr. Kohlmeyer,
Thank you for your suggestion. Since yesterday I have made some
progress, and I now think I see more systematic behavior. The first thing
I did was to recompile pw.x and neb.x against a newer version of Open MPI
(openmpi/1.4.3.gnu). Apparently this made the Open MPI error message
disappear.
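For completeness, the recompilation was just a clean rebuild against the
newer MPI module, roughly along these lines (a sketch only; the exact
module-swap syntax may differ between sites):

  module unload openmpi
  module load openmpi/1.4.3.gnu
  module load g95/093 gcc
  make clean
  ./configure
  make all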
pw.x now works without problems, but neb.x only works when a single node
(with 8 processors) is requested. I have run two tests requesting two
nodes (16 processors), and in both cases I see the same erroneous
behavior:
The output file is only 13 lines long, and the last three lines are as
follows:
Parallel version (MPI), running on 2 processors
path-images division: nimage = 8
R & G space division: proc/pool = 2
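For reference, the two-node runs were requested roughly as follows (a
sketch only; the input file name is generic, but the image and processor
counts are the ones used here):

  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=8
  mpirun -np 16 /path/to/neb.x -nimage 8 -inp neb.in > neb.out

So neb.x should see all 16 processors, yet the header above reports only 2.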
In the working directory of the calculations, the files of type out.5_0
contain several million repetitions of the error message
Message from routine read_line :
read error
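(A quick way to see the scale of the problem, in case it helps:

  grep -c "read error" out.*_0

counts the repetitions in each of these files.)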
I think this behavior is general, because it is rather unlikely that
both calculations were accidentally submitted to a defective node. To me it
seems as if there is some kind of failure in the communication between
the nodes.
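One simple check I can do before involving anyone else is to launch
something trivial over the same two-node allocation, for example

  mpirun -np 16 hostname

inside the job script. If that already misbehaves, the problem lies with
the MPI setup or the nodes rather than with neb.x itself. (This is only a
rough idea; the exact launcher options depend on the site.)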
I should certainly contact the sysadmins of the machine, but in order to
make their work easier, I would first like to determine whether the erroneous
behavior is caused by the machine or by the compilation/installation of
Quantum ESPRESSO.
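On my side, one thing I can verify is which Open MPI the executables were
actually linked against, e.g.

  ldd ./bin/pw.x | grep -i mpi
  ldd ./bin/neb.x | grep -i mpi
  ompi_info | head

and compare that with the openmpi/1.4.3.gnu module loaded at run time
(the ./bin paths are simply where the executables end up in my build tree).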
If anyone has had similar experiences before, it would be nice if you
could share ideas on possible causes.
Thanks in advance.
Yours sincerely,
Torstein Fjermestad
University of Oslo,
Norway
On Wed, 14 Mar 2012 16:36:15 -0400, Axel Kohlmeyer <akohlmey at gmail.com>
wrote:
> On Wed, Mar 14, 2012 at 4:18 PM, Torstein Fjermestad
> <torstein.fjermestad at kjemi.uio.no> wrote:
>> Dear all,
>>
>> I recently installed Quantum ESPRESSO v4.3.2 in my home directory on an
>> external supercomputer cluster.
>> The way I did this was to execute the following commands:
>>
>> ./configure
>> make all
>>
>> after first having loaded the MPI environment and the Fortran and C
>> compilers with the following commands:
>> module load openmpi
>> module load g95/093
>> module load gcc
>>
>> ./configure was successful and make seemed to finish normally (at least
>> I did not get any error message).
>>
>> So far I have only been using the pw.x and neb.x executables.
>> In a file named "slurm-jobID.out" that is generated by the queuing system,
>> I get the following message when running both pw.x and neb.x:
>>
>> mca: base: component_find: unable to open
>> /site/VERSIONS/openmpi-1.3.3.gnu/lib/openmpi/mca_mtl_psm: perhaps a
>> missing symbol, or compiled for a different version of Open MPI?
>> (ignored)
>>
>> This message seems rather clear, but I am not sure how relevant it is,
>> because pw.x runs without problems on 64 processors (I have compared the
>> output with that generated on another machine). neb.x, on the other hand,
>> works when running on a single processor but fails when running in
>> parallel (yes, I have used the -inp option).
>>
>> The output of the NEB calculation is only 13 lines long, and the last
>> three lines are:
>>
>> Parallel version (MPI), running on 16 processors
>> path-images division: nimage = 10
>> R & G space division: proc/pool = 16
>>
>>
>> In the output files out.n_0 where n={1,9} the error message
>>
>> Message from routine read_line :
>> read error
>>
>> is repeated several thousand times.
>>
>>
>> I have a feeling that there is something I have got wrong with the
>> parallel environment. If I (accidentally) compiled QE for a different
>> Open MPI version than 1.3.3.gnu, it would be interesting to know which
>> one. Does anyone have an idea how I can check this?
>>
>> In case the cause of the problem is a different one, it would be nice
>> if someone had any suggestions on how to solve it.
>
> this sounds a lot like one of the nodes that you are using has a network
> problem and you are trying to read from an NFS-exported directory, but get
> only i/o errors. the OpenMPI-based error supports this. at least, i have
> only seen this kind of error when one of the nodes in a parallel job had to
> be rebooted hard because of an obscure and rarely triggered bug in the
> ethernet driver.
>
> you should check whether this happens always or only when one specific
> node is assigned to your job.
> i would also talk to the sysadmin of the machine.
>
> HTH,
> axel.
>
>
>>
>> Thank you very much in advance.
>>
>> Yours sincerely,
>>
>> Torstein Fjermestad
>> University of Oslo,
>> Norway
>>
>> _______________________________________________
>> Pw_forum mailing list
>> Pw_forum at pwscf.org
>> http://www.democritos.it/mailman/listinfo/pw_forum