[Pw_forum] problem with neb calculations / openmpi

Wed Mar 14 21:36:15 CET 2012

On Wed, Mar 14, 2012 at 4:18 PM, Torstein Fjermestad
<torstein.fjermestad at kjemi.uio.no> wrote:
>  Dear all,
>
>  I recently installed Quantum ESPRESSO v4.3.2 in my home directory on an
>  external supercomputer cluster.
>  The way I did this was to execute the following commands:
>
>  ./configure
>  make all
>
>  after first loading the MPI environment and the Fortran and C
>  compilers with the following commands:
>  module load openmpi
>  module load g95/093
>  module load gcc
>
>  ./configure was successful and make seemed to finish normally (at least
>  I did not get any error message).
>
>  So far I have only been using the pw.x and neb.x executables.
>  In a file named "slurm-jobID.out" that is generated by the queuing
>  system, I get the following message when running both pw.x and neb.x:
>
>  mca: base: component_find: unable to open
>  /site/VERSIONS/openmpi-1.3.3.gnu/lib/openmpi/mca_mtl_psm: perhaps a
>  missing symbol, or compiled for a different version of Open MPI?
>  (ignored)
>
>  This message seems rather clear, but I am not sure how relevant it is
>  because pw.x runs without problem on 64 processors (I have compared the
>  output with that generated on another machine). neb.x on the other hand
>  works when running on a single processor, but fails when running in
>  parallel (yes, I have used the -inp option).
>
>  The output of the neb calculation is only 13 lines and the last three
>  lines are
>
>      Parallel version (MPI), running on    16 processors
>      path-images division:  nimage    =   10
>      R & G space division:  proc/pool =   16
>
>
>  In the output files out.n_0, where n = 1, ..., 9, the error message
>
>      Message from routine  read_line :
>      read error
>
>  is repeated several thousand times.
>
>
>  I have a feeling that I have got something wrong with the
>  parallel environment. If I (accidentally) compiled QE against a different
>  Open MPI version than 1.3.3.gnu, it would be interesting to know which
>  one. Does anyone have an idea of how I can check this?
>
>  In case the cause of the problem is a different one, it would be nice
>  if someone had any suggestions on how to solve it.
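
one way to check which MPI library an executable picks up is to
look at its shared-library dependencies. the following is only a
sketch: it assumes a dynamically linked build, that the executable
sits at ./bin/neb.x (a placeholder path, adjust it to your build),
and that ldd and ompi_info are available after "module load openmpi".
the helper name mpi_libs is made up for this example.

```shell
# Placeholder path to the executable; adjust to your own build tree.
exe=./bin/neb.x

# Read `ldd`-style output on stdin and print where each libmpi* resolves.
mpi_libs() {
    awk '/libmpi/ && $2 == "=>" { print ($3 == "not" ? "not found" : $3) }'
}

# Show the MPI libraries the executable would load at run time.
if [ -x "$exe" ] && command -v ldd >/dev/null 2>&1; then
    ldd "$exe" | mpi_libs
fi

# Show the version of the Open MPI installation currently on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep 'Open MPI:' || true
fi
```

if the path printed for libmpi does not live under the directory
named in the warning, the build and the runtime environment
probably disagree. a statically linked executable will print
nothing here.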

this sounds a lot like one of the nodes that you are using
has a network problem: you are trying to read from
an NFS-exported directory, but only get i/o errors. the OpenMPI-based
error supports this. at least, i have only seen this
kind of error when one of the nodes in a parallel job had
to be rebooted hard because of an obscure and rarely
triggered bug in the ethernet driver.

you should check whether this always happens or only when
one specific node is assigned to your job.
i would also talk to the sysadmin of the machine.
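
a rough way to spot a bad node is to note, for every job, which
nodes it ran on and whether it failed, and then count per node.
the sketch below assumes a hand-kept log with one line per job in
the form "<jobid> <ok|fail> <comma-separated node names>"; that
format and the function name node_failure_counts are invented for
this example. under SLURM the node names of a running job can
usually be listed with scontrol show hostnames "$SLURM_JOB_NODELIST".

```shell
# Read job records on stdin, one per line:
#   <jobid> <ok|fail> <comma-separated node names>   (hypothetical format)
# Print, for every node, how many failed and successful jobs it took part in.
node_failure_counts() {
    awk '{
        n = split($3, nodes, ",")
        for (i = 1; i <= n; i++) {
            seen[nodes[i]] = 1
            if ($2 == "fail") fail[nodes[i]]++
            else ok[nodes[i]]++
        }
    }
    END {
        for (h in seen) printf "%s fail=%d ok=%d\n", h, fail[h], ok[h]
    }' | sort
}
```

for example, node_failure_counts < jobs.log (jobs.log being
your own log file). a node that shows up in every failed job and
in no successful one is the first candidate to report to the
sysadmin.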

HTH,
    axel.

>
>  Thank you very much in advance.
>
>  Yours sincerely,
>
>  Torstein Fjermestad
>  University of Oslo,
>  Norway
>
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum

-- 
Dr. Axel Kohlmeyer
akohlmey at gmail.com  http://goo.gl/1wk0

College of Science and Technology
Temple University, Philadelphia PA, USA.