[Pw_forum] problem with neb calculations / openmpi
Axel Kohlmeyer
akohlmey at gmail.com
Wed Mar 14 21:36:15 CET 2012
On Wed, Mar 14, 2012 at 4:18 PM, Torstein Fjermestad
<torstein.fjermestad at kjemi.uio.no> wrote:
> Dear all,
>
> I recently installed quantum espresso v4.3.2 in my home directory at an
> external supercomputer cluster.
> The way I did this was to execute the following commands:
>
> ./configure
> make all
>
> after first having loaded the MPI environment and the Fortran and C
> compilers with the following commands:
> module load openmpi
> module load g95/093
> module load gcc
>
> ./configure was successful and make seemed to finish normally (at least
> I did not get any error message).
>
> So far I have only been using the pw.x and neb.x executables.
> In a file named "slurm-jobID.out" that is generated by the queuing
> system, I get the following message when running both pw.x and neb.x:
>
> mca: base: component_find: unable to open
> /site/VERSIONS/openmpi-1.3.3.gnu/lib/openmpi/mca_mtl_psm: perhaps a
> missing symbol, or compiled for a different version of Open MPI?
> (ignored)
>
> This message seems rather clear, but I am not sure how relevant it is
> because pw.x runs without problem on 64 processors (I have compared the
> output with that generated on another machine). neb.x on the other hand
> works when running on a single processor, but fails when running in
> parallel (yes, I have used the -inp option).
>
> The output of the neb calculation is only 13 lines and the last three
> lines are
>
> Parallel version (MPI), running on 16 processors
> path-images division: nimage = 10
> R & G space division: proc/pool = 16
>
>
> In the output files out.n_0 where n={1,9} the error message
>
> Message from routine read_line :
> read error
>
> is repeated several thousand times.
>
>
> I have a feeling that there is something I have got wrong with the
> parallel environment. If I (accidentally) compiled QE for a different
> openmpi version than 1.3.3.gnu, it would be interesting to know which
> one. Does anyone have an idea on how I can check this?
>
> In case the cause of the problem is a different one, it would be nice
> if someone had any suggestions on how to solve it.
this sounds a lot like one of the nodes that you are using
has a network problem and you are trying to read from
an NFS exported directory, but only get i/o errors. the OpenMPI
based error supports this. at least, i have only seen this
kind of error when one of the nodes in a parallel job had
to be rebooted hard because of an obscure and rarely
triggered bug in the ethernet driver.
you should check whether this happens always or only when
one specific node is assigned to your job.
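One way to run that check, as a rough sketch: record the node list of every job and compare failed runs against successful ones. This assumes the jobs run under SLURM (the slurm-jobID.out file suggests so), where `scontrol show hostnames "$SLURM_JOB_NODELIST"` prints one hostname per line. The file names failed_nodes.txt and ok_nodes.txt and the helpers suspect_nodes and node_counts are made up for illustration.

```shell
#!/bin/sh
# Sketch only: file names and helper names are hypothetical.
#
# In each job script (SLURM assumed), append the job's nodes to a log,
# one hostname per line, choosing the file by how the run ended:
#   scontrol show hostnames "$SLURM_JOB_NODELIST" >> failed_nodes.txt
#   scontrol show hostnames "$SLURM_JOB_NODELIST" >> ok_nodes.txt

# suspect_nodes FAILED_FILE OK_FILE
# Print every node that took part in a failed job but in no successful one.
# OK_FILE must exist; it may be empty.
suspect_nodes() {
    sort -u "$1" | grep -vxF -f "$2"
}

# node_counts FILE
# Print how often each node appears in FILE, most frequent first.
node_counts() {
    sort "$1" | uniq -c | sort -rn
}
```

After a handful of runs, `suspect_nodes failed_nodes.txt ok_nodes.txt` lists the nodes that only ever show up in failed jobs, and `node_counts failed_nodes.txt` shows whether one hostname is present in every failure. A single recurring node would be the one to report.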
i would also talk to the sysadmin of the machine.
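On the question of which Open MPI the executables were built against: a minimal sketch, assuming the executables are dynamically linked on Linux, is to look at the MPI libraries that `ldd` resolves at run time. The helper name mpi_libs_from_ldd and the path in the usage comment are placeholders, not part of QE or Open MPI.

```shell
#!/bin/sh
# mpi_libs_from_ldd (hypothetical helper)
# Read `ldd` output on stdin and print the resolved path of each Open MPI
# library (libmpi*, libopen-rte, libopen-pal), one per line. Libraries that
# ldd could not resolve are reported as "<name>: not found".
mpi_libs_from_ldd() {
    awk '$1 ~ /^libmpi|^libopen-(rte|pal)/ && $2 == "=>" {
        print ($3 == "not" ? $1 ": not found" : $3)
    }'
}

# Usage (inside the job environment, after the same `module load` lines):
#   ldd /path/to/pw.x | mpi_libs_from_ldd
#   mpirun --version
```

If the printed paths point to a different Open MPI installation than the one named in the mca warning, or `mpirun --version` reports a different release, the build and run-time environments do not match. A statically linked executable prints nothing here, so an empty result does not prove anything either way.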
HTH,
axel.
>
> Thank you very much in advance.
>
> Yours sincerely,
>
> Torstein Fjermestad
> University of Oslo,
> Norway
--
Dr. Axel Kohlmeyer
akohlmey at gmail.com http://goo.gl/1wk0
College of Science and Technology
Temple University, Philadelphia PA, USA.