[Pw_forum] problem with neb calculations / openmpi
Torstein Fjermestad
torstein.fjermestad at kjemi.uio.no
Thu Mar 15 20:48:18 CET 2012
Dear Dr. Kohlmeyer,
Thank you for your suggestion. Since yesterday I have made some
progress, and I now think I see more systematic behavior. The first thing
I did was to recompile pw.x and neb.x against a newer version of Open MPI
(openmpi/1.4.3.gnu). Apparently this made the Open MPI error message
disappear.
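For completeness, the recompilation was just a clean rebuild against the
newer MPI module, roughly along these lines (a sketch only; the exact
module-swap syntax may differ between sites):

  module unload openmpi
  module load openmpi/1.4.3.gnu
  module load g95/093 gcc
  make clean
  ./configure
  make all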
pw.x now works without problems, but neb.x only works when a single node
(with 8 processors) is requested. I have run two tests requesting two
nodes (16 processors), and in both cases I see the same erroneous
behavior:
The output file is only 13 lines long, and the last three lines are as
follows:
Parallel version (MPI), running on 2 processors
path-images division: nimage = 8
R & G space division: proc/pool = 2
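For reference, the two-node runs were requested roughly as follows (a
sketch only; the input file name is generic, but the image and processor
counts are the ones used here):

  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=8
  mpirun -np 16 /path/to/neb.x -nimage 8 -inp neb.in > neb.out

So neb.x should see all 16 processors, yet the header above reports only 2.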
In the working directory of the calculations, the files of type out.5_0
contain several million repetitions of the error message
Message from routine read_line :
read error
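(A quick way to see the scale of the problem, in case it helps:

  grep -c "read error" out.*_0

counts the repetitions in each of these files.)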
I think this behavior is general, because it is rather unlikely that
both calculations were accidentally submitted to a defective node. To me it
seems as if there is some kind of failure in the communication between
the nodes.
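One simple check I can do before involving anyone else is to launch
something trivial over the same two-node allocation, for example

  mpirun -np 16 hostname

inside the job script. If that already misbehaves, the problem lies with
the MPI setup or the nodes rather than with neb.x itself. (This is only a
rough idea; the exact launcher options depend on the site.)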
I should certainly contact the sysadmins of the machine, but in order to
make their work easier, I would first like to determine whether the erroneous
behavior is caused by the machine or by the compilation/installation of
Quantum ESPRESSO.
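On my side, one thing I can verify is which Open MPI the executables were
actually linked against, e.g.

  ldd ./bin/pw.x | grep -i mpi
  ldd ./bin/neb.x | grep -i mpi
  ompi_info | head

and compare that with the openmpi/1.4.3.gnu module loaded at run time
(the ./bin paths are simply where the executables end up in my build tree).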
If anyone has had similar experiences before, it would be nice if you
could share ideas on possible causes.
Thanks in advance.
Yours sincerely,
Torstein Fjermestad
University of Oslo,
Norway
On Wed, 14 Mar 2012 16:36:15 -0400, Axel Kohlmeyer <akohlmey at gmail.com>
wrote:
> On Wed, Mar 14, 2012 at 4:18 PM, Torstein Fjermestad
> <torstein.fjermestad at kjemi.uio.no> wrote:
>> Dear all,
>>
>> I recently installed Quantum ESPRESSO v4.3.2 in my home directory on an
>> external supercomputer cluster.
>> The way I did this was to execute the following commands:
>>
>> ./configure
>> make all
>>
>> after first having loaded the MPI environment and the Fortran and C
>> compilers with the following commands:
>> module load openmpi
>> module load g95/093
>> module load gcc
>>
>> ./configure was successful and make seemed to finish normally (at least
>> I did not get any error message).
>>
>> So far I have only been using the pw.x and neb.x executables.
>> In a file named "slurm-jobID.out" that is generated by the queuing system,
>> I get the following message when running both pw.x and neb.x:
>>
>> mca: base: component_find: unable to open
>> /site/VERSIONS/openmpi-1.3.3.gnu/lib/openmpi/mca_mtl_psm: perhaps a
>> missing symbol, or compiled for a different version of Open MPI?
>> (ignored)
>>
>> This message seems rather clear, but I am not sure how relevant it is,
>> because pw.x runs without problems on 64 processors (I have compared the
>> output with that generated on another machine). neb.x, on the other hand,
>> works when running on a single processor but fails when running in
>> parallel (yes, I have used the -inp option).
>>
>> The output of the NEB calculation is only 13 lines long, and the last
>> three lines are:
>>
>> Parallel version (MPI), running on 16 processors
>> path-images division: nimage = 10
>> R & G space division: proc/pool = 16
>>
>>
>> In the output files out.n_0 where n={1,9} the error message
>>
>> Message from routine read_line :
>> read error
>>
>> is repeated several thousand times.
>>
>>
>> I have a feeling that there is something I have got wrong with the
>> parallel environment. If I (accidentally) compiled QE for a different
>> Open MPI version than 1.3.3.gnu, it would be interesting to know which
>> one. Does anyone have an idea how I can check this?
>>
>> In case the cause of the problem is a different one, it would be nice
>> if someone had any suggestions on how to solve it.
>
> this sounds a lot like one of the nodes that you are using has a network
> problem and you are trying to read from an NFS-exported directory, but get
> only i/o errors. the OpenMPI-based error supports this. at least, i have
> only seen this kind of error when one of the nodes in a parallel job had to
> be rebooted hard because of an obscure and rarely triggered bug in the
> ethernet driver.
>
> you should check whether this happens always or only when one specific
> node is assigned to your job.
> i would also talk to the sysadmin of the machine.
>
> HTH,
> axel.
>
>
>>
>> Thank you very much in advance.
>>
>> Yours sincerely,
>>
>> Torstein Fjermestad
>> University of Oslo,
>> Norway
>>
>> _______________________________________________
>> Pw_forum mailing list
>> Pw_forum at pwscf.org
>> http://www.democritos.it/mailman/listinfo/pw_forum