[Pw_forum] problem with neb calculations / openmpi
Layla Martin-Samos
lmartinsamos at gmail.com
Mon Mar 19 14:53:29 CET 2012
Dear Torstein, could you send the file input.inp, just to try to reproduce
the error on another machine?
Best,
Layla
2012/3/19 Torstein Fjermestad <torstein.fjermestad at kjemi.uio.no>
> Dear Prof. Giannozzi,
>
> Thanks for the suggestion.
> The two tests I referred to were both run with image parallelization
> (16 processors and 8 images).
> The tests were run with the same input file and submit script. The
> command line was as follows:
>
> mpirun -np 16 -npernode 8 neb.x -nimage 8 -inp input.inp > output.out
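>
> For completeness, the submit script is essentially of the following form
> (the SBATCH directives here are illustrative rather than copied verbatim):
>
> #!/bin/bash
> #SBATCH --job-name=neb_11
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=8
> #SBATCH --time=...
>
> # load the Quantum ESPRESSO / Open MPI environment here, then:
> mpirun -np 16 -npernode 8 neb.x -nimage 8 -inp input.inp > output.out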
>
> In this case the job is submitted and is labelled as "running". It
> stays like this until the end of the requested time, but it produces no
> output. At the end of the file slurm-<jobID>.out the following message
> is printed:
>
>
>
> slurmd[compute-14-6]: *** JOB 9146164 CANCELLED AT 2012-03-15T23:20:09
> DUE TO TIME LIMIT ***
> mpirun: killing job...
>
> Job 9146164 ("neb_11") completed on compute-14-[6-7] at Thu Mar 15
> 23:20:09 CET 2012
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 523 on node
> compute-14-6.local exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> [compute-14-6.local:00516] [[31454,0],0]-[[31454,0],1]
> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> mpirun: clean termination accomplished
>
>
>
>
> When removing the image parallelization by either setting -nimage 1 or
> removing the option altogether (but still running on 16 processors; the
> exact command lines are shown after the message below), the job only runs
> for a few seconds. At the end of the file slurm-<jobID>.out the following
> message is printed:
>
>
>
> from test_input_xml: Empty input file .. stopping
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 11 with PID 32678 on
> node compute-14-13 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> Job 9163874 ("neb_13") completed on compute-14-[12-13] at Sat Mar 17
> 19:55:23 CET 2012
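>
> (For clarity, the only change in these runs was the image flag on the
> command line, i.e. either
>
> mpirun -np 16 -npernode 8 neb.x -nimage 1 -inp input.inp > output.out
>
> or the same line with -nimage omitted entirely.)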
>
>
> I found the line "from test_input_xml: Empty input file .. stopping"
> particularly interesting. The program stops because it thinks the input
> file is empty.
>
> Although I did not get much closer to having a running program, I
> thought this change in behavior was interesting. Maybe it can give
> you (or someone else) a hint about what is going on.
>
> Of course, this erroneous behavior may have other causes, such as a
> machine-related issue, the Open MPI environment, the installation
> procedure, etc. However, before contacting the sysadmin, I would like
> to rule out (to the extent possible) any issues related to Quantum
> ESPRESSO itself.
>
> Thanks in advance.
>
> Yours sincerely,
> Torstein Fjermestad
> University of Oslo,
> Norway
>
>
>
>
>
>
> On Thu, 15 Mar 2012 22:26:22 +0100, Paolo Giannozzi
> <giannozz at democritos.it> wrote:
> > On Mar 15, 2012, at 20:48 , Torstein Fjermestad wrote:
> >
> >> pw.x now works without problem, but neb.x only works when one node
> >> (with 8 processors) is requested. I have run two tests requesting
> >> two
> >> nodes (16 processors) and in both cases I see the same erroneous
> >> behavior:
> >
> > with "image" parallelization in both cases? can you run neb.x with 1
> > image
> > and 16 processors?
> >
> >
> > P.
> > ---
> > Paolo Giannozzi, Dept of Chemistry&Physics&Environment,
> > Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> > Phone +39-0432-558216, fax +39-0432-558222
>
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum
>