[Pw_forum] problem with neb calculations / openmpi

Layla Martin-Samos lmartinsamos at gmail.com
Mon Mar 19 14:53:29 CET 2012


Dear Torstein, could you send the file input.inp, just to try to reproduce
the error on another machine?

best,

Layla

2012/3/19 Torstein Fjermestad <torstein.fjermestad at kjemi.uio.no>

>  Dear Prof. Giannozzi,
>
>  Thanks for the suggestion.
>  The two tests I referred to were both run with image parallelization
>  (16 processors and 8 images).
>  The tests were run with the same input file and submit script. The
>  command line was as follows:
>
>  mpirun -np 16 -npernode 8 neb.x -nimage 8 -inp input.inp > output.out
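>
>  For reference, a minimal sketch of the kind of SLURM batch script that
>  wraps this command (the walltime below is a placeholder, not the value
>  from the actual submit script):
>
>  #!/bin/bash
>  #SBATCH --job-name=neb_11
>  #SBATCH --nodes=2
>  #SBATCH --ntasks-per-node=8
>  #SBATCH --time=24:00:00
>
>  mpirun -np 16 -npernode 8 neb.x -nimage 8 -inp input.inp > output.out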
>
>  In this case the job is submitted and labelled as "running". It stays
>  in that state until the requested walltime runs out, but produces no
>  output. At the end of the file slurm-<jobID>.out, the following message
>  is printed:
>
>
>
>  slurmd[compute-14-6]: *** JOB 9146164 CANCELLED AT 2012-03-15T23:20:09
>  DUE TO TIME LIMIT ***
>  mpirun: killing job...
>
>  Job 9146164 ("neb_11") completed on compute-14-[6-7] at Thu Mar 15
>  23:20:09 CET 2012
>  --------------------------------------------------------------------------
>  mpirun noticed that process rank 0 with PID 523 on node
>  compute-14-6.local exited on signal 0 (Unknown signal 0).
>  --------------------------------------------------------------------------
>  [compute-14-6.local:00516] [[31454,0],0]-[[31454,0],1]
>  mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>  mpirun: clean termination accomplished
>
>
>
>
>  When removing the image parallelization by either setting -nimage 1 or
>  removing the option altogether (but still running on 16 processors), the
>  job only runs for a few seconds. At the end of the file
>  slurm-<jobID>.out, the following message is printed:
>
>
>
>  from test_input_xml: Empty input file .. stopping
>  --------------------------------------------------------------------------
>  mpirun has exited due to process rank 11 with PID 32678 on
>  node compute-14-13 exiting without calling "finalize". This may
>  have caused other processes in the application to be
>  terminated by signals sent by mpirun (as reported here).
>  --------------------------------------------------------------------------
>  Job 9163874 ("neb_13") completed on compute-14-[12-13] at Sat Mar 17
>  19:55:23 CET 2012
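>
>  For reference, the two corresponding command lines (with -nimage 1 and
>  with the image option removed, assuming the same -npernode setting)
>  would have been of the form:
>
>  mpirun -np 16 -npernode 8 neb.x -nimage 1 -inp input.inp > output.out
>  mpirun -np 16 -npernode 8 neb.x -inp input.inp > output.out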
>
>
>  I found the line "from test_input_xml: Empty input file .. stopping"
>  particularly interesting. The program stops because it thinks the input
>  file is empty.
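>
>  If that message is taken literally, the program found nothing to read.
>  One quick check (just a sketch, the invocation below is illustrative)
>  is to verify that input.inp sits on a filesystem visible from the
>  compute nodes:
>
>  srun --nodes=2 --ntasks=2 ls -l "$PWD/input.inp"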
>
>  Although I did not get much closer to having a running program, I
>  thought this change in behavior was interesting. Maybe it can give you
>  (or someone else) a hint about what is going on.
>
>  Of course, this erroneous behavior may have other causes, such as a
>  machine-related issue, the OpenMPI environment, the installation
>  procedure, etc. However, before contacting the sysadmin, I would like
>  to rule out (to the extent possible) any issues related to Quantum
>  ESPRESSO itself.
>
>  Thanks in advance.
>
>  Yours sincerely,
>  Torstein Fjermestad
>  University of Oslo,
>  Norway
>
>
>
>
>
>
>  On Thu, 15 Mar 2012 22:26:22 +0100, Paolo Giannozzi
>  <giannozz at democritos.it> wrote:
> > On Mar 15, 2012, at 20:48 , Torstein Fjermestad wrote:
> >
> >>  pw.x now works without problems, but neb.x only works when one node
> >>  (with 8 processors) is requested. I have run two tests requesting
> >> two
> >>  nodes (16 processors) and in both cases I see the same erroneous
> >>  behavior:
> >
> > with "image" parallelization in both cases? can you run neb.x with 1
> > image
> > and 16 processors?
> >
> >
> > P.
> > ---
> > Paolo Giannozzi, Dept of Chemistry&Physics&Environment,
> > Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> > Phone +39-0432-558216, fax +39-0432-558222
>
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum
>