[Pw_forum] problem with neb calculations / openmpi
Torstein Fjermestad
torstein.fjermestad at kjemi.uio.no
Mon Mar 19 11:12:13 CET 2012
Dear Prof. Giannozzi,
Thanks for the suggestion.
The two tests I referred to were both run with image parallelization
(16 processors and 8 images).
The tests were run with the same input file and submit script. The
command line was as follows:
mpirun -np 16 -npernode 8 neb.x -nimage 8 -inp input.inp > output.out
In this case the job is submitted and is labelled as "running". It
stays like this until the end of the requested time, but it produces no
output. At the end of the file slurm-<jobID>.out the following message
is printed:
slurmd[compute-14-6]: *** JOB 9146164 CANCELLED AT 2012-03-15T23:20:09 DUE TO TIME LIMIT ***
mpirun: killing job...
Job 9146164 ("neb_11") completed on compute-14-[6-7] at Thu Mar 15 23:20:09 CET 2012
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 523 on node
compute-14-6.local exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
[compute-14-6.local:00516] [[31454,0],0]-[[31454,0],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
mpirun: clean termination accomplished
When I remove the image parallelization, either by setting -nimage 1 or
by omitting the option altogether (while still running on 16 processors),
the job runs for only a few seconds. At the end of the file
slurm-<jobID>.out the following message is printed:
from test_input_xml: Empty input file .. stopping
--------------------------------------------------------------------------
mpirun has exited due to process rank 11 with PID 32678 on
node compute-14-13 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Job 9163874 ("neb_13") completed on compute-14-[12-13] at Sat Mar 17 19:55:23 CET 2012
I found the line "from test_input_xml: Empty input file .. stopping"
particularly interesting: the program stops because it considers the
input file to be empty.
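One hypothesis I have not verified is that the input file is not readable
(or appears empty) on the second node, for example if the working
directory is not on a shared file system. Below is a minimal sketch of
such a check, assuming a POSIX shell is available on the compute nodes.
The script name check_input.sh and the function check_input are
hypothetical helpers for illustration; they are not part of Quantum
ESPRESSO or Open MPI.

```shell
#!/bin/sh
# check_input.sh (hypothetical helper, not part of Quantum ESPRESSO).
# Reports whether a file is readable and non-empty on the host where
# this process happens to run.

check_input() {
    file=$1
    host=$(uname -n)    # node name of the machine running this process
    # -r: file is readable by this user; -s: file exists and has size > 0
    if [ -r "$file" ] && [ -s "$file" ]; then
        size=$(wc -c < "$file" | tr -d ' ')
        echo "$host: OK $file ($size bytes)"
        return 0
    fi
    echo "$host: MISSING OR EMPTY $file"
    return 1
}

# Run the check only when a file name is given on the command line.
if [ "$#" -gt 0 ]; then
    check_input "$1"
fi
```

Launched with the same mpirun options as above, for instance
mpirun -np 16 -npernode 8 sh check_input.sh input.inp
each of the 16 processes should print one line, so a node that cannot
see input.inp would show up immediately. This only tests file
visibility; it says nothing about how neb.x itself reads its input.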
Although I did not get much closer to having a running program, I
thought that this change in behavior was interesting. Maybe it can give
you (or someone else) a hint on what is going on.
Of course this erroneous behavior may have other causes, such as a
machine-related issue, the Open MPI environment, the installation
procedure, etc. However, before contacting the sysadmin, I would like
to rule out (to the extent possible) any issues related to Quantum
ESPRESSO itself.
Thanks in advance.
Yours sincerely,
Torstein Fjermestad
University of Oslo,
Norway
On Thu, 15 Mar 2012 22:26:22 +0100, Paolo Giannozzi
<giannozz at democritos.it> wrote:
> On Mar 15, 2012, at 20:48 , Torstein Fjermestad wrote:
>
>> pw.x now works without problem, but neb.x only works when one node
>> (with 8 processors) is requested. I have run two tests requesting
>> two
>> nodes (16 processors) and in both cases I see the same erroneous
>> behavior:
>
> with "image" parallelization in both cases? can you run neb.x with 1
> image
> and 16 processors?
>
>
> P.
> ---
> Paolo Giannozzi, Dept of Chemistry&Physics&Environment,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222