[Pw_forum] problem with neb calculations / openmpi

Torstein Fjermestad torstein.fjermestad at kjemi.uio.no
Mon Mar 19 11:12:13 CET 2012


 Dear Prof. Giannozzi,

 Thanks for the suggestion.
 The two tests I referred to were both run with image parallelization 
 (16 processors and 8 images).
 The tests were run with the same input file and submit script. The 
 command line was as follows:

 mpirun -np 16 -npernode 8 neb.x -nimage 8 -inp input.inp > output.out

 In this case the job is submitted and is labelled as "running". It 
 stays in that state until the requested wall time expires, but it 
 produces no output. At the end of the file slurm-<jobID>.out the 
 following message is printed:



 slurmd[compute-14-6]: *** JOB 9146164 CANCELLED AT 2012-03-15T23:20:09 DUE TO TIME LIMIT ***
 mpirun: killing job...

 Job 9146164 ("neb_11") completed on compute-14-[6-7] at Thu Mar 15 23:20:09 CET 2012
 --------------------------------------------------------------------------
 mpirun noticed that process rank 0 with PID 523 on node 
 compute-14-6.local exited on signal 0 (Unknown signal 0).
 --------------------------------------------------------------------------
 [compute-14-6.local:00516] [[31454,0],0]-[[31454,0],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
 mpirun: clean termination accomplished

 


 When I remove the image parallelization, either by setting -nimage 1 or 
 by omitting the option altogether (but still running on 16 processors), 
 the job runs for only a few seconds. At the end of the file 
 slurm-<jobID>.out the following message is printed:



  from test_input_xml: Empty input file .. stopping
 --------------------------------------------------------------------------
 mpirun has exited due to process rank 11 with PID 32678 on
 node compute-14-13 exiting without calling "finalize". This may
 have caused other processes in the application to be
 terminated by signals sent by mpirun (as reported here).
 --------------------------------------------------------------------------
 Job 9163874 ("neb_13") completed on compute-14-[12-13] at Sat Mar 17 19:55:23 CET 2012


 I found the line "from test_input_xml: Empty input file .. stopping" 
 particularly interesting: the program stops because it believes the 
 input file is empty.
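
 One quick check that could rule out the simplest explanation is whether 
 the input file is actually readable and non-empty from the job's working 
 directory. A minimal shell sketch, assuming the file is named input.inp 
 as in the command line above; check_input is a hypothetical helper, not 
 part of Quantum ESPRESSO:

```shell
# Hypothetical helper (not part of Quantum ESPRESSO): report whether a file
# is readable and non-empty. Returns 0 if usable, 1 if unreadable or
# missing, 2 if it exists but is empty.
check_input() {
    f="$1"
    if [ ! -r "$f" ]; then
        echo "not readable: $f"
        return 1
    fi
    if [ ! -s "$f" ]; then
        echo "empty: $f"
        return 2
    fi
    echo "ok: $f"
    return 0
}
```

 In a submit script this could be called as "check_input input.inp || 
 exit 1" just before the mpirun line. It only tests the node that runs 
 the script, so it says nothing about whether the second node sees the 
 same file.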

 Although this did not get me much closer to a running program, I 
 thought the change in behavior was worth reporting. Maybe it can give 
 you (or someone else) a hint about what is going on.

 Of course this erroneous behavior may have other causes, such as a 
 machine-related issue, the OpenMPI environment, the installation 
 procedure, etc. However, before contacting the sysadmin, I would like 
 to rule out (to the extent possible) any issues related to Quantum 
 ESPRESSO itself.

 Thanks in advance.

 Yours sincerely,
 Torstein Fjermestad
 University of Oslo,
 Norway

 
 



 On Thu, 15 Mar 2012 22:26:22 +0100, Paolo Giannozzi 
 <giannozz at democritos.it> wrote:
> On Mar 15, 2012, at 20:48 , Torstein Fjermestad wrote:
>
>>  pw.x now works without problem, but neb.x only works when one node
>>  (with 8 processors) is requested. I have run two tests requesting two
>>  nodes (16 processors) and in both cases I see the same erroneous
>>  behavior:
>
> with "image" parallelization in both cases? can you run neb.x with 1 image
> and 16 processors?
>
>
> P.
> ---
> Paolo Giannozzi, Dept of Chemistry&Physics&Environment,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222



