[Pw_forum] Fwd: Error in restarting a ph.x job
Paolo Giannozzi
paolo.giannozzi at uniud.it
Thu May 16 17:55:23 CEST 2013
Restarting an interrupted phonon run is risky business in all
cases. Restarting after a crash that happened while writing to
disk is hopeless. You might be able to salvage the contributions
to the dynamical matrices from the irreps already calculated, but
that depends on exactly what you were doing.
P.
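
The "iostat=-1" reported by iotk_getline in the quoted log below is the value Fortran runtimes commonly return on end-of-file, which would be consistent with an XML file that was cut short when the disk filled up. As a minimal sketch for locating such candidates, the following script walks a directory and lists XML files that are empty or not well-formed. The function name find_damaged_xml and the idea of pointing it at the phonon scratch directory are assumptions made for illustration; they are not part of Quantum ESPRESSO. A file flagged here is only a candidate for inspection, since iotk files are not guaranteed to be accepted by a strict XML parser, and a file that parses cleanly may still hold incomplete data.

```python
import sys
from pathlib import Path
import xml.etree.ElementTree as ET


def find_damaged_xml(root_dir):
    """Return (path, reason) pairs for XML files under root_dir that are
    empty or cannot be parsed as well-formed XML.

    Hypothetical helper for illustration; not part of Quantum ESPRESSO.
    """
    damaged = []
    for path in sorted(Path(root_dir).rglob("*.xml")):
        # A zero-length file is the most obvious sign of an aborted write.
        if path.stat().st_size == 0:
            damaged.append((path, "empty file"))
            continue
        try:
            ET.parse(str(path))
        except ET.ParseError as exc:
            # Truncated files typically fail with an unclosed-element error.
            damaged.append((path, "not well-formed: %s" % exc))
    return damaged


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_xml.py <directory to scan>
    for path, reason in find_damaged_xml(sys.argv[1]):
        print("%s: %s" % (path, reason))
```

The script only reads files and never modifies them, so it is safe to run on a copy of the scratch directory before deciding whether anything can be salvaged.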
On Thu, 2013-05-16 at 21:21 +1000, Hongze Xia wrote:
> Dear QE users,
>
> Recently my ph.x job was interrupted when another workstation user
> filled the whole disk, leaving ph.x unable to write files to disk.
> When I tried to resume from the last run, the following error
> occurred:
>
>
> #################################################################################
> # FROM IOTK LIBRARY, VERSION 1.2.0
> # UNRECOVERABLE ERROR (ierr=1)
> # ERROR IN: iotk_getline (iotk_scan.f90:947)
> # CVS Revision: 1.23
> #
> iostat=-1
> # ERROR IN: iotk_scan_tag (iotk_scan.f90:593)
> # CVS Revision: 1.23
> # ERROR IN: iotk_scan (iotk_scan.f90:821)
> # CVS Revision: 1.23
> # ERROR IN: iotk_scan_begin (iotk_scan.f90:98)
> # CVS Revision: 1.23
> ########################################################################################################################
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 2898 on
> node HotCarrier-Z820 exiting improperly. There are two reasons this
> could occur:
>
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls
> "init",
> then ALL processes must call "init" prior to termination.
>
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
>
> [1]+ Exit 1 mpirun -np 8 ph.x < SnS62.ph.in > SnS62.ph.out2
>
>
> Does anyone have an idea that could help me resume the job, which
> has been running for weeks? I will cry for a day if I need to start
> it from scratch. I would appreciate any suggestion. Thanks a lot.
>
>
> --
> Best Regards,
> Hongze Xia
> PhD candidate in Photovoltaics Engineering
> University of New South Wales
> Sydney 2052 Australia
--
Paolo Giannozzi, Dept. Chemistry&Physics&Environment,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222