[Q-e-developers] New restart mechanism and signal handlers
Paolo Giannozzi
paolo.giannozzi at uniud.it
Wed Jun 4 19:06:32 CEST 2014
On Wed, 2014-06-04 at 18:42 +0200, Lorenzo Paulatto wrote:
>
> That would be the best solution; rotating between two outdir/wfcdir is
> probably the easiest way to implement it.
The easiest maybe, but not the best. There should be a distinction, as
in CP, between "input" and "output" data files, but this is currently
absent.
P.
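
To make the rotation idea concrete, here is a minimal sketch in Python (not Quantum ESPRESSO code). It assumes two slot directories and a small marker file that names the last complete one; the names `restart_a`, `restart_b`, `LATEST` and `data.bin` are hypothetical. The slot named by the marker plays the role of "input" data, and the other slot is the "output" being written.

```python
import os
import tempfile

SLOTS = ("restart_a", "restart_b")  # hypothetical slot directory names
MARKER = "LATEST"                   # hypothetical marker file name


def latest_slot(base):
    """Return the slot named by the marker, or None if no slot is complete."""
    try:
        with open(os.path.join(base, MARKER)) as f:
            name = f.read().strip()
    except FileNotFoundError:
        return None
    return name if name in SLOTS else None


def write_checkpoint(base, data):
    """Write data (bytes) to the slot that is not the latest, then flip the marker."""
    current = latest_slot(base)
    target = SLOTS[1] if current == SLOTS[0] else SLOTS[0]
    slot_dir = os.path.join(base, target)
    os.makedirs(slot_dir, exist_ok=True)
    with open(os.path.join(slot_dir, "data.bin"), "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Write the marker to a temporary file and rename it into place, so the
    # marker is only updated after the slot has been fully written.
    fd, tmp = tempfile.mkstemp(dir=base)
    with os.fdopen(fd, "w") as f:
        f.write(target)
    os.replace(tmp, os.path.join(base, MARKER))
    return target


def read_checkpoint(base):
    """Return the bytes of the last complete checkpoint, or None."""
    slot = latest_slot(base)
    if slot is None:
        return None
    with open(os.path.join(base, slot, "data.bin"), "rb") as f:
        return f.read()
```

If the process is killed while a slot is being written, the marker still names the previous complete slot, so a restart reads consistent data.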
> BTW, I was wondering if restarting from the last complete step in
> relax/md still works after a dirty stop. If it does not, I think it
> is important to support this special case.
>
>
> > carlo
> >
> > Il 04/06/2014 13:15, Paolo Giannozzi ha scritto:
> >> Interesting. I didn't consider that possibility because it used
> >> to be (years ago) quite machine-dependent, but if there is a
> >> safe and portable way to implement it, I am definitely in favor.
> >> It is not clear to me what happens in parallel execution: are
> >> all running processes "signaled"? If they aren't, it is a mess.
> >>
> >> Paolo
> >>
> >> On Wed, 2014-06-04 at 10:52 +0200, Lorenzo Paulatto wrote:
> >>> Hello,
> >>> I agree with Paolo that restart from a random crash is unpredictable and
> >>> unmaintainable. However, to make everybody happy I have extended the
> >>> current (normally disabled) signal handling mechanism to intercept the
> >>> signal that queue systems normally send a few minutes before killing the
> >>> job, in order to trigger a clean exit.
> >>>
> >>> Actually I had implemented this years ago, but I never uploaded it as it
> >>> increases the likelihood that the code will be forcefully killed while
> >>> it's writing the data!
> >>>
> >>> In the current situation, being killed while writing the data has the
> >>> same result (i.e. cannot restart) as being killed somewhere else, hence
> >>> I see no problem in implementing this change.
> >>>
> >>> I've also implemented clean exit when pressing CTRL-C; a double press of
> >>> CTRL-C will still kill the code immediately.
> >>>
> >>> I think that most queue systems send SIGTERM before SIGKILL, but some
> >>> may also send SIGINT or SIGUSR*. If people can start testing the code,
> >>> we can easily add more signals.
> >>>
> >>> I'll go ahead and upload the change later if there are no complaints;
> >>> it is still disabled by default. I think it would be sensible to enable
> >>> it by default in parallel compilation.
> >>>
> >>> cheers
> >>>
> >>>
> >>> p.s. I've also implemented the possibility to put the code into daemon
> >>> mode when SIGHUP is received (e.g. ssh connection dies, pw.x keeps
> >>> running in background) but it is fragile and not really useful. If you
> >>> want it, let me know.
> >>>
> >
>
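
The clean-exit behaviour described in the quoted message can be sketched as follows. This is an illustration using Python's standard `signal` module on a POSIX system, not the actual handler discussed in the thread; `StopState`, `install_handlers`, `run`, `do_step` and `write_restart` are hypothetical names. It covers a single process only and does not address how a signal is propagated in parallel execution.

```python
import os
import signal


class StopState:
    """Holds the flag that the main loop polls between steps."""
    requested = False


state = StopState()


def request_clean_exit(signum, frame):
    """Record the request; after the first SIGINT, restore the default action."""
    state.requested = True
    if signum == signal.SIGINT:
        # A second CTRL-C now gets the OS default action and terminates at once.
        signal.signal(signal.SIGINT, signal.SIG_DFL)


def install_handlers():
    # SIGTERM is what many queue systems send before SIGKILL; SIGINT is CTRL-C.
    # Other signals (e.g. SIGUSR1) could be added to this tuple.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, request_clean_exit)


def run(n_steps, do_step, write_restart):
    """Run up to n_steps; if a stop was requested, save and return early."""
    for step in range(n_steps):
        do_step(step)
        if state.requested:
            write_restart(step)  # always called at a step boundary
            return step
    return n_steps - 1
```

The handler only sets a flag, and the restart data is written by the main loop once the current step has completed, so the write never starts in the middle of a step.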
--
Paolo Giannozzi, Dept. Chemistry&Physics&Environment,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222