[Q-e-developers] New restart mechanism and signal handlers

Wed Jun 4 18:42:01 CEST 2014

On 06/04/2014 05:09 PM, Carlo Cavazzoni wrote:
> in parallel execution all processes are signalled,
> but the time allowed is usually very short: 10 minutes.
> For large parallel jobs this is barely enough to complete an iteration,
> not to mention writing the checkpoint itself.
> One possible solution to prevent annoying corrupted restart
> files is to use two copies of the restart files (wave functions),
> alternating between the two. It can work like this:
> 1) read from the most recent copy of the restart/wave function files
> 2) write to the older copy (which will become the most recent copy for the
> next cycle if a crash does not occur in the meantime).
>
> This doubles the disk space required (which is not an issue),
> but keeps the amount of I/O traffic constant (which is an issue).
>

That would be the best solution; rotating between two outdir/wfcdir is 
probably the easiest way to implement it.
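
A minimal sketch of the two-copy rotation described above, in Python for
brevity rather than in the code's own language. The directory names
restart_a/restart_b, the file name wfc.dat and the COMPLETE marker file are
all hypothetical; the marker in particular is an assumption added here so
that a half-written copy can be told apart from a finished one.

```python
import os
import tempfile

SLOTS = ("restart_a", "restart_b")   # hypothetical names for the two copies
MARKER = "COMPLETE"                   # hypothetical marker file, written last


def completed_step(slot_dir):
    """Return the step stored in the marker, or -1 if the copy is missing or incomplete."""
    try:
        with open(os.path.join(slot_dir, MARKER)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return -1


def choose_slots(base):
    """Return (read_dir, write_dir): read the newest complete copy, overwrite the other."""
    dirs = [os.path.join(base, s) for s in SLOTS]
    steps = [completed_step(d) for d in dirs]
    newest = 0 if steps[0] > steps[1] else 1
    read_dir = dirs[newest] if steps[newest] >= 0 else None
    return read_dir, dirs[1 - newest]


def write_checkpoint(base, step, payload):
    """Write one checkpoint into the older copy; the newer copy is never touched."""
    _, write_dir = choose_slots(base)
    os.makedirs(write_dir, exist_ok=True)
    marker = os.path.join(write_dir, MARKER)
    if os.path.exists(marker):
        os.remove(marker)              # invalidate this copy before overwriting it
    with open(os.path.join(write_dir, "wfc.dat"), "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())           # push the data to disk before validating
    tmp = marker + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(step))
    os.replace(tmp, marker)            # atomic rename: the copy becomes valid only now
    return write_dir
```

If the job is killed in the middle of write_checkpoint, the copy being
written has no marker, so choose_slots still returns the other, intact copy
for reading. Each checkpoint is written once, so the I/O traffic is the same
as with a single copy.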

BTW, I was wondering whether restarting from the last completed step in 
relax/md still works after a dirty stop. If it does not, I think it 
is important to support this special case.

> carlo
>
> Il 04/06/2014 13:15, Paolo Giannozzi ha scritto:
>> Interesting. I didn't consider that possibility because it used
>> to be (years ago) quite machine-dependent, but if there is a
>> safe and portable way to implement it, I am definitely in favor.
>> It is not clear to me what happens in parallel execution: are
>> all running processes "signaled"? If they aren't, it is a mess.
>>
>> Paolo
>>
>> On Wed, 2014-06-04 at 10:52 +0200, Lorenzo Paulatto wrote:
>>> Hello,
>>> I agree with Paolo that restarting from a random crash is unpredictable and
>>> unmaintainable. However, to make everybody happy I have extended the
>>> current (normally disabled) signal handling mechanism to intercept the
>>> signal that queue systems normally send a few minutes before killing the
>>> job, in order to trigger a clean exit.
>>>
>>> Actually I had implemented this years ago, but I never uploaded it as it
>>> increases the likelihood that the code will be forcibly killed while
>>> it is writing the data!
>>>
>>> In the current situation, being killed while writing the data has the
>>> same result (i.e. cannot restart) as being killed somewhere else, hence
>>> I see no problem in implementing this change.
>>>
>>> I've also implemented a clean exit when pressing CTRL-C; a double press of
>>> CTRL-C will still kill the code immediately.
>>>
>>> I think that most queue systems send SIGTERM before SIGKILL, but some
>>> may also send SIGINT or SIGUSR*. If people can start testing the code, we
>>> can easily add more signals.
>>>
>>> I'll go ahead and upload the change later if there are no complaints;
>>> it is still disabled by default. I think it would be sensible to enable
>>> it by default in parallel compilation.
>>>
>>> cheers
>>>
>>>
>>> p.s. I've also implemented the possibility of putting the code in daemon
>>> mode when SIGHUP is received (e.g. the ssh connection dies, pw.x keeps
>>> running in the background), but it is fragile and not really useful. If you
>>> want it, let me know.
>>>
>
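
For anyone who wants to experiment with the behaviour discussed in the
quoted message, here is a rough Python sketch of the idea. It is not the
code that will be uploaded: the function names and the run loop are
hypothetical, and it assumes a POSIX system. The first SIGTERM or SIGINT
only sets a flag, which is checked once per completed step; after the first
CTRL-C the default SIGINT action is restored, so a second CTRL-C terminates
the process at once.

```python
import signal

_flag = {"stop": False}   # set by the handler, polled by the main loop


def stop_requested():
    """True once a handled signal has asked for a clean exit."""
    return _flag["stop"]


def request_stop(signum, frame):
    """Handler: record the request and let the current step finish."""
    _flag["stop"] = True
    if signum == signal.SIGINT:
        # Restore the default action so that a second CTRL-C kills the process.
        signal.signal(signal.SIGINT, signal.SIG_DFL)


def install_handlers(signals=(signal.SIGTERM, signal.SIGINT)):
    """Register the handler; more signals (e.g. SIGUSR1) can be passed in."""
    for sig in signals:
        signal.signal(sig, request_stop)


def run(n_steps, do_step, write_checkpoint):
    """Run up to n_steps steps; on a stop request, checkpoint and return early."""
    for step in range(n_steps):
        do_step(step)
        if stop_requested():
            write_checkpoint(step)   # only ever called between complete steps
            return step
    return n_steps - 1
```

The handler itself does no I/O; all writing happens in the main loop at a
point where the data are consistent. Whether the checkpoint finishes before
the queue system sends SIGKILL still depends on the grace period.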

-- 
Lorenzo Paulatto - Paris 18