[Q-e-developers] New restart mechanism and signal handlers

Wed Jun 4 17:09:14 CEST 2014

In parallel execution all processes are signalled,
but the time left before the job is killed is usually very short: 10 minutes.
For large parallel jobs this is barely enough to complete an iteration,
let alone write the checkpoint itself.
One possible way to avoid annoying corrupted restart
files is to keep two copies of the restart files (wave functions),
alternating between the two. It can work like this:
1) read from the most recent copy of the restart/wave-function files
2) write to the older copy (which will become the most recent copy for the
next cycle, if a crash does not occur in the meantime).

This doubles the disk space required (which is not an issue),
but keeps the amount of I/O traffic constant (which is an issue).
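
To make the two steps above concrete, here is a minimal sketch in Python
(not the Fortran code of the package itself). The function names, the JSON
format and the "step" counter used to decide which copy is newer are all
assumptions made for illustration only.

```python
import json
import os


def read_slot(path):
    """Return (step, state) from one copy, or None if it is missing or corrupted."""
    try:
        with open(path, "r") as f:
            record = json.load(f)
        return record["step"], record["state"]
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, truncated/invalid JSON, or unexpected layout.
        return None


def load_latest(slots):
    """Step 1: pick the most recent readable copy. Returns (step, state, index) or None."""
    best = None
    for index, path in enumerate(slots):
        record = read_slot(path)
        if record is not None and (best is None or record[0] > best[0]):
            best = (record[0], record[1], index)
    return best


def save(slots, step, state):
    """Step 2: overwrite the older copy; the newest readable copy is never touched."""
    latest = load_latest(slots)
    target = 0 if latest is None else 1 - latest[2]
    with open(slots[target], "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk
    return slots[target]
```

If the job is killed in the middle of save(), only the older copy is
damaged, and the next run falls back to the other one. Each save still
writes a single copy, so the I/O volume per checkpoint does not change.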

carlo

On 04/06/2014 13:15, Paolo Giannozzi wrote:
> Interesting. I didn't consider that possibility because it used
> to be (years ago) quite machine-dependent, but if there is a
> safe and portable way to implement it, I am definitely in favor.
> It is not clear to me what happens in parallel execution: are
> all running processes "signaled"? If they aren't, it is a mess.
>
> Paolo
>
> On Wed, 2014-06-04 at 10:52 +0200, Lorenzo Paulatto wrote:
>> Hello,
>> I agree with Paolo that restart from random crash is unpredictable and
>> unmaintainable. However, to make everybody happy I have extended the
>> current (normally disabled) signal handling mechanism to intercept the
>> signal that queue systems normally send a few minutes before killing the
>> job, in order to trigger a clean exit.
>>
>> Actually I had implemented this years ago, but I never uploaded it as it
>> increases the likelihood that the code will be forcefully killed when
>> it's writing the data!
>>
>> In the current situation, being killed while writing the data has the
>> same result (i.e. cannot restart) as being killed somewhere else, hence
>> I see no problem in implementing this change.
>>
>> I've also implemented clean exit when pressing CTRL-C, a double press of
>> CTRL-C will still kill the code immediately.
>>
>> I think that most queue systems send SIGTERM before SIGKILL, but some
>> may also send SIGINT or SIGUSR*. If people can start to test the code, we
>> can easily add more signals.
>>
>> I'll go forward and upload the change later, if there are no complaints;
>> it is still disabled by default. I think it would be sensible to enable
>> it by default in parallel compilation.
>>
>> cheers
>>
>>
>> p.s. I've also implemented the possibility to put the code in daemon
>> mode when SIGHUP is received (e.g. ssh connection dies, pw.x keeps
>> running in background), but it is fragile and not really useful. If you
>> want it, let me know.
>>
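
For readers who want to see the idea quoted above in code, the following is
a minimal sketch in Python, not the actual implementation discussed in this
thread. The handler only records the request; the main loop checks it
between iterations and then writes the checkpoint. StopRequest, install and
run are hypothetical names, and the choice of SIGTERM and SIGINT follows the
guess made in the quoted message.

```python
import signal


class StopRequest:
    """Records that a clean exit was requested by a signal."""

    def __init__(self):
        self.requested = False
        self.signum = None

    def handle(self, signum, frame):
        if self.requested and signum == signal.SIGINT:
            # CTRL-C after a stop was already requested: restore the default
            # action and re-send the signal, which ends the process at once.
            signal.signal(signal.SIGINT, signal.SIG_DFL)
            signal.raise_signal(signal.SIGINT)
        self.requested = True
        self.signum = signum


def install(stop, signums=(signal.SIGTERM, signal.SIGINT)):
    """Route the given signals to the StopRequest object."""
    for signum in signums:
        signal.signal(signum, stop.handle)


def run(n_iterations, stop, do_iteration, write_checkpoint):
    """Run iterations; on a stop request, checkpoint after the current one and return."""
    for it in range(n_iterations):
        do_iteration(it)
        if stop.requested:
            write_checkpoint(it)
            return it
    return n_iterations - 1
```

The handler does no I/O itself, so the checkpoint is only written at a
well-defined point between iterations. Whether it completes before the
queue system sends SIGKILL still depends on the time the scheduler allows.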

-- 
Ph.D. Carlo Cavazzoni
SuperComputing Applications and Innovation Department
CINECA - Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)
Tel: +39 051 6171411  Fax: +39 051 6132198
www.cineca.it