[Q-e-developers] New restart mechanism and signal handlers
Lorenzo Paulatto
lorenzo.paulatto at impmc.upmc.fr
Wed Jun 4 13:33:15 CEST 2014
It depends on the MPI implementation. What normally happens is that the
MPI master process (not pw.x) receives the signal and then forwards it
to every instance of pw.x.
What the code does is this: when it receives the signal, a flag is set
in a variable local to a module. Later, when the code passes through
check_stop, the value of the variable is broadcast and, if it is found
to be true, the code begins to stop. It does not matter whether every
process received the signal, as long as root did: currently it is the
root of mp_world that does the broadcast. This could be fixed (use an
integer flag initially set to zero, set it to one on signal, call
mp_sum, check for >0), but I never found a case where the current
implementation does not work. It will only fail in the strange case
where you manually send the signal to a process other than the root.
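
To make the difference between the two schemes concrete, here is a
minimal sketch in plain Python. It does not use MPI: the ranks are
simulated with a list of per-rank flags, and every name in it
(stop_flag, on_signal, stop_by_root_broadcast, stop_by_sum) is a
hypothetical stand-in, not the actual routine in the code. SIGTERM as
the default signal is likewise only an assumption.

```python
import signal

# All names below are hypothetical stand-ins, not the actual QE routines.
stop_flag = 0  # integer flag local to this "module": 0 = run, 1 = signal seen


def on_signal(signum, frame):
    """Signal handler: only record the request; the stop happens later."""
    global stop_flag
    stop_flag = 1


def stop_requested():
    """True once the handler has run in this process."""
    return stop_flag > 0


def install_handlers(signals=(signal.SIGTERM,)):
    """Register the handler (assumption: SIGTERM is what the queue sends)."""
    for sig in signals:
        signal.signal(sig, on_signal)


def stop_by_root_broadcast(flags, root=0):
    """Current scheme: every rank adopts the flag held by root."""
    return flags[root] > 0


def stop_by_sum(flags):
    """Proposed scheme: sum the flags over all ranks, stop if the sum is > 0."""
    return sum(flags) > 0


# Four simulated ranks; only rank 2 was signalled by hand.
flags = [0, 0, 1, 0]
print(stop_by_root_broadcast(flags))  # root saw nothing, so no stop
print(stop_by_sum(flags))             # one flagged rank is enough to stop
```

With the sum, the decision no longer depends on which process happened
to receive the signal, at the cost of a reduction instead of a
broadcast at each check.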
On 06/04/2014 01:15 PM, Paolo Giannozzi wrote:
> Interesting. I didn't consider that possibility because it used
> to be (years ago) quite machine-dependent, but if there is a
> safe and portable way to implement it, I am definitely in favor.
> It is not clear to me what happens in parallel execution: are
> all running processes "signaled"? If they aren't, it is a mess.
>
> Paolo
>
> On Wed, 2014-06-04 at 10:52 +0200, Lorenzo Paulatto wrote:
>> Hello,
>> I agree with Paolo that restart from a random crash is unpredictable and
>> unmaintainable. However, to make everybody happy I have extended the
>> current (normally disabled) signal handling mechanism to intercept the
>> signal that queue systems normally send a few minutes before killing the
>> job, in order to trigger a clean exit.
>>
>> Actually I had implemented this years ago, but I never uploaded it as it
>> increases the likelihood that the code will be forcefully killed while
>> it's writing the data!
>>
>> In the current situation, being killed while writing the data has the
>> same result (i.e. cannot restart) as being killed somewhere else, hence
>> I see no problem in implementing this change.
>>
>> I've also implemented a clean exit when pressing CTRL-C; a double press
>> of CTRL-C will still kill the code immediately.
>>
>> I think that most queue systems send SIGTERM before SIGKILL, but some
>> may also send SIGINT or SIGUSR*. If people can start testing the code,
>> we can easily add more signals.
>>
>> I'll go forward and upload the change later, if there are no complaints.
>> It is still disabled by default. I think it would be sensible to enable
>> it by default in parallel compilation.
>>
>> cheers
>>
>>
>> p.s. I've also implemented the possibility to put the code in daemon
>> mode when SIGHUP is received (e.g. ssh connection dies, pw.x keeps
>> running in background), but it is fragile and not really useful. If you
>> want it, let me know.
>>
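
The CTRL-C behaviour described in the quoted message (first press asks
for a clean exit, second press kills immediately) can be sketched in
Python as below. This is only one possible way to get that behaviour,
under the assumption that the handler restores the default action after
the first SIGINT; the names are hypothetical and it is not claimed to
be how the code actually does it.

```python
import signal

# Hypothetical names for illustration only.
_interrupted = False  # set by the first CTRL-C, polled by the main loop


def interrupted():
    """True once a first CTRL-C has been received."""
    return _interrupted


def on_sigint(signum, frame):
    """First CTRL-C: request a clean stop, then re-arm the default action."""
    global _interrupted
    _interrupted = True
    # Restore the default OS action, so a second CTRL-C terminates the
    # process at once instead of calling this handler again.
    signal.signal(signal.SIGINT, signal.SIG_DFL)


def install_sigint_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGINT, on_sigint)
```

A main loop would call install_sigint_handler() once at start-up and
test interrupted() at each safe stopping point.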
--
Dr. Lorenzo Paulatto
IdR @ IMPMC -- CNRS & Université Paris 6
+33 (0)1 44 275 084 / skype: paulatz
http://www-int.impmc.upmc.fr/~paulatto/
23-24/4é16 Boîte courrier 115, 4 place Jussieu 75252 Paris Cédex 05