[Pw_forum] Can I get away with restart_mode="restart" in this case? max_seconds did not end the calculation right.

Joshua Davis davis101 at chemistry.msu.edu
Mon Feb 15 19:46:51 CET 2016


Thank you for your help.

----------------------------------------------------------------------------------------------------------------
Joshua D. Davis
davis.d.josh at gmail.com
Cell: (734)707-1790

Graduate Assistant
Department of Chemistry
Michigan State University

578 S. Shaw Lane, room 432
East Lansing, MI 48824
-----------------------------------------------------------------------------------------------------------------

On Mon, Feb 15, 2016 at 12:37 PM, Paolo Giannozzi <p.giannozzi at gmail.com>
wrote:

> The check that "max_seconds" have elapsed is done at the end of each
> single diagonalization, so if the latter takes "many_seconds", the check
> may be triggered in the worst case when "max_seconds + some_seconds" have
> elapsed. Since it may take "some_more_seconds" to write data to disk, if
> you are out of luck, "max_seconds + some_seconds + some_more_seconds" will
> exceed the maximum allowed time by the batch queue (or, more exactly, the
> time after which the batch queue realizes that you are out of time: in your
> run, 86427s, or 27s more than the wall time limit, 86400).
>
> Unfortunately there is no way you can recover your data. And no, there is
> no reliable way to ask the operating system "how much time do I have"
> before starting a new diagonalization ...
>
> Paolo
>
> On Mon, Feb 15, 2016 at 6:11 PM, Joshua Davis <davis101 at chemistry.msu.edu>
> wrote:
>
>> Continued... (Sent before I meant to)
>>
>> I did try to use disk_io = "high", but I ran into "davcio (10)" read and
>> write errors, so I just used the default "low" option.  There were wfc files
>> written in my outdir too.
>>
>> Below are most of the control options I used:
>>
>> &CONTROL
>>    title = 'MgB5C2PP_NORMCON_HSE_ec140_5kp_115bnd_1Q',
>>    calculation = 'scf',
>>    pseudo_dir = './pot',
>>    outdir = './scratch',
>>    prefix = 'MgB5CPP_NC_PBE_ec140_5kp_115bnd',
>>    etot_conv_thr = 1.0D-5,
>>    forc_conv_thr = 1.0D-4,
>>    verbosity = 'high',
>>    wf_collect = .true.,
>>    max_seconds = 84600
>>  /
>>
>>  &SYSTEM
>>    ibrav = 0,
>>    nat = 52,
>>    ntyp = 3,
>>    ecutwfc = 140,
>>    nspin = 1,
>>    occupations = 'fixed',
>>    nbnd = 115,
>>    input_dft = 'hse',
>>    screening_parameter = 0.106,
>>    nqx1 = 1, nqx2 = 1, nqx3 = 1
>>  /
>>
>>  &ELECTRONS
>>    mixing_beta = 0.7,
>>    conv_thr = 1.D-8,
>>    electron_maxstep = 200
>>  /
>>
>>
>> ATOMIC_SPECIES
>>  Mg 24.305 Mg.pbe-hgh.UPF
>>  B 10.81 B.pbe-hgh.UPF
>>  C 12.011 C.pbe-hgh.UPF
>>
>>
>> K_POINTS (automatic)
>>  5 5 5  0 0 0
>>
>> The calculation ended with:
>>
>>     100 total processes killed (some possibly by mpirun during cleanup)
>>
>> in the out file, and the following was in the scheduler output file:
>>
>>     mpirun: killing job...
>>
>>
>> --------------------------------------------------------------------------
>>     mpirun noticed that process rank 0 with PID 26679 on node scw-003
>> exited on signal 0 (Unknown signal 0).
>>
>> --------------------------------------------------------------------------
>>     =>> PBS: job killed: walltime 86427 exceeded limit 86400
>>     mpirun: abort is already in progress...hit ctrl-c again to forcibly
>> terminate
>>
>>
>>
>> Other info:   The system runs CentOS 6.6, and I am running QE5.3 compiled
>> with ifort 13.01
>>
>> Any help would be much appreciated.
>>
>> ----------------------------------------------------------------------------------------------------------------
>> Joshua D. Davis
>>
>> Graduate Assistant
>> Department of Chemistry
>> Michigan State University
>>
>> -----------------------------------------------------------------------------------------------------------------
>>
>> On Mon, Feb 15, 2016 at 11:55 AM, Joshua Davis <
>> davis101 at chemistry.msu.edu> wrote:
>>
>>> Dear pwscf forum,
>>>
>>> I am currently trying to run an HSE calculation on my university's high
>>> performance cluster.  To make sure the density and wave-functions are
>>> written properly before the scheduled session ends, I usually use max_seconds
>>> to stop the calculation.  This time max_seconds did not stop the calculation
>>> in time, and the job was ended by the scheduler.  Can I still use the
>>> wave-function files even though the calculation did not end properly?
>>>
>>> disk_io is set to the default "low".  I did try to use
>>> disk_io = "high", but I ran into "davcio (10)"
>>>
>>>
>>> ----------------------------------------------------------------------------------------------------------------
>>> Joshua D. Davis
>>>
>>> Graduate Assistant
>>> Michigan State University
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------------
>>>
>>
>>
>> _______________________________________________
>> Pw_forum mailing list
>> Pw_forum at pwscf.org
>> http://pwscf.org/mailman/listinfo/pw_forum
>>
>
>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>
>
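
Editorial sketch of the timing argument in Paolo's reply above: the stop check only runs at the end of a diagonalization, so max_seconds has to leave room for one full step plus the time needed to write data to disk. The Python below is a minimal illustration and not part of Quantum ESPRESSO. The function names and the step and write durations are hypothetical placeholders; only the 86400 s wall time limit and max_seconds = 84600 come from the thread.

```python
def margin(walltime_limit_s, max_seconds):
    """Seconds left between the max_seconds threshold and the queue limit."""
    return walltime_limit_s - max_seconds


def safe_max_seconds(walltime_limit_s, longest_step_s, write_time_s):
    """Largest max_seconds that still leaves room for one full step
    (the check may only fire after that step finishes) plus the final write."""
    budget = walltime_limit_s - longest_step_s - write_time_s
    if budget <= 0:
        raise ValueError("a single step plus the write does not fit in the wall time")
    return budget


# Values from the thread: 24 h queue limit, max_seconds = 84600.
print(margin(86400, 84600))  # 1800

# Hypothetical durations; replace them with timings measured from a real run.
print(safe_max_seconds(86400, longest_step_s=3600, write_time_s=300))  # 82500
```

With the settings in the thread the margin was 1800 s, so a step that starts just before the threshold and needs more than 30 minutes in total to finish and write its data can overrun the queue limit.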