[Pw_forum] Geometry optimization on QE530-GPU with memory allocation error?

Filippo Spiga filippo.spiga at quantum-espresso.org
Tue Feb 16 08:39:14 CET 2016


Rolly,

I assume you use some sort of script to submit or run your calculation. Do not run for 500K seconds; split the run into a sequence of shorter ones and keep max_seconds within 12h~24h (43200 to 86400 seconds). That way, if something goes wrong in one run of your long relaxation, you can always resume safely without recomputing too much.

This suggestion is driven by common sense, not by how QE or QE-GPU work.
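
The advice above can be sketched as a small driver script. This is only an illustration under stated assumptions, not part of QE or QE-GPU: the function names are made up, pw-gpu.x is assumed to read its input from standard input, and the completion check assumes that pw.x prints the line "End of BFGS Geometry Optimization" when the relaxation is done. Each input is expected to carry its own max_seconds (for example 43200 for 12h) in &CONTROL.

```python
import subprocess

# Assumption: pw.x prints this line once the BFGS relaxation has finished.
DONE_MARKER = "End of BFGS Geometry Optimization"


def run_pw(input_path, exe="pw-gpu.x"):
    """Run one pw.x job, feeding the input file on stdin; return its stdout.

    The exit code is deliberately not checked: a run that stops because
    max_seconds was reached is an expected outcome here, not a failure.
    """
    with open(input_path) as handle:
        result = subprocess.run([exe], stdin=handle, capture_output=True, text=True)
    return result.stdout


def run_chain(run_once, max_runs=20):
    """Call run_once(mode) until the relaxation finishes; return the run count.

    run_once receives "from_scratch" for the first run and "restart" for
    every later one, and must return the text output of that run.
    """
    for n in range(max_runs):
        mode = "from_scratch" if n == 0 else "restart"
        if DONE_MARKER in run_once(mode):
            return n + 1
    raise RuntimeError(f"relaxation not finished after {max_runs} runs")
```

With two prepared inputs that differ only in restart_mode (hypothetical names first.in and restart.in), a call could look like run_chain(lambda mode: run_pw("first.in" if mode == "from_scratch" else "restart.in")). The cap on the number of runs avoids resubmitting forever if the relaxation never converges; a real script should also break the chain when a run crashes instead of stopping cleanly.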

HTH

--
Filippo SPIGA
* Sent from my iPhone, sorry for typos *

> On 16 Feb 2016, at 07:01, Rolly Ng <rollyng at gmail.com> wrote:
> 
> Dear Paolo,
>  
> Thank you for the clarification, I will give it a try.
>  
> Regards,
> Rolly
>  
> PhD, Research Fellow,
> Department of Physics and Materials Science,
> City University of Hong Kong
> Tel: +852 3442 4000
> Fax:+852 3442 0538
>  
> From: pw_forum-bounces at pwscf.org [mailto:pw_forum-bounces at pwscf.org] On Behalf Of Paolo Giannozzi
> Sent: Tuesday, February 16, 2016 2:51 PM
> To: PWSCF Forum
> Subject: Re: [Pw_forum] Geometry optimization on QE530-GPU with memory allocation error?
>  
> You do not need to update the atomic coordinates: when you restart from a previous run (after a clean stop), the code reads and uses the latest set of coordinates.
> 
> Paolo
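
Although no manual update is needed for a restart, it can still be useful to look at the latest geometry. The sketch below pulls the last ATOMIC_POSITIONS block out of the text output of a relaxation. It is a hedged illustration: the function name is made up, and it assumes that pw.x prints an "ATOMIC_POSITIONS" header followed by one line per atom (label, then x, y, z as plain decimal numbers) after each BFGS step.

```python
def last_positions(output):
    """Return (header, rows) for the last ATOMIC_POSITIONS block, or None.

    rows is a list of (label, x, y, z) tuples. Trailing columns such as
    the 0/1 flags that fix an atom are ignored.
    """
    lines = output.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.strip().startswith("ATOMIC_POSITIONS"):
            start = i  # keep overwriting so the last block wins
    if start is None:
        return None

    rows = []
    for line in lines[start + 1:]:
        fields = line.split()
        if len(fields) < 4:
            break  # blank line or trailing text: the block has ended
        try:
            x, y, z = (float(value) for value in fields[1:4])
        except ValueError:
            break  # not a coordinate line
        rows.append((fields[0], x, y, z))
    return lines[start].strip(), rows
```

Calling last_positions(open("relax.out").read()) on a saved output (relax.out is a hypothetical file name) gives the header, which carries the units, and the coordinates of the most recent step.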
>  
> On Tue, Feb 16, 2016 at 6:39 AM, Rolly Ng <rollyng at gmail.com> wrote:
> Dear Filippo,
>  
> Thanks for the quick tip.
>  
> I would like to know the correct way to stop and restart a geometry optimization.
>  
> 1) Initially, add max_seconds = 500000 to the &CONTROL section
> 2) Add restart_mode = 'from_scratch' to the &CONTROL section
> 3) Run pw-gpu.x and wait for the run to stop after 500000 seconds
> 4) Change restart_mode to 'restart' in the &CONTROL section
> 5) Rerun pw-gpu.x and wait for the run to stop after 500000 seconds
> 
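
Steps 2 and 4 above amount to editing one entry of the &CONTROL namelist between runs. A minimal sketch of doing that edit from a script follows. The helper name set_control is made up, and the code assumes each &CONTROL entry sits on its own line and that the namelist is closed by a line containing only "/", as in the input file quoted in this thread.

```python
import re

# Matches "&CONTROL", its body, and the closing "/" line (case-insensitive).
_CONTROL = re.compile(r"(&control\b)(.*?)(^[ \t]*/[ \t]*$)", re.I | re.S | re.M)


def set_control(input_text, key, value):
    """Set `key = value` inside &CONTROL, replacing the entry or adding it."""
    match = _CONTROL.search(input_text)
    if match is None:
        raise ValueError("no &CONTROL namelist found")

    body = match.group(2)
    new_line = f"    {key} = {value} ,"
    entry = re.compile(rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.I | re.M)

    if entry.search(body):
        # Replace the existing assignment (a function avoids escape handling).
        body = entry.sub(lambda _m: new_line, body, count=1)
    else:
        # Append a new assignment just before the closing "/".
        if not body.endswith("\n"):
            body += "\n"
        body += new_line + "\n"

    return input_text[:match.start(2)] + body + input_text[match.end(2):]
```

For example, set_control(text, "restart_mode", "'restart'") switches an existing input to restart mode, and set_control(text, "max_seconds", "43200") adds or updates the time limit. The value is passed as it should appear in the file, so strings need their own quotes.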
>  
> What I am not sure about is which atomic coordinates are used when restarting the calculation. Since I am doing a geometry optimization, the positions of the atoms change. Do I need to update the input manually with the latest coordinates at 500000 seconds? If so, how can I do that?
>  
> Thanks,
> Rolly
>  
> From: pw_forum-bounces at pwscf.org [mailto:pw_forum-bounces at pwscf.org] On Behalf Of Filippo Spiga
> Sent: Tuesday, February 16, 2016 12:20 PM
> To: PWSCF Forum
> Subject: Re: [Pw_forum] Geometry optimization on QE530-GPU with memory allocation error?
>  
> Dear Rolly,
>  
> Sorry to hear about your problem. I can imagine the frustration of losing so much time and being unable to recover because of an error that happened in the middle of an SCF step. It is hard to guess what went wrong at that point, especially after the calculation ran continuously on multiple GPUs for almost 7 days without stopping.
>  
> Just a consideration, valid with or without GPU: unless there is no alternative, _never_ run continuously for so long. It is a bad idea for multiple reasons. Always safely checkpoint/restart your calculation more often.
>  
> Cheers
>  
> --
> Filippo SPIGA
> * Sent from my iPhone, sorry for typos *
> 
> On 16 Feb 2016, at 04:01, Rolly Ng <rollyng at gmail.com> wrote:
> 
> Dear Filippo and QE-GPU users,
>  
> I am running a geometry optimization on a system containing 128 atoms. It runs fine until the elapsed time reaches about 590,000 seconds, then it stops with the error below and the job fails to complete :( I have had this error 3 times, in 3 different cases.
>  
> “Error in memory allocation, program will be terminated (2) !!! Bye…”
>  
> I can confirm the error only appears after running for more than 560,000 seconds, so all the previous effort is wasted :( if I cannot restart the optimization :(
>  
> I have not seen this problem with QE520-GPU, or maybe my previous runs did not last so long.
>  
> Could you please check my input file? Thank you!
>  
> &CONTROL
>                 calculation = 'relax' ,
>                 outdir = '/home/zgdeng/Rolly/TiNSurf200',
>                 pseudo_dir = '/home/zgdeng/SSSP_acc_PBE' ,
>                 prefix = 'TiNSurf200+Biotin',
>                 verbosity = 'low' ,
>                 etot_conv_thr = 1.0D-3 ,
>                 forc_conv_thr = 1.0D-2 ,
>                 nstep = 100 ,
>                 tstress = .false. ,
>                 tprnfor = .false. ,
> /
> &SYSTEM
>                 ibrav = 14,
>                 celldm(1) = 22.9288029598d0, celldm(2)=1.2990423130d0, celldm(3)=5.2512156527d0,
>                 celldm(4) = 0.0000000000d0, celldm(5)=0.0000000000d0, celldm(6)=0.0000000000d0,
>                 nat = 128,
>                 ntyp = 6,
>                 ecutwfc = 30d0 ,
>                 ecutrho = 240d0 ,
>                 nosym = .true. ,
>                 nbnd = 600,
>                 input_dft = 'PBE' ,
>                 occupations = 'smearing' ,
>                 degauss = 0.015d0 ,
>                 smearing = 'gaussian' ,
> /
> &ELECTRONS
>                 electron_maxstep = 1000,
>                 conv_thr = 1d-06 ,
>                 mixing_mode = 'local-TF' ,
>                 mixing_beta = 0.300d0 ,
>                 diagonalization = 'david' ,
> /
> &IONS
>                 ion_dynamics = 'bfgs' ,
>                 upscale = 100.D0 ,
>                 bfgs_ndim = 3 ,
> /
> ATOMIC_SPECIES
>                 C 12.010700d0 C_pbe_v1.2.uspp.F.UPF
>                 H 1.007940d0 H.pbe-rrkjus_psl.0.1.UPF
>                 N 14.006700d0 N.pbe.theos.UPF
>                 O 15.999400d0 O.pbe-n-kjpaw_psl.0.1.UPF
>                 S 32.065000d0 S_pbe_v1.2.uspp.F.UPF
>                 Ti 47.867000d0 ti_pbe_v1.4.uspp.F.UPF
> ATOMIC_POSITIONS {alat}
>                 Ti   0.0000000000d0   0.0000000000d0   0.1021361444d0   0   0   0
>                 Ti   0.1250000000d0   0.2165113823d0   0.1021361444d0   0   0   0
>                 Ti   0.0000000000d0   0.1443365914d0   0.3062508969d0   1   1   1
>                 Ti   0.1250000000d0   0.3608479737d0   0.3062508969d0   1   1   1
>                 N    0.0000000000d0   0.1443365914d0   0.0001050243d0   0   0   0
>                 N    0.1250000000d0   0.3608479737d0   0.0001050243d0   0   0   0
>                 N    0.1250000000d0   0.0721747909d0   0.2042197767d0   1   1   1
>                 N    0.0000000000d0   0.2886731828d0   0.2042197767d0   1   1   1
>                 Ti   0.2500000000d0   0.0000000000d0   0.1021361444d0   0   0   0
>                 Ti   0.3750000000d0   0.2165113823d0   0.1021361444d0   0   0   0
>                 Ti   0.2500000000d0   0.1443365914d0   0.3062508969d0   1   1   1
>                 Ti   0.3750000000d0   0.3608479737d0   0.3062508969d0   1   1   1
>                 N    0.2500000000d0   0.1443365914d0   0.0001050243d0   0   0   0
>                 N    0.3750000000d0   0.3608479737d0   0.0001050243d0   0   0   0
>                 N    0.3750000000d0   0.0721747909d0   0.2042197767d0   1   1   1
>                 N    0.2500000000d0   0.2886731828d0   0.2042197767d0   1   1   1
>                 Ti   0.5000000000d0   0.0000000000d0   0.1021361444d0   0   0   0
>                 Ti   0.6250000000d0   0.2165113823d0   0.1021361444d0   0   0   0
>                 Ti   0.5000000000d0   0.1443365914d0   0.3062508969d0   1   1   1
>                 Ti   0.6250000000d0   0.3608479737d0   0.3062508969d0   1   1   1
>                 N    0.5000000000d0   0.1443365914d0   0.0001050243d0   0   0   0
>                 N    0.6250000000d0   0.3608479737d0   0.0001050243d0   0   0   0
>                 N    0.6250000000d0   0.0721747909d0   0.2042197767d0   1   1   1
>                 N    0.5000000000d0   0.2886731828d0   0.2042197767d0   1   1   1
>                 Ti   0.7500000000d0   0.0000000000d0   0.1021361444d0   0   0   0
>                 Ti   0.8750000000d0   0.2165113823d0   0.1021361444d0   0   0   0
>                 Ti   0.7500000000d0   0.1443365914d0   0.3062508969d0   1   1   1
>                 Ti   0.8750000000d0   0.3608479737d0   0.3062508969d0   1   1   1
>                 N    0.7500000000d0   0.1443365914d0   0.0001050243d0   0   0   0
>                 N    0.8750000000d0   0.3608479737d0   0.0001050243d0   0   0   0
>                 N    0.8750000000d0   0.0721747909d0   0.2042197767d0   1   1   1
>                 N    0.7500000000d0   0.2886731828d0   0.2042197767d0   1   1   1
>                 Ti   0.0000000000d0   0.4330097742d0   0.1021361444d0   0   0   0
>                 Ti   0.1250000000d0   0.6495211565d0   0.1021361444d0   0   0   0
>                 Ti   0.0000000000d0   0.5773463656d0   0.3062508969d0   1   1   1
>                 Ti   0.1250000000d0   0.7938577479d0   0.3062508969d0   1   1   1
>                 N    0.0000000000d0   0.5773463656d0   0.0001050243d0   0   0   0
>                 N    0.1250000000d0   0.7938577479d0   0.0001050243d0   0   0   0
>                 N    0.1250000000d0   0.5051845651d0   0.2042197767d0   1   1   1
>                 N    0.0000000000d0   0.7216959474d0   0.2042197767d0   1   1   1
>                 Ti   0.2500000000d0   0.4330097742d0   0.1021361444d0   0   0   0
>                 Ti   0.3750000000d0   0.6495211565d0   0.1021361444d0   0   0   0
>                 Ti   0.2500000000d0   0.5773463656d0   0.3062508969d0   1   1   1
>                 Ti   0.3750000000d0   0.7938577479d0   0.3062508969d0   1   1   1
>                 N    0.2500000000d0   0.5773463656d0   0.0001050243d0   0   0   0
>                 N    0.3750000000d0   0.7938577479d0   0.0001050243d0   0   0   0
>                 N    0.3750000000d0   0.5051845651d0   0.2042197767d0   1   1   1
>                 N    0.2500000000d0   0.7216959474d0   0.2042197767d0   1   1   1
>                 Ti   0.5000000000d0   0.4330097742d0   0.1021361444d0   0   0   0
>                 Ti   0.6250000000d0   0.6495211565d0   0.1021361444d0   0   0   0
>                 Ti   0.5000000000d0   0.5773463656d0   0.3062508969d0   1   1   1
>                 Ti   0.6250000000d0   0.7938577479d0   0.3062508969d0   1   1   1
>                 N    0.5000000000d0   0.5773463656d0   0.0001050243d0   0   0   0
>                 N    0.6250000000d0   0.7938577479d0   0.0001050243d0   0   0   0
>                 N    0.6250000000d0   0.5051845651d0   0.2042197767d0   1   1   1
>                 N    0.5000000000d0   0.7216959474d0   0.2042197767d0   1   1   1
>                 Ti   0.7500000000d0   0.4330097742d0   0.1021361444d0   0   0   0
>                 Ti   0.8750000000d0   0.6495211565d0   0.1021361444d0   0   0   0
>                 Ti   0.7500000000d0   0.5773463656d0   0.3062508969d0   1   1   1
>                 Ti   0.8750000000d0   0.7938577479d0   0.3062508969d0   1   1   1
>                 N    0.7500000000d0   0.5773463656d0   0.0001050243d0   0   0   0
>                 N    0.8750000000d0   0.7938577479d0   0.0001050243d0   0   0   0
>                 N    0.8750000000d0   0.5051845651d0   0.2042197767d0   1   1   1
>                 N    0.7500000000d0   0.7216959474d0   0.2042197767d0   1   1   1
>                 Ti   0.0000000000d0   0.8660325388d0   0.1021361444d0   0   0   0
>                 Ti   0.1250000000d0   1.0825309307d0   0.1021361444d0   0   0   0
>                 Ti   0.0000000000d0   1.0103691302d0   0.3062508969d0   1   1   1
>                 Ti   0.1250000000d0   1.2268675220d0   0.3062508969d0   1   1   1
>                 N    0.0000000000d0   1.0103691302d0   0.0001050243d0   0   0   0
>                 N    0.1250000000d0   1.2268675220d0   0.0001050243d0   0   0   0
>                 N    0.1250000000d0   0.9381943393d0   0.2042197767d0   1   1   1
>                 N    0.0000000000d0   1.1547057216d0   0.2042197767d0   1   1   1
>                 Ti   0.2500000000d0   0.8660325388d0   0.1021361444d0   0   0   0
>                 Ti   0.3750000000d0   1.0825309307d0   0.1021361444d0   0   0   0
>                 Ti   0.2500000000d0   1.0103691302d0   0.3062508969d0   1   1   1
>                 Ti   0.3750000000d0   1.2268675220d0   0.3062508969d0   1   1   1
>                 N    0.2500000000d0   1.0103691302d0   0.0001050243d0   0   0   0
>                 N    0.3750000000d0   1.2268675220d0   0.0001050243d0   0   0   0
>                 N    0.3750000000d0   0.9381943393d0   0.2042197767d0   1   1   1
>                 N    0.2500000000d0   1.1547057216d0   0.2042197767d0   1   1   1
>                 Ti   0.5000000000d0   0.8660325388d0   0.1021361444d0   0   0   0
>                 Ti   0.6250000000d0   1.0825309307d0   0.1021361444d0   0   0   0
>                 Ti   0.5000000000d0   1.0103691302d0   0.3062508969d0   1   1   1
>                 Ti   0.6250000000d0   1.2268675220d0   0.3062508969d0   1   1   1
>                 N    0.5000000000d0   1.0103691302d0   0.0001050243d0   0   0   0
>                 N    0.6250000000d0   1.2268675220d0   0.0001050243d0   0   0   0
>                 N    0.6250000000d0   0.9381943393d0   0.2042197767d0   1   1   1
>                 N    0.5000000000d0   1.1547057216d0   0.2042197767d0   1   1   1
>                 Ti   0.7500000000d0   0.8660325388d0   0.1021361444d0   0   0   0
>                 Ti   0.8750000000d0   1.0825309307d0   0.1021361444d0   0   0   0
>                 Ti   0.7500000000d0   1.0103691302d0   0.3062508969d0   1   1   1
>                 Ti   0.8750000000d0   1.2268675220d0   0.3062508969d0   1   1   1
>                 N    0.7500000000d0   1.0103691302d0   0.0001050243d0   0   0   0
>                 N    0.8750000000d0   1.2268675220d0   0.0001050243d0   0   0   0
>                 N    0.8750000000d0   0.9381943393d0   0.2042197767d0   1   1   1
>                 N    0.7500000000d0   1.1547057216d0   0.2042197767d0   1   1   1
>                 N    0.4062600000d0   0.9896104340d0   0.6937906120d0   1   1   1
>                 C    0.4092000000d0   0.9020160108d0   0.6045199459d0   1   1   1
>                 C    0.4577300000d0   0.7953906178d0   0.6618107087d0   1   1   1
>                 N    0.4939900000d0   0.8337513373d0   0.7754470154d0   1   1   1
>                 C    0.4605200000d0   0.9497168446d0   0.7956116835d0   1   1   1
>                 C    0.5499000000d0   0.7467544736d0   0.5886612747d0   1   1   1
>                 S    0.5127800000d0   0.7970274111d0   0.4537050324d0   1   1   1
>                 C    0.4869600000d0   0.9325824765d0   0.5090003332d0   1   1   1
>                 C    0.5593700000d0   0.6202537332d0   0.5940700268d0   1   1   1
>                 C    0.5857900000d0   0.5794118428d0   0.7112246480d0   1   1   1
>                 C    0.5913300000d0   0.4526253131d0   0.7064460418d0   1   1   1
>                 C    0.6159700000d0   0.4036254371d0   0.8208700308d0   1   1   1
>                 C    0.6181100000d0   0.2770987158d0   0.8104726238d0   1   1   1
>                 O    0.6709500000d0   0.2080416264d0   0.8994807291d0   1   1   1
>                 O    0.5738500000d0   0.2226038907d0   0.7076538214d0   1   1   1
>                 O    0.4792600000d0   1.0152795101d0   0.8997958021d0   1   1   1
>                 H    0.3676800000d0   1.0720216783d0   0.6843909360d0   1   1   1
>                 H    0.3244700000d0   0.8813742285d0   0.5695993618d0   1   1   1
>                 H    0.3864400000d0   0.7347123514d0   0.6695825079d0   1   1   1
>                 H    0.5416000000d0   0.7826340223d0   0.8344706794d0   1   1   1
>                 H    0.6311400000d0   0.7881549521d0   0.6112940141d0   1   1   1
>                 H    0.4487000000d0   0.9936374652d0   0.4486113532d0   1   1   1
>                 H    0.5677800000d0   0.9656950650d0   0.5436058444d0   1   1   1
>                 H    0.6272000000d0   0.5918826491d0   0.5355189723d0   1   1   1
>                 H    0.4775600000d0   0.5827503816d0   0.5669737540d0   1   1   1
>                 H    0.5177200000d0   0.6062890283d0   0.7701432876d0   1   1   1
>                 H    0.6681700000d0   0.6144340236d0   0.7398437733d0   1   1   1
>                 H    0.6588100000d0   0.4267094190d0   0.6464771590d0   1   1   1
>                 H    0.5087600000d0   0.4194737533d0   0.6762515517d0   1   1   1
>                 H    0.5487800000d0   0.4294374078d0   0.8812590108d0   1   1   1
>                 H    0.6993600000d0   0.4344257303d0   0.8513270816d0   1   1   1
>                 H    0.5063400000d0   0.2734743877d0   0.6728382616d0   1   1   1
> K_POINTS {automatic}
>                 4 4 1 0 0 0
>  
> <QE530-GPU memory error.png>
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://pwscf.org/mailman/listinfo/pw_forum
> 
> 
> 
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222