[Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0

Giuseppe Mattioli giuseppe.mattioli at ism.cnr.it
Fri Feb 5 15:16:41 CET 2016


Dear Iurii
Thank you. I'm less than a novice when it comes to memory allocation, parallelization, and similar issues; otherwise I would be glad to help...
Best
Giuseppe

> Really? And there is no problem on Linux?

no, not at all...;-)

On Friday, February 05, 2016 01:07:22 PM Timrov Iurii wrote:
> Dear Giuseppe,
> 
> I am going to check whether there are any extra allocations and/or a memory leak when I have some time.
> 
> > Indeed!!! Only Microsoft Windows requires more memory to do the same thing in newer versions!!!!
> 
> Really? And there is no problem on Linux?
> 
> Best regards,
> Iurii
> 
> --
> Dr. Iurii Timrov
> Postdoctoral Researcher
> École Polytechnique Fédérale de Lausanne,
> Theory and Simulation of Materials
> CH-1015 Lausanne, Switzerland
> +41 21 69 34 881
> http://people.epfl.ch/265334
> 
> ________________________________________
> From: Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
> Sent: Friday, February 5, 2016 1:55 PM
> To: Timrov Iurii
> Cc: pw_forum at pwscf.org
> Subject: Re: Re: Re: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
> 
> Dear Iurii
> 
> > It seems that this is a RAM issue.
> 
> Maybe it is something connected with memory allocation that changed between 5.1.1 and 5.2.1.
> 
> > I ran your test case with QE-5.2.1 on my local workstation with 8 cores and 64 GB RAM and the Lanczos code crashed. When I changed the input of
> > PWscf so that only the occupied states are computed (the empty states are actually not needed in the Lanczos calculation), which of course
> > decreased the RAM requirements, the code didn't crash.
> 
> With 5.1.1 the code didn't crash even on my 8-core, 16 GB RAM cluster. And it is not a large benchmark: I used to run larger ones on the same node,
> and far larger ones on two nodes of the same machine, with older versions. The problem cannot be due to the overall memory requirements, but rather
> to some problematic memory allocation (a leak somewhere?).
> 
> > If this is indeed the cause of the problem, then it seems strange to me that QE-5.1.1 does not have it. Some investigation may be needed.
> 
> Indeed!!! Only Microsoft Windows requires more memory to do the same thing in newer versions!!!!
> 
> Best
> Giuseppe
> 
> On Friday, February 05, 2016 10:07:14 AM Timrov Iurii wrote:
> > Dear Giuseppe,
> > 
> > It seems that this is a RAM issue.
> > 
> > I ran your test case with QE-5.2.1 on my local workstation with 8 cores and 64 GB RAM and the Lanczos code crashed. When I changed the input of
> > PWscf so that only the occupied states are computed (the empty states are actually not needed in the Lanczos calculation), which of course
> > decreased the RAM requirements, the code didn't crash.
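As a rough check of how many states are actually occupied in this test case, the following sketch counts valence electrons from the ATOMIC_POSITIONS card in the original post (22 C, 16 H, 2 N, 2 O). The valence charges (C 4, H 1, N 5, O 6) are assumptions typical of such pseudopotentials and should be verified against the valence charge stated in each UPF file; the function and variable names are hypothetical.

```python
# Atom counts taken from the ATOMIC_POSITIONS card in this thread (nat=42).
ATOM_COUNTS = {"C": 22, "H": 16, "N": 2, "O": 2}

# ASSUMPTION: valence charges of the pseudopotentials; verify in the UPF headers.
ASSUMED_Z_VALENCE = {"C": 4, "H": 1, "N": 5, "O": 6}


def occupied_bands(atom_counts, z_valence, total_charge=0):
    """Return the number of doubly occupied bands for a closed-shell (nspin=1) system."""
    n_electrons = sum(z_valence[s] * n for s, n in atom_counts.items()) - total_charge
    if n_electrons % 2 != 0:
        raise ValueError("odd electron count: not a closed-shell system")
    return n_electrons // 2


n_occ = occupied_bands(ATOM_COUNTS, ASSUMED_Z_VALENCE)
n_empty = 75 - n_occ  # nbnd=75 is the value used in the pw.x input
print(f"occupied bands: {n_occ}, empty bands with nbnd=75: {n_empty}")
```

Under these assumptions the system has 63 doubly occupied states, so nbnd=75 adds 12 empty ones. Setting nbnd=63, or omitting nbnd so that pw.x falls back to its default for insulators, would compute only the occupied states.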
> > 
> > On the Blue Gene machine you may try to optimize RAM usage too. As you may know, one can allocate all the cores of a node but use only a few of
> > them, which increases the RAM available per core. Please note that with Davidson the RAM requirements are considerably larger, so it is not easy
> > to optimize the job submission script for large systems.
> > 
> > To verify whether you also have a memory issue, you may try to decrease the value of celldm(1), the cutoffs, etc., and see whether the code
> > still crashes.
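To see how strongly celldm(1) and the cutoff affect the problem size, here is a minimal sketch that estimates the number of plane waves inside the cutoff sphere, N = V * Gmax^3 / (6 pi^2) with Gmax = sqrt(ecut) in Rydberg atomic units. It assumes a simple-cubic cell (ibrav=1) and is only a continuum estimate, not what the code allocates; the function name is hypothetical. For celldm(1)=40 and ecutwfc=40 it gives about 2.73e5, which appears consistent with the 273425 reported in the "PW" G-vecs "Sum" row of the output quoted in this thread.

```python
import math


def estimate_plane_waves(celldm1_bohr, ecut_ry):
    """Approximate count of plane waves with |G|^2 <= ecut (Ry) in a simple-cubic cell."""
    volume = celldm1_bohr ** 3            # bohr^3, valid for ibrav=1
    g_max = math.sqrt(ecut_ry)            # bohr^-1, since E = |G|^2 in Rydberg units
    return volume * g_max ** 3 / (6.0 * math.pi ** 2)


baseline = estimate_plane_waves(40.0, 40.0)   # celldm(1) and ecutwfc from the pw.x input
smaller_cell = estimate_plane_waves(30.0, 40.0)
lower_cutoff = estimate_plane_waves(40.0, 30.0)

print(f"baseline:        {baseline:10.0f}")
print(f"celldm(1) = 30:  {smaller_cell:10.0f} ({smaller_cell / baseline:.2f}x)")
print(f"ecutwfc = 30 Ry: {lower_cutoff:10.0f} ({lower_cutoff / baseline:.2f}x)")
```

The count scales with the cube of celldm(1) and with the 3/2 power of the cutoff, so shrinking the cell is the quicker way to test whether a crash is memory related.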
> > 
> > If this is indeed the cause of the problem, then it seems strange to me that QE-5.1.1 does not have it. Some investigation may be needed.
> > 
> > HTH
> > 
> > Best regards,
> > Iurii
> > 
> > ________________________________________
> > From: Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
> > Sent: Thursday, February 4, 2016 5:59 PM
> > To: Timrov Iurii
> > Cc: pw_forum at pwscf.org
> > Subject: Re: Re: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
> > 
> > Silent crash on bluegene with 5.2.1 (I have no time to compile 5.3.0 now. I may try tomorrow if you think it is important).
> > 
> >      Program turboTDDFT v.5.2.1 starts on  4Feb2016 at 17:56:55
> >      
> >      This program is part of the open-source Quantum ESPRESSO suite
> >      for quantum simulation of materials; please cite
> >      
> >          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
> >          
> >           URL http://www.quantum-espresso.org",
> >      
> >      in publications or presentations arising from this work. More details at
> >      http://www.quantum-espresso.org/quote
> >      
> >      Parallel version (MPI & OpenMP), running on    2048 processor cores
> >      Number of MPI processes:               512
> >      Threads/MPI process:                     4
> >      R & G space division:  proc/nbgrp/npool/nimage =     512
> >      
> >      Reading data from directory:
> >      /gpfs/scratch/userexternal/gmattiol/test/tddft/run/tmp/././l0-5.3.0.save
> >    
> >    Info: using nr1, nr2, nr3 values from input
> >    
> >    Info: using nr1, nr2, nr3 values from input
> >    
> >      IMPORTANT: XC functional enforced from input :
> >      Exchange-correlation      =  SLA  PW   PBE  PBE ( 1  4  3  4 0 0)
> >      Any further DFT definition will be discarded
> >      Please, verify this is what you really want
> >      
> >      
> >      Parallelization info
> >      --------------------
> >      sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
> >      Min          78      38      8                12054     4220     492
> >      Max          80      40     10                12104     4300     550
> >      Sum       40733   20369   5097              6186431  2186841  273425
> >      Tot       20367   10185   2549
> >      
> >      
> >      negative rho (up, down):  9.597E-02 0.000E+00
> >      
> >      Subspace diagonalization in iterative solution of the eigenvalue problem:
> >      scalapack distributed-memory algorithm (size of sub-group: 16* 16 procs)
> >      
> >      
> >      Warning: There are virtual states in the input file, trying to disregard in response calculation
> >      
> >      Ultrasoft (Vanderbilt) Pseudopotentials
> >      
> >      Normal read
> >      
> >      Gamma point algorithm
> > 
> > 2016-02-04 17:57:18.063 (WARN ) [0x40000ee8d50] :7014845:ibm.runjob.client.Job: terminated by signal 6
> > 2016-02-04 17:57:18.065 (WARN ) [0x40000ee8d50] :7014845:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 295
> > 
> > On Thursday, February 04, 2016 03:46:10 PM Timrov Iurii wrote:
> > > Dear Giuseppe,
> > > 
> > > As far as I understand, the code crashes when it tries to write the vectors "d0psi" to disk. The first thing to do, I think, is to check that
> > > you have enough disk space. If this is not the issue, then let's continue looking for the cause.
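A quick way to check the free space on the filesystem that holds the scratch directory is sketched below with Python's standard shutil.disk_usage. The path and the required size are hypothetical placeholders and are not taken from this thread; the helper names are also hypothetical.

```python
import shutil


def free_gib(path):
    """Return the free space, in GiB, of the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_room(path, required_gib):
    """True if the filesystem containing `path` has at least `required_gib` GiB free."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    scratch = "."        # hypothetical: replace with the outdir used by the run
    needed_gib = 10.0    # hypothetical requirement; not taken from the thread
    print(f"{scratch}: {free_gib(scratch):.1f} GiB free, enough: {has_room(scratch, needed_gib)}")
```

On a cluster this should be run on the compute node itself, since node-local scratch directories are usually not visible from the login node.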
> > > 
> > > You may want to look at the routine TDDFPT/src/lr_solve_e.f90, lines 110-138, where the code writes the vectors to disk in parallel. Please
> > > make sure that "outdir" is the same in PWscf and in Lanczos/Davidson (and don't specify wfcdir). If this does not solve the problem, could you
> > > please also report the output of Lanczos/Davidson (preferably Lanczos)?
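The consistency check suggested above can be sketched as a small script that pulls the quoted values of prefix, outdir, and wfcdir out of the two inputs and compares them. The excerpts are copied from the inputs posted in this thread; the helper name and the regular-expression approach are illustrative assumptions, not a Quantum ESPRESSO tool, and the parser only handles simple quoted strings.

```python
import os
import re


def namelist_string(text, key):
    """Return the quoted string assigned to `key` in namelist text, or None if absent."""
    pattern = r"\b" + re.escape(key) + r"""\s*=\s*(['"])(.*?)\1"""
    match = re.search(pattern, text)
    return os.path.normpath(match.group(2)) if match else None


# Excerpts of the inputs posted in this thread.
PW_INPUT = """
 &control
    prefix='l0-5.3.0',
    outdir='/home/mattioligi/cocat/test_tddft/5.2.1/l0/5.3.0/run/tmp/',
 /
"""
LANCZOS_INPUT = """
&lr_input
    prefix="l0-5.3.0",
    outdir='/state/partition1/mattioligi/34339',
    wfcdir='/state/partition1/mattioligi/34339',
/
"""

for key in ("prefix", "outdir"):
    pw_value = namelist_string(PW_INPUT, key)
    lr_value = namelist_string(LANCZOS_INPUT, key)
    status = "same" if pw_value == lr_value else "DIFFERENT"
    print(f"{key}: {status} ({pw_value} vs {lr_value})")

if namelist_string(LANCZOS_INPUT, "wfcdir") is not None:
    print("wfcdir is set in the Lanczos input")
```

A textual mismatch is not necessarily an error: a job script may copy the .save directory to node-local scratch before the run, which the thread does not state either way.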
> > > 
> > > HTH
> > > 
> > > Best regards,
> > > Iurii Timrov
> > > Post-doctoral researcher
> > > THEOS - École Polytechnique Fédérale de Lausanne
> > > Lausanne, Switzerland
> > > 
> > > 
> > > ________________________________________
> > > From: pw_forum-bounces at pwscf.org <pw_forum-bounces at pwscf.org> on behalf of Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
> > > Sent: Thursday, February 4, 2016 11:34 AM
> > > To: pw_forum at pwscf.org
> > > Subject: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
> > > 
> > > Dear All
> > > I'm having problems when performing nontrivial runs of turbo_davidson.x and turbo_lanczos.x with the 5.2.1 and 5.3.0 versions of QE.
> > > Let me say first that "trivial" runs (a CH4 molecule with the same pseudopotentials and cutoffs but a smaller 30 a.u.^3 cubic cell) work fine
> > > with all the tested versions.
> > > However, the input files for a nontrivial case that leads to a crash should run on a decent PC in about 1 hr, so they provide a significant but
> > > not huge test. *Note* that if I run the same input files with the 5.1.1 version (compiled in the very same environment), everything goes fine
> > > (albeit more slowly)! The 5.3.0 (and 5.2.1) crashes have been reproduced on two different machines (Intel, 8 cores, 16 GB RAM; AMD, 32 cores,
> > > 64 GB RAM), so they should not be considered erratic.
> > > 
> > > Here is the pw.x input. The PPs are quite old and can be found in the online library (or provided by me on request).
> > > 
> > >  &control
> > >  
> > >     calculation = 'scf'
> > >     restart_mode='from_scratch',
> > >     prefix='l0-5.3.0',
> > >     pseudo_dir = '/home/mattioligi/PP_UPF/',
> > >     outdir='/home/mattioligi/cocat/test_tddft/5.2.1/l0/5.3.0/run/tmp/',
> > >     nstep=300,
> > >     max_seconds=80000,
> > >     disk_io='low',
> > >     tprnfor=.true.,
> > >  
> > >  /
> > >  &system
> > >  
> > >     ibrav=1, celldm(1)=40.000000,
> > >     nat=42, ntyp=4, nbnd=75,
> > >     ecutwfc = 40.0,
> > >     ecutrho = 320.0,
> > >     nspin=1,
> > >  
> > >  /
> > >  &electrons
> > >  
> > >     diagonalization='david',
> > >     mixing_mode='plain'
> > >     mixing_beta=0.1
> > >     conv_thr=1.0d-8
> > >     electron_maxstep=100
> > >  
> > >  /
> > >  &ions
> > >  /
> > > 
> > > ATOMIC_SPECIES
> > > O    15.999    O_pbe.van.UPF
> > > N    14.007    N.pbe-van_bm.UPF
> > > C    12.011    C_pbe.van.UPF
> > > H     1.008    H_pbe.van.UPF
> > > ATOMIC_POSITIONS {angstrom}
> > > C        4.815369179  12.355337788   8.111406911
> > > C        5.639537337  12.072210478   7.018248617
> > > C        6.373883049  10.886794669   6.974735758
> > > H        5.707874252  12.778745273   6.179910928
> > > C        4.734413944  11.441350355   9.166316558
> > > H        4.235443595  13.287281698   8.140567718
> > > C        6.304598307   9.977077773   8.041477142
> > > H        7.012644682  10.659891408   6.111132336
> > > C        5.477180541  10.260422385   9.138835842
> > > H        4.092409998  11.653694694  10.031778418
> > > H        5.418528381   9.546881383   9.971310698
> > > N        7.058612774   8.759574945   8.006208499
> > > C        6.384981399   7.544139013   8.340645249
> > > C        6.997532612   6.588483316   9.168188787
> > > C        5.084708421   7.308024697   7.864810575
> > > C        6.325550737   5.410241765   9.493833204
> > > H        8.006262126   6.776794433   9.557919083
> > > C        4.414663626   6.134355690   8.210976959
> > > H        4.597637090   8.055249046   7.224770074
> > > C        5.030975670   5.176070562   9.020776666
> > > H        6.819890970   4.670618768  10.138154855
> > > H        3.397721512   5.964689741   7.832306200
> > > H        4.503298572   4.249946635   9.284425745
> > > C        8.412602212   8.773905175   7.652414992
> > > C        9.197305040   9.938168667   7.841458619
> > > C        9.043381168   7.634703664   7.098599788
> > > C       10.533008285   9.972397555   7.486007356
> > > H        8.740413757  10.828552107   8.290447985
> > > C       10.383506998   7.674400214   6.758021800
> > > H        8.466388332   6.717306584   6.931252215
> > > C       11.175184928   8.838234071   6.927523312
> > > H       11.098162573  10.894629696   7.663657304
> > > H       10.849606517   6.778483121   6.322529487
> > > C       12.554045113   8.768090174   6.529797787
> > > C       13.538745611   9.729179498   6.474718127
> > > H       12.882286114   7.769870632   6.203237321
> > > C       13.338246843  11.096686263   6.810664645
> > > N       13.160471613  12.223162736   7.083088078
> > > C       14.914360413   9.407055683   6.034105289
> > > O       15.832284936  10.221452163   5.956798921
> > > O       15.091537629   8.085358800   5.710801225
> > > H       16.043983143   8.016066678   5.436328923
> > > K_POINTS {gamma}
> > > 
> > > And here are the turbo_lanczos.x and turbo_davidson.x input files:
> > > 
> > > lanczos
> > > 
> > > &lr_input
> > > 
> > >     prefix="l0-5.3.0",
> > >     outdir='/state/partition1/mattioligi/34339',
> > >     wfcdir='/state/partition1/mattioligi/34339',
> > >     restart_step=6,
> > >     restart=.false.
> > > 
> > > /
> > > &lr_control
> > > 
> > >     itermax=12,
> > >     ipol=4,
> > > 
> > > /
> > > 
> > > davidson
> > > 
> > > &lr_input
> > > 
> > >     prefix="l0-5.3.0",
> > >     outdir='/state/partition1/mattioligi/34340',
> > >     restart=.false.
> > > 
> > > /
> > > &lr_dav
> > > 
> > >     num_eign=2
> > >     num_init=4
> > >     num_basis_max=10
> > >     residue_conv_thr=1.0E-4
> > >     start=0.1
> > >     finish=1.5
> > >     step=0.0002
> > >     broadening=0.005
> > >     reference=0.2
> > >     p_nbnd_occ=5
> > >     p_nbnd_virt=5
> > >     poor_of_ram=.false.
> > >     poor_of_ram2=.false.
> > > 
> > > /
> > > 
> > > In both cases and on both machines the CRASH report is something like
> > > 
> > >  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > >  
> > >      task #         1
> > >      from davcio : error #        20
> > >      error while writing from file "/state/partition1/mattioligi/34340/l0-5.3.0.d0psi.32"
> > >  
> > >  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > > 
> > > I suppose that it is some kind of I/O error, but I would warmly welcome your opinion...:-)
> > > Thank you in advance
> > > Giuseppe
> > > 
> > > ********************************************************
> > > - Article 1 - Men are born and remain
> > > free and equal in rights. Social distinctions
> > > may be founded only upon the common good.
> > > - Article 2 - The aim of every political association
> > > is the preservation of the natural and
> > > imprescriptible rights of man. These rights are liberty,
> > > property, security, and resistance to oppression.
> > > ********************************************************
> > > 
> > >    Giuseppe Mattioli
> > >    CNR - ISTITUTO DI STRUTTURA DELLA MATERIA
> > >    v. Salaria Km 29,300 - C.P. 10
> > >    I 00015 - Monterotondo Stazione (RM), Italy
> > >    Tel + 39 06 90672836 - Fax +39 06 90672316
> > >    E-mail: <giuseppe.mattioli at ism.cnr.it>
> > >    http://www.ism.cnr.it/english/staff/mattiolig
> > >    ResearcherID: F-6308-2012
> > > 
> > > _______________________________________________
> > > Pw_forum mailing list
> > > Pw_forum at pwscf.org
> > > http://pwscf.org/mailman/listinfo/pw_forum
> > 
> 

********************************************************
- Article 1 - Men are born and remain
free and equal in rights. Social distinctions
may be founded only upon the common good.
- Article 2 - The aim of every political association
is the preservation of the natural and
imprescriptible rights of man. These rights are liberty,
property, security, and resistance to oppression.
********************************************************

   Giuseppe Mattioli                            
   CNR - ISTITUTO DI STRUTTURA DELLA MATERIA   
   v. Salaria Km 29,300 - C.P. 10                
   I 00015 - Monterotondo Stazione (RM), Italy    
   Tel + 39 06 90672836 - Fax +39 06 90672316    
   E-mail: <giuseppe.mattioli at ism.cnr.it>
   http://www.ism.cnr.it/english/staff/mattiolig
   ResearcherID: F-6308-2012



