[Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
Giuseppe Mattioli
giuseppe.mattioli at ism.cnr.it
Fri Feb 5 15:16:41 CET 2016
Dear Iurii
Thank you. I'm less than a dummy when it comes to allocation, parallelization, and similar issues; otherwise I would be glad to help...
Best
Giuseppe
> Really? And there is no problem on Linux?
no, not at all...;-)
On Friday, February 05, 2016 01:07:22 PM Timrov Iurii wrote:
> Dear Giuseppe,
>
> I am going to check whether there are extra allocations and/or a memory leak when I have some time.
>
> > Indeed!!! Only Microsoft Windows requires more memory to do the same thing in newer versions!!!!
>
> Really? And there is no problem on Linux?
>
> Best regards,
> Iurii
>
> --
> Dr. Iurii Timrov
> Postdoctoral Researcher
> École Polytechnique Fédérale de Lausanne,
> Theory and Simulation of Materials
> CH-1015 Lausanne, Switzerland
> +41 21 69 34 881
> http://people.epfl.ch/265334
>
> ________________________________________
> From: Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
> Sent: Friday, February 5, 2016 1:55 PM
> To: Timrov Iurii
> Cc: pw_forum at pwscf.org
> Subject: Re: Re: Re: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
>
> Dear Iurii
>
> > It seems that this is a RAM issue.
>
> Maybe it is something connected with memory allocation that changed between 5.1.1 and 5.2.1.
>
> > I ran your test case with QE-5.2.1 on my local workstation with 8 cores and 64 GB RAM and the Lanczos code crashed. When I changed the input of
> > PWscf so that only the occupied states are computed (actually, the empty states are not needed in the Lanczos calculation), which of course
> > decreased the RAM requirements, the code didn't crash.
>
> The code didn't crash even on my 8-core, 16 GB RAM cluster with 5.1.1, and it is not a large benchmark. I used to run larger ones on the same node
> and far larger ones on two nodes of the same machine with older versions. The problem cannot be due to the overall memory requirements; it looks
> more like some problematic memory allocation (a leak somewhere?).
>
> > If this is indeed the reason for the problem, then it seems strange to me that QE-5.1.1 does not have it. Some investigation
> > may be needed.
>
> Indeed!!! Only Microsoft Windows requires more memory to do the same thing in newer versions!!!!
>
> Best
> Giuseppe
>
> On Friday, February 05, 2016 10:07:14 AM Timrov Iurii wrote:
> > Dear Giuseppe,
> >
> > It seems that this is a RAM issue.
> >
> > > I ran your test case with QE-5.2.1 on my local workstation with 8 cores and 64 GB RAM and the Lanczos code crashed. When I changed the input of
> > > PWscf so that only the occupied states are computed (actually, the empty states are not needed in the Lanczos calculation), which of course
> > > decreased the RAM requirements, the code didn't crash.
> >
> > > On the Blue Gene machine you may try to optimize RAM usage too. As you may know, one can allocate all the cores of a node but use only a few of
> > > them, which increases the RAM available per core. Please note that with Davidson the RAM requirements are even larger, so it is not easy to
> > > optimize the job submission script for large systems.
> >
> > > To verify whether you also have a memory issue, you may try to decrease the value of celldm(1), the cutoffs, etc. and see whether the code still
> > > crashes.
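
A back-of-the-envelope sketch, in Python, of how celldm(1) and ecutwfc drive the size of one set of wavefunctions. It assumes a cubic cell (ibrav=1) and the usual count of plane waves inside the cutoff sphere, N ≈ V·Gcut³/(6π²) with Gcut = sqrt(ecutwfc) in Rydberg atomic units. The function names are hypothetical, and the estimate ignores the Gamma-point storage trick and every other array the code allocates.

```python
import math

def estimate_plane_waves(celldm1_bohr, ecutwfc_ry):
    """Rough number of plane waves with kinetic energy below ecutwfc in a cubic cell."""
    volume = celldm1_bohr ** 3          # cubic cell (ibrav=1), volume in bohr^3
    g_cut = math.sqrt(ecutwfc_ry)       # Rydberg units: kinetic energy = |G|^2
    return volume * g_cut ** 3 / (6.0 * math.pi ** 2)

def wavefunction_megabytes(n_pw, nbnd, bytes_per_coeff=16):
    """Storage for nbnd bands of n_pw double-precision complex coefficients, in MB."""
    return n_pw * nbnd * bytes_per_coeff / 1.0e6

if __name__ == "__main__":
    # Values taken from the pw.x input in this thread, plus a smaller test cell.
    for celldm1 in (40.0, 30.0):
        n_pw = estimate_plane_waves(celldm1, 40.0)
        print(celldm1, round(n_pw), round(wavefunction_megabytes(n_pw, 75), 1))
```

For celldm(1)=40 and ecutwfc=40 this gives roughly 2.7e5 plane waves. The count scales with the cube of celldm(1) and with ecutwfc to the power 3/2, so a modest reduction in either shrinks the test quickly.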
> >
> > > If this is indeed the reason for the problem, then it seems strange to me that QE-5.1.1 does not have it. Some investigation
> > > may be needed.
> >
> > HTH
> >
> > Best regards,
> > Iurii
> >
> > --
> > Dr. Iurii Timrov
> > Postdoctoral Researcher
> > École Polytechnique Fédérale de Lausanne,
> > Theory and Simulation of Materials
> > CH-1015 Lausanne, Switzerland
> > +41 21 69 34 881
> > http://people.epfl.ch/265334
> >
> > ________________________________________
> > From: Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
> > Sent: Thursday, February 4, 2016 5:59 PM
> > To: Timrov Iurii
> > Cc: pw_forum at pwscf.org
> > Subject: Re: Re: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
> >
> > > Silent crash on Blue Gene with 5.2.1 (I have no time to compile 5.3.0 now; I may try tomorrow if you think it is important).
> >
> > Program turboTDDFT v.5.2.1 starts on 4Feb2016 at 17:56:55
> >
> > This program is part of the open-source Quantum ESPRESSO suite
> > for quantum simulation of materials; please cite
> >
> > "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
> >
> > URL http://www.quantum-espresso.org",
> >
> > in publications or presentations arising from this work. More details at
> > http://www.quantum-espresso.org/quote
> >
> > Parallel version (MPI & OpenMP), running on 2048 processor cores
> > Number of MPI processes: 512
> > Threads/MPI process: 4
> > R & G space division: proc/nbgrp/npool/nimage = 512
> >
> > Reading data from directory:
> > /gpfs/scratch/userexternal/gmattiol/test/tddft/run/tmp/././l0-5.3.0.save
> >
> > Info: using nr1, nr2, nr3 values from input
> >
> > Info: using nr1, nr2, nr3 values from input
> >
> > IMPORTANT: XC functional enforced from input :
> > Exchange-correlation = SLA PW PBE PBE ( 1 4 3 4 0 0)
> > Any further DFT definition will be discarded
> > Please, verify this is what you really want
> >
> >
> > Parallelization info
> > --------------------
> > > sticks:   dense  smooth      PW     G-vecs:    dense   smooth       PW
> > > Min          78      38       8                12054     4220      492
> > > Max          80      40      10                12104     4300      550
> > > Sum       40733   20369    5097              6186431  2186841   273425
> > > Tot       20367   10185    2549
> >
> >
> > negative rho (up, down): 9.597E-02 0.000E+00
> >
> > Subspace diagonalization in iterative solution of the eigenvalue problem:
> > scalapack distributed-memory algorithm (size of sub-group: 16* 16 procs)
> >
> >
> > Warning: There are virtual states in the input file, trying to disregard in response calculation
> >
> > Ultrasoft (Vanderbilt) Pseudopotentials
> >
> > Normal read
> >
> > Gamma point algorithm
> >
> > 2016-02-04 17:57:18.063 (WARN ) [0x40000ee8d50] :7014845:ibm.runjob.client.Job: terminated by signal 6
> > 2016-02-04 17:57:18.065 (WARN ) [0x40000ee8d50] :7014845:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 295
> >
> > On Thursday, February 04, 2016 03:46:10 PM Timrov Iurii wrote:
> > > Dear Giuseppe,
> > >
> > > > As far as I understand, the code crashes when it tries to write the vectors "d0psi" to disk. The first thing to do, I think, is to check that you
> > > > have enough space on the disk. If this is not the issue, then let's continue looking for a reason.
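
A minimal sketch of such a disk-space check, in Python. The helper name `has_free_space`, the path, and the 10 GiB threshold are assumptions for illustration; in practice the path would be the outdir of the run.

```python
import shutil

def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= required_bytes

if __name__ == "__main__":
    # Hypothetical scratch directory and threshold; substitute the real outdir.
    scratch = "."
    needed = 10 * 1024**3  # 10 GiB, an arbitrary illustrative value
    print(scratch, "has enough space:", has_free_space(scratch, needed))
```

On a parallel run each process writes its own file, so the check is best made on the node and filesystem where the files actually land.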
> > >
> > > > You may want to look at the routine TDDFPT/src/lr_solve_e.f90, lines 110-138, where the code writes the vectors to disk in parallel. Please make
> > > > sure that the "outdir" is the same in PWscf and in Lanczos/Davidson (and don't specify wfcdir). If this does not solve the problem, could you
> > > > please also report the output of Lanczos/Davidson (preferably Lanczos)?
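
The outdir/wfcdir check can be sketched in Python as below. This is only an illustration: `namelist_value` and `check_dirs` are hypothetical helpers, the regular expression handles simple quoted assignments only (no comments or continuation lines), and the paths in the usage are placeholders.

```python
import re

def namelist_value(text, key):
    """Return the quoted string assigned to `key` in namelist-style text, or None."""
    match = re.search(r"\b" + re.escape(key) + r"\s*=\s*['\"]([^'\"]*)['\"]", text)
    return match.group(1) if match else None

def check_dirs(pw_input, turbo_input):
    """Return a list of warnings about outdir/wfcdir settings (empty if none)."""
    warnings = []
    pw_out = namelist_value(pw_input, "outdir")
    turbo_out = namelist_value(turbo_input, "outdir")
    if pw_out is None or turbo_out is None:
        warnings.append("outdir missing in one of the inputs")
    elif pw_out.rstrip("/") != turbo_out.rstrip("/"):
        warnings.append("outdir differs: %r vs %r" % (pw_out, turbo_out))
    if namelist_value(turbo_input, "wfcdir") is not None:
        warnings.append("wfcdir is set in the turbo input")
    return warnings

if __name__ == "__main__":
    # Placeholder inputs; read the real files with open(...).read() instead.
    pw_text = "&control\n outdir='/scratch/run/tmp/',\n/"
    turbo_text = "&lr_input\n outdir='/scratch/other',\n wfcdir='/scratch/other',\n/"
    for warning in check_dirs(pw_text, turbo_text):
        print("WARNING:", warning)
```

A differing outdir is not necessarily an error if the job script copies the .save directory to local scratch first, so the warnings are prompts to double-check rather than a verdict.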
> > >
> > > HTH
> > >
> > > Best regards,
> > > Iurii Timrov
> > > Post-doctoral researcher
> > > THEOS - École Polytechnique Fédérale de Lausanne
> > > Lausanne, Switzerland
> > >
> > >
> > > ________________________________________
> > > From: pw_forum-bounces at pwscf.org <pw_forum-bounces at pwscf.org> on behalf of Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
> > > Sent: Thursday, February 4, 2016 11:34 AM
> > > To: pw_forum at pwscf.org
> > > Subject: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
> > >
> > > Dear All
> > > > I'm having problems when performing nontrivial runs of turbo_davidson.x and turbo_lanczos.x with the 5.2.1 and 5.3.0 versions of QE.
> > > > Let me say first that "trivial" runs (a CH4 molecule with the same pseudopotentials and cutoffs but a smaller 30 a.u.^3 cubic cell) work fine with all
> > > > the tested versions.
> > > > However, the input files for a nontrivial case that leads to a crash should run on a decent PC in about 1 hr, so they provide a significant but not
> > > > huge test. *Note* that if I run the same input files with the 5.1.1 version (compiled in the very same environment) everything goes fine (though
> > > > more slowly)! The 5.3.0 (and 5.2.1) crashes have been reproduced on two different machines (Intel, 8 cores, 16 GB RAM; AMD, 32 cores, 64 GB RAM),
> > > > so they should not be considered erratic.
> > >
> > > > Here is the pw.x input. The PPs are quite old and can be found in the online library (or provided by me on demand).
> > >
> > > &control
> > >
> > > calculation = 'scf'
> > > restart_mode='from_scratch',
> > > prefix='l0-5.3.0',
> > > pseudo_dir = '/home/mattioligi/PP_UPF/',
> > > outdir='/home/mattioligi/cocat/test_tddft/5.2.1/l0/5.3.0/run/tmp/',
> > > nstep=300,
> > > max_seconds=80000,
> > > disk_io='low',
> > > tprnfor=.true.,
> > >
> > > /
> > > &system
> > >
> > > ibrav=1, celldm(1)=40.000000,
> > > nat=42, ntyp=4, nbnd=75,
> > > ecutwfc = 40.0,
> > > ecutrho = 320.0,
> > > nspin=1,
> > >
> > > /
> > > &electrons
> > >
> > > diagonalization='david',
> > > mixing_mode='plain'
> > > mixing_beta=0.1
> > > conv_thr=1.0d-8
> > > electron_maxstep=100
> > >
> > > /
> > > &ions
> > > /
> > >
> > > ATOMIC_SPECIES
> > > O 15.999 O_pbe.van.UPF
> > > N 14.007 N.pbe-van_bm.UPF
> > > C 12.011 C_pbe.van.UPF
> > > H 1.008 H_pbe.van.UPF
> > > ATOMIC_POSITIONS {angstrom}
> > > C 4.815369179 12.355337788 8.111406911
> > > C 5.639537337 12.072210478 7.018248617
> > > C 6.373883049 10.886794669 6.974735758
> > > H 5.707874252 12.778745273 6.179910928
> > > C 4.734413944 11.441350355 9.166316558
> > > H 4.235443595 13.287281698 8.140567718
> > > C 6.304598307 9.977077773 8.041477142
> > > H 7.012644682 10.659891408 6.111132336
> > > C 5.477180541 10.260422385 9.138835842
> > > H 4.092409998 11.653694694 10.031778418
> > > H 5.418528381 9.546881383 9.971310698
> > > N 7.058612774 8.759574945 8.006208499
> > > C 6.384981399 7.544139013 8.340645249
> > > C 6.997532612 6.588483316 9.168188787
> > > C 5.084708421 7.308024697 7.864810575
> > > C 6.325550737 5.410241765 9.493833204
> > > H 8.006262126 6.776794433 9.557919083
> > > C 4.414663626 6.134355690 8.210976959
> > > H 4.597637090 8.055249046 7.224770074
> > > C 5.030975670 5.176070562 9.020776666
> > > H 6.819890970 4.670618768 10.138154855
> > > H 3.397721512 5.964689741 7.832306200
> > > H 4.503298572 4.249946635 9.284425745
> > > C 8.412602212 8.773905175 7.652414992
> > > C 9.197305040 9.938168667 7.841458619
> > > C 9.043381168 7.634703664 7.098599788
> > > C 10.533008285 9.972397555 7.486007356
> > > H 8.740413757 10.828552107 8.290447985
> > > C 10.383506998 7.674400214 6.758021800
> > > H 8.466388332 6.717306584 6.931252215
> > > C 11.175184928 8.838234071 6.927523312
> > > H 11.098162573 10.894629696 7.663657304
> > > H 10.849606517 6.778483121 6.322529487
> > > C 12.554045113 8.768090174 6.529797787
> > > C 13.538745611 9.729179498 6.474718127
> > > H 12.882286114 7.769870632 6.203237321
> > > C 13.338246843 11.096686263 6.810664645
> > > N 13.160471613 12.223162736 7.083088078
> > > C 14.914360413 9.407055683 6.034105289
> > > O 15.832284936 10.221452163 5.956798921
> > > O 15.091537629 8.085358800 5.710801225
> > > H 16.043983143 8.016066678 5.436328923
> > > K_POINTS {gamma}
> > >
> > > > And here are the turbo_lanczos.x and turbo_davidson.x input files
> > >
> > > lanczos
> > >
> > > &lr_input
> > >
> > > prefix="l0-5.3.0",
> > > outdir='/state/partition1/mattioligi/34339',
> > > wfcdir='/state/partition1/mattioligi/34339',
> > > restart_step=6,
> > > restart=.false.
> > >
> > > /
> > > &lr_control
> > >
> > > itermax=12,
> > > ipol=4,
> > >
> > > /
> > >
> > > davidson
> > >
> > > &lr_input
> > >
> > > prefix="l0-5.3.0",
> > > outdir='/state/partition1/mattioligi/34340',
> > > restart=.false.
> > >
> > > /
> > > &lr_dav
> > >
> > > num_eign=2
> > > num_init=4
> > > num_basis_max=10
> > > residue_conv_thr=1.0E-4
> > > start=0.1
> > > finish=1.5
> > > step=0.0002
> > > broadening=0.005
> > > reference=0.2
> > > p_nbnd_occ=5
> > > p_nbnd_virt=5
> > > poor_of_ram=.false.
> > > poor_of_ram2=.false.
> > >
> > > /
> > >
> > > In both cases and on both machines the CRASH report is something like
> > >
> > > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > >
> > > task # 1
> > > from davcio : error # 20
> > > error while writing from file "/state/partition1/mattioligi/34340/l0-5.3.0.d0psi.32"
> > >
> > > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > >
> > > > I suppose that it is some kind of I/O error, but I would warmly welcome your opinion...:-)
> > > Thank you in advance
> > > Giuseppe
> > >
> > > ********************************************************
> > > > - Article 1 - Men are born and remain
> > > > free and equal in rights. Social distinctions
> > > > may be founded only on the common good
> > > > - Article 2 - The aim of every political association
> > > > is the preservation of the natural and
> > > > imprescriptible rights of man. These rights are liberty,
> > > > property, security, and resistance to oppression.
> > > ********************************************************
> > >
> > > Giuseppe Mattioli
> > > CNR - ISTITUTO DI STRUTTURA DELLA MATERIA
> > > v. Salaria Km 29,300 - C.P. 10
> > > I 00015 - Monterotondo Stazione (RM), Italy
> > > Tel + 39 06 90672836 - Fax +39 06 90672316
> > > E-mail: <giuseppe.mattioli at ism.cnr.it>
> > > http://www.ism.cnr.it/english/staff/mattiolig
> > > ResearcherID: F-6308-2012
> > >
> > > _______________________________________________
> > > Pw_forum mailing list
> > > Pw_forum at pwscf.org
> > > http://pwscf.org/mailman/listinfo/pw_forum
> >
> > ********************************************************
> > - Article premier - Les hommes naissent et demeurent
> > libres et égaux en droits. Les distinctions sociales
> > ne peuvent être fondées que sur l'utilité commune
> > - Article 2 - Le but de toute association politique
> > est la conservation des droits naturels et
> > imprescriptibles de l'homme. Ces droits sont la liberté,
> > la propriété, la sûreté et la résistance à l'oppression.
> > ********************************************************
> >
> > Giuseppe Mattioli
> > CNR - ISTITUTO DI STRUTTURA DELLA MATERIA
> > v. Salaria Km 29,300 - C.P. 10
> > I 00015 - Monterotondo Stazione (RM), Italy
> > Tel + 39 06 90672836 - Fax +39 06 90672316
> > E-mail: <giuseppe.mattioli at ism.cnr.it>
> > http://www.ism.cnr.it/english/staff/mattiolig
> > ResearcherID: F-6308-2012
>
********************************************************
- Article 1 - Men are born and remain
free and equal in rights. Social distinctions
may be founded only on the common good
- Article 2 - The aim of every political association
is the preservation of the natural and
imprescriptible rights of man. These rights are liberty,
property, security, and resistance to oppression.
********************************************************
Giuseppe Mattioli
CNR - ISTITUTO DI STRUTTURA DELLA MATERIA
v. Salaria Km 29,300 - C.P. 10
I 00015 - Monterotondo Stazione (RM), Italy
Tel + 39 06 90672836 - Fax +39 06 90672316
E-mail: <giuseppe.mattioli at ism.cnr.it>
http://www.ism.cnr.it/english/staff/mattiolig
ResearcherID: F-6308-2012