[Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
Timrov Iurii
iurii.timrov at epfl.ch
Fri Feb 5 14:07:22 CET 2016
Dear Giuseppe,
I will check for extra allocations and/or a memory leak when I have some time.
> Indeed!!! Only Microsoft Windows requires more memory to do the same thing in newer versions!!!!
Really? And there is no problem on Linux?
Best regards,
Iurii
--
Dr. Iurii Timrov
Postdoctoral Researcher
École Polytechnique Fédérale de Lausanne,
Theory and Simulation of Materials
CH-1015 Lausanne, Switzerland
+41 21 69 34 881
http://people.epfl.ch/265334
________________________________________
From: Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
Sent: Friday, February 5, 2016 1:55 PM
To: Timrov Iurii
Cc: pw_forum at pwscf.org
Subject: Re: Re: Re: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
Dear Iurii
> It seems that this is a RAM issue.
Maybe it is something connected with memory allocation that changed between 5.1.1 and 5.2.1.
> I ran your test case with QE-5.2.1 on my local workstation with 8 cores and 64 GB RAM, and the Lanczos code crashed. When I changed the PWscf input
> so that only the occupied states are computed (the empty states are not needed in the Lanczos calculation), which of course decreased
> RAM requirements, the code didn't crash.
The code didn't crash with 5.1.1 even on my cluster node with 8 cores and 16 GB RAM, and this is not a large benchmark. I used to run larger ones on the
same node, and far larger ones on two nodes of the same machine, with older versions. The problem cannot be due to the overall memory requirements; it
must be due to some problematic memory allocation (a leak somewhere?).
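
A rough estimate illustrates this point. The sketch below assumes 16 bytes per complex double-precision coefficient, and takes the plane-wave count (273425, the "Sum" of PW G-vecs in the Parallelization info of the turboTDDFT output quoted below) and nbnd=75 from the pw.x input. The function name is illustrative, and the figure covers a single set of wavefunctions only, not the other arrays and real-space grids that the code allocates.

```python
def wavefunction_bytes(n_pw: int, n_bands: int, bytes_per_coeff: int = 16) -> int:
    """Bytes needed to hold n_bands wavefunctions of n_pw plane-wave coefficients each."""
    return n_pw * n_bands * bytes_per_coeff


N_PW = 273425   # "Sum" of PW G-vecs reported in the Parallelization info
N_BANDS = 75    # nbnd in the pw.x input

total = wavefunction_bytes(N_PW, N_BANDS)
print(f"one set of wavefunctions: {total / 1024**2:.0f} MiB")
```

One such set is a few hundred MiB, which is small next to 16 GB. The real footprint depends on how many arrays of this size are held at once, so this only bounds the problem from below.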
> If this is indeed the cause of the problem, then it seems strange to me that QE-5.1.1 does not have this problem. Some investigation may be
> warranted.
Indeed!!! Only Microsoft Windows requires more memory to do the same thing in newer versions!!!!
Best
Giuseppe
On Friday, February 05, 2016 10:07:14 AM Timrov Iurii wrote:
> Dear Giuseppe,
>
> It seems that this is a RAM issue.
>
> I ran your test case with QE-5.2.1 on my local workstation with 8 cores and 64 GB RAM, and the Lanczos code crashed. When I changed the PWscf input
> so that only the occupied states are computed (the empty states are not needed in the Lanczos calculation), which of course decreased
> RAM requirements, the code didn't crash.
>
> On the Blue Gene machine you may try to optimize RAM usage too. As you may know, one can allocate all the cores of a node but use only a few of them,
> which increases the RAM available per core. Please note that with Davidson the RAM requirements are much larger still, so it is not easy to
> optimize the job submission script for large systems.
>
> To verify whether you also have a memory issue, you may try decreasing the value of celldm(1), the cutoffs, etc., and see whether the code still crashes.
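
To judge how much a reduced test shrinks the problem: the number of plane waves scales roughly as the cell volume times ecutwfc^(3/2), and for the cubic cell used here (ibrav=1) the volume is celldm(1)^3. A minimal sketch of that scaling follows. The function name and the reduced values are hypothetical, chosen only for illustration.

```python
def relative_pw_count(celldm_new: float, celldm_old: float,
                      ecut_new: float, ecut_old: float) -> float:
    """Approximate ratio of plane-wave counts for a cubic cell.

    Uses N_pw ~ volume * ecut**1.5 with volume = celldm(1)**3.
    """
    return (celldm_new / celldm_old) ** 3 * (ecut_new / ecut_old) ** 1.5


# Values from the pw.x input in this thread: celldm(1)=40, ecutwfc=40.
# The reduced celldm(1)=30 is a hypothetical test value.
ratio = relative_pw_count(30.0, 40.0, 40.0, 40.0)
print(f"plane waves (and wavefunction memory) scale by about {ratio:.2f}")
```

Shrinking the cell is the more effective of the two knobs because of the cubic dependence, provided the molecule still fits with enough vacuum around it.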
>
> If this is indeed the cause of the problem, then it seems strange to me that QE-5.1.1 does not have this problem. Some investigation may be
> warranted.
>
> HTH
>
> Best regards,
> Iurii
>
>
> ________________________________________
> From: Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
> Sent: Thursday, February 4, 2016 5:59 PM
> To: Timrov Iurii
> Cc: pw_forum at pwscf.org
> Subject: Re: Re: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
>
> Silent crash on bluegene with 5.2.1 (I have no time to compile 5.3.0 now. I may try tomorrow if you think it is important).
>
>
> Program turboTDDFT v.5.2.1 starts on 4Feb2016 at 17:56:55
>
> This program is part of the open-source Quantum ESPRESSO suite
> for quantum simulation of materials; please cite
> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
> URL http://www.quantum-espresso.org",
> in publications or presentations arising from this work. More details at
> http://www.quantum-espresso.org/quote
>
> Parallel version (MPI & OpenMP), running on 2048 processor cores
> Number of MPI processes: 512
> Threads/MPI process: 4
> R & G space division: proc/nbgrp/npool/nimage = 512
>
> Reading data from directory:
> /gpfs/scratch/userexternal/gmattiol/test/tddft/run/tmp/././l0-5.3.0.save
>
> Info: using nr1, nr2, nr3 values from input
>
> Info: using nr1, nr2, nr3 values from input
>
> IMPORTANT: XC functional enforced from input :
> Exchange-correlation = SLA PW PBE PBE ( 1 4 3 4 0 0)
> Any further DFT definition will be discarded
> Please, verify this is what you really want
>
>
> Parallelization info
> --------------------
> sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
> Min          78      38      8                12054     4220     492
> Max          80      40     10                12104     4300     550
> Sum       40733   20369   5097              6186431  2186841  273425
> Tot       20367   10185   2549
>
>
> negative rho (up, down): 9.597E-02 0.000E+00
>
> Subspace diagonalization in iterative solution of the eigenvalue problem:
> scalapack distributed-memory algorithm (size of sub-group: 16* 16 procs)
>
>
> Warning: There are virtual states in the input file, trying to disregard in response calculation
>
> Ultrasoft (Vanderbilt) Pseudopotentials
>
> Normal read
>
> Gamma point algorithm
> 2016-02-04 17:57:18.063 (WARN ) [0x40000ee8d50] :7014845:ibm.runjob.client.Job: terminated by signal 6
> 2016-02-04 17:57:18.065 (WARN ) [0x40000ee8d50] :7014845:ibm.runjob.client.Job: abnormal termination by signal 6 from rank 295
>
> On Thursday, February 04, 2016 03:46:10 PM Timrov Iurii wrote:
> > Dear Giuseppe,
> >
> > As far as I understand, the code crashes when it tries to write the "d0psi" vectors to disk. The first thing to do, I think, is to check that you
> > have enough disk space. If this is not the issue, then let's keep looking for the cause.
> >
> > You may want to look at the routine TDDFPT/src/lr_solve_e.f90, lines 110-138, where the code writes the vectors to disk in parallel. Please make
> > sure that "outdir" is the same in PWscf and in Lanczos/Davidson (and don't specify wfcdir). If this does not solve the problem, could you
> > please also report the output of Lanczos/Davidson (preferably Lanczos)?
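
The two checks suggested above (free disk space, and the same "outdir" in both inputs) can be scripted. The following is a minimal sketch, not part of Quantum ESPRESSO: the helper names are hypothetical, and the regular expression only handles simple quoted assignments such as outdir='...'.

```python
import re
import shutil


def namelist_value(text: str, key: str):
    """Return the quoted string assigned to `key` in a namelist-style input, or None."""
    match = re.search(rf"\b{re.escape(key)}\s*=\s*(['\"])(.*?)\1", text)
    return match.group(2) if match else None


def same_outdir(pw_input: str, turbo_input: str) -> bool:
    """True if both inputs set outdir to the same path (ignoring a trailing slash)."""
    a = namelist_value(pw_input, "outdir")
    b = namelist_value(turbo_input, "outdir")
    return a is not None and b is not None and a.rstrip("/") == b.rstrip("/")


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


# Shortened example strings; in practice read the actual input files.
pw_in = "prefix='l0-5.3.0',\n outdir='/scratch/run/tmp/',"
turbo_in = "prefix=\"l0-5.3.0\",\n outdir='/scratch/run/tmp',"
print(same_outdir(pw_in, turbo_in), f"{free_gib('.'):.1f} GiB free here")
```

This is only a string comparison: if the job script copies the .save directory to a node-local scratch directory before running, differing "outdir" strings do not by themselves indicate a problem.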
> >
> > HTH
> >
> > Best regards,
> > Iurii Timrov
> > Post-doctoral researcher
> > THEOS - École Polytechnique Fédérale de Lausanne
> > Lausanne, Switzerland
> >
> >
> > ________________________________________
> > From: pw_forum-bounces at pwscf.org <pw_forum-bounces at pwscf.org> on behalf of Giuseppe Mattioli <giuseppe.mattioli at ism.cnr.it>
> > Sent: Thursday, February 4, 2016 11:34 AM
> > To: pw_forum at pwscf.org
> > Subject: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
> >
> > Dear All
> > I'm having problems when performing nontrivial runs of turbo_davidson.x and turbo_lanczos.x with versions 5.2.1 and 5.3.0 of QE.
> > Let me say first that "trivial" runs (a CH4 molecule with the same pseudopotentials and cutoffs but a smaller 30 a.u.^3 cubic cell) work fine with all
> > the tested versions.
> > However, the input files for a nontrivial case that leads to a crash should run on a decent PC in about 1 hr, so they provide a significant but not
> > huge test. *Note* that if I run the same input files with version 5.1.1 (compiled in the very same environment), everything goes fine (though more
> > slowly)! The 5.3.0 (and 5.2.1) crashes have been reproduced on two different machines (Intel, 8 cores, 16 GB RAM; AMD, 32 cores, 64 GB RAM), so
> > they should not be considered erratic.
> >
> > Here is the pw.x input. The PPs are quite old and can be found in the online library (or provided by me on request).
> >
> > &control
> >
> > calculation = 'scf'
> > restart_mode='from_scratch',
> > prefix='l0-5.3.0',
> > pseudo_dir = '/home/mattioligi/PP_UPF/',
> > outdir='/home/mattioligi/cocat/test_tddft/5.2.1/l0/5.3.0/run/tmp/',
> > nstep=300,
> > max_seconds=80000,
> > disk_io='low',
> > tprnfor=.true.,
> >
> > /
> > &system
> >
> > ibrav=1, celldm(1)=40.000000,
> > nat=42, ntyp=4, nbnd=75,
> > ecutwfc = 40.0,
> > ecutrho = 320.0,
> > nspin=1,
> >
> > /
> > &electrons
> >
> > diagonalization='david',
> > mixing_mode='plain'
> > mixing_beta=0.1
> > conv_thr=1.0d-8
> > electron_maxstep=100
> >
> > /
> > &ions
> > /
> >
> > ATOMIC_SPECIES
> > O 15.999 O_pbe.van.UPF
> > N 14.007 N.pbe-van_bm.UPF
> > C 12.011 C_pbe.van.UPF
> > H 1.008 H_pbe.van.UPF
> > ATOMIC_POSITIONS {angstrom}
> > C 4.815369179 12.355337788 8.111406911
> > C 5.639537337 12.072210478 7.018248617
> > C 6.373883049 10.886794669 6.974735758
> > H 5.707874252 12.778745273 6.179910928
> > C 4.734413944 11.441350355 9.166316558
> > H 4.235443595 13.287281698 8.140567718
> > C 6.304598307 9.977077773 8.041477142
> > H 7.012644682 10.659891408 6.111132336
> > C 5.477180541 10.260422385 9.138835842
> > H 4.092409998 11.653694694 10.031778418
> > H 5.418528381 9.546881383 9.971310698
> > N 7.058612774 8.759574945 8.006208499
> > C 6.384981399 7.544139013 8.340645249
> > C 6.997532612 6.588483316 9.168188787
> > C 5.084708421 7.308024697 7.864810575
> > C 6.325550737 5.410241765 9.493833204
> > H 8.006262126 6.776794433 9.557919083
> > C 4.414663626 6.134355690 8.210976959
> > H 4.597637090 8.055249046 7.224770074
> > C 5.030975670 5.176070562 9.020776666
> > H 6.819890970 4.670618768 10.138154855
> > H 3.397721512 5.964689741 7.832306200
> > H 4.503298572 4.249946635 9.284425745
> > C 8.412602212 8.773905175 7.652414992
> > C 9.197305040 9.938168667 7.841458619
> > C 9.043381168 7.634703664 7.098599788
> > C 10.533008285 9.972397555 7.486007356
> > H 8.740413757 10.828552107 8.290447985
> > C 10.383506998 7.674400214 6.758021800
> > H 8.466388332 6.717306584 6.931252215
> > C 11.175184928 8.838234071 6.927523312
> > H 11.098162573 10.894629696 7.663657304
> > H 10.849606517 6.778483121 6.322529487
> > C 12.554045113 8.768090174 6.529797787
> > C 13.538745611 9.729179498 6.474718127
> > H 12.882286114 7.769870632 6.203237321
> > C 13.338246843 11.096686263 6.810664645
> > N 13.160471613 12.223162736 7.083088078
> > C 14.914360413 9.407055683 6.034105289
> > O 15.832284936 10.221452163 5.956798921
> > O 15.091537629 8.085358800 5.710801225
> > H 16.043983143 8.016066678 5.436328923
> > K_POINTS {gamma}
> >
> > And here are the turbo_lanczos.x and turbo_davidson.x input files:
> >
> > lanczos
> >
> > &lr_input
> >
> > prefix="l0-5.3.0",
> > outdir='/state/partition1/mattioligi/34339',
> > wfcdir='/state/partition1/mattioligi/34339',
> > restart_step=6,
> > restart=.false.
> >
> > /
> > &lr_control
> >
> > itermax=12,
> > ipol=4,
> >
> > /
> >
> > davidson
> >
> > &lr_input
> >
> > prefix="l0-5.3.0",
> > outdir='/state/partition1/mattioligi/34340',
> > restart=.false.
> >
> > /
> > &lr_dav
> >
> > num_eign=2
> > num_init=4
> > num_basis_max=10
> > residue_conv_thr=1.0E-4
> > start=0.1
> > finish=1.5
> > step=0.0002
> > broadening=0.005
> > reference=0.2
> > p_nbnd_occ=5
> > p_nbnd_virt=5
> > poor_of_ram=.false.
> > poor_of_ram2=.false.
> >
> > /
> >
> > In both cases and on both machines the CRASH report is something like
> >
> > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> >
> > task # 1
> > from davcio : error # 20
> > error while writing from file "/state/partition1/mattioligi/34340/l0-5.3.0.d0psi.32"
> >
> > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> >
> > I suppose that it is some kind of I/O error, but I would warmly welcome your opinion...:-)
> > Thank you in advance
> > Giuseppe
> >
> > ********************************************************
> > - Article 1 - Men are born and remain
> > free and equal in rights. Social distinctions
> > may be founded only upon the common good.
> > - Article 2 - The aim of every political association
> > is the preservation of the natural and
> > imprescriptible rights of man. These rights are liberty,
> > property, security, and resistance to oppression.
> > ********************************************************
> >
> > Giuseppe Mattioli
> > CNR - ISTITUTO DI STRUTTURA DELLA MATERIA
> > v. Salaria Km 29,300 - C.P. 10
> > I 00015 - Monterotondo Stazione (RM), Italy
> > Tel + 39 06 90672836 - Fax +39 06 90672316
> > E-mail: <giuseppe.mattioli at ism.cnr.it>
> > http://www.ism.cnr.it/english/staff/mattiolig
> > ResearcherID: F-6308-2012
> >
>