[Q-e-developers] bug image parallelization in PHonon
Thomas Brumme
thomas.brumme at mpsd.mpg.de
Fri Sep 23 16:45:35 CEST 2016
I meanwhile had a discussion with Lorenzo Paulatto about a similar problem.
I think that it might be a rather specific problem. As soon as I parallelize
only over q points using start_q and last_q there is no problem, also for
restarting. Using images I can, in principle, even create the full dvscf
files without having to rerun the calculation without images, by using split
and cat on the different dvscf files in the different temp folders. It's
tedious but it works. Yet, in the future I will use only the parallelization
over q points for the calculation of the dvscf.
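The per-q-point approach can be sketched roughly as follows. This is a minimal shell loop, assuming 8 irreducible q points for the 4x4x4 grid of the aluminum example quoted below; the value of NQ and the file names al.ph.qN.in are illustrative and not taken from the original runs. It only writes one input file per q point, selected with start_q and last_q, and prints the corresponding ph.x command instead of running it.

```shell
#!/bin/sh
# Sketch: one ph.x input per irreducible q point, selected via start_q/last_q.
# NQ=8 is an assumption for the 4x4x4 grid of this example; adjust as needed.
NQ=8
iq=1
while [ "$iq" -le "$NQ" ]; do
  # File name al.ph.qN.in is hypothetical, chosen for illustration only.
  cat > "al.ph.q${iq}.in" <<EOF
Phonons of Al, q point ${iq}
 &inputph
  tr2_ph=1.0d-10,
  prefix='aluminum',
  fildvscf='aldv',
  amass(1)=26.98,
  outdir='./tempdir/',
  fildyn='al.dyn',
  ldisp=.true.,
  nq1=4, nq2=4, nq3=4,
  start_q=${iq},
  last_q=${iq}
 /
EOF
  # Print the command only; submit it in whatever way the cluster requires.
  echo "mpirun -np 2 ph.x < al.ph.q${iq}.in > al.ph.q${iq}.out"
  iq=$((iq + 1))
done
```

Whether the individual jobs can safely share one outdir when run at the same time is not covered here; running them one after another avoids that question.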
In summary, the parallelization for PH is not straightforward and I think
that it might help to store, e.g., the dvscf files for different
representations separately. But Lorenzo mentioned that system administrators
complain if the number of written files is large... It could be helpful if
there were a kind of summary of what can be done using images and what
cannot. For instance, dvscf (and el-ph) does not work if image
parallelization is used, especially if the different representations of one
q point are split across different images. For el-ph the code does not
start, but maybe a similar check can be added for the dvscf files?
Well, or maybe not, I don't know :)
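Until such a check exists in the code, the empty files described in step 3 of the quoted message can at least be spotted by hand before restarting without images. The following is only a sketch: the function name list_empty_q_files is made up for illustration, and it assumes the directory layout from the quoted message (q-point subfolders named PREFIX.q_* inside OUTDIR/_ph0).

```shell
#!/bin/sh
# Sketch: list empty files inside the q-point subfolders of the first image.
# Assumed layout (from the quoted message): OUTDIR/_ph0/PREFIX.q_*/...
# list_empty_q_files is a hypothetical helper, not part of Quantum ESPRESSO.
list_empty_q_files() {
  outdir=$1
  prefix=$2
  # -size 0 selects files that occupy zero blocks, i.e. empty files.
  find "$outdir/_ph0" -path "*/${prefix}.q_*" -type f -size 0 -print
}
```

For the example below it would be called as `list_empty_q_files ./tempdir aluminum`; any path it prints belongs to a q point whose data still lives only in another image's folder.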
On 09/23/2016 04:24 PM, Paolo Giannozzi wrote:
> has anybody any idea? P.
>
> On Wed, Sep 14, 2016 at 1:30 PM, Thomas Brumme
> <thomas.brumme at mpsd.mpg.de> wrote:
>
> Dear all,
>
> I think I found a bug in the image parallelization of PH - or I'm doing
> something wrong. I used version 5.4, but the problem is also there if I
> use the 6.0 beta. Maybe someone remembers my email a few days ago to the
> normal email list concerning the parallelization using the GRID technique -
> the problem I encounter here is essentially the same. As an example, I use
> a modified run_example_1 of the Recover_example directory of PH.
>
> Description of the problem:
>
> 0. (Following the example) I did an scf calculation using 2 CPUs with:
>
> &control
> calculation='scf'
> restart_mode='from_scratch',
> prefix='aluminum',
> pseudo_dir = './',
> outdir='./tempdir/'
> /
> &system
> ibrav= 2, celldm(1) =7.5, nat= 1, ntyp= 1,
> ecutwfc =15.0,
> occupations='smearing', smearing='methfessel-paxton',
> degauss=0.05,
> la2F = .true.,
> /
> &electrons
> conv_thr = 1.0d-8
> mixing_beta = 0.7
> /
> ATOMIC_SPECIES
> Al 26.98 Al.pz-vbc.UPF
> ATOMIC_POSITIONS
> Al 0.00 0.00 0.00
> K_POINTS {automatic}
> 16 16 16 0 0 0
>
>
> 1. I'll do the scf calculation using 2 CPUs and:
>
> &control
> calculation='scf'
> restart_mode='from_scratch',
> prefix='aluminum',
> pseudo_dir = './',
> outdir='./tempdir/'
> /
> &system
> ibrav= 2, celldm(1) =7.5, nat= 1, ntyp= 1,
> ecutwfc =15.0,
> occupations='smearing', smearing='methfessel-paxton',
> degauss=0.05
> /
> &electrons
> conv_thr = 1.0d-8
> mixing_beta = 0.7
> /
> ATOMIC_SPECIES
> Al 26.98 Al.pz-vbc.UPF
> ATOMIC_POSITIONS
> Al 0.00 0.00 0.00
> K_POINTS {automatic}
> 8 8 8 0 0 0
>
>
> 2. I'll do a phonon calculation including storing the dvscf files and
> using images. More specifically I used:
>
> mpirun -np 4 ph.x -ni 2 < al.elph.in
>
> with al.elph.in given by:
>
> Electron-phonon coefficients for Al
> &inputph
> tr2_ph=1.0d-10,
> prefix='aluminum',
> fildvscf='aldv',
> amass(1)=26.98,
> outdir='./tempdir/',
> fildyn='al.dyn',
> ! electron_phonon='interpolated',
> ! el_ph_sigma=0.005,
> ! el_ph_nsigma=10,
> ! recover=.true.
> ! trans=.false.,
> ldisp=.true.
> max_seconds=6,
> nq1=4, nq2=4, nq3=4
> /
>
> I used max_seconds in order to simulate the finite run time we have on
> our HPC. Restarting with recover=.true. works fine... I.e. I used:
>
> Electron-phonon coefficients for Al
> &inputph
> tr2_ph=1.0d-10,
> prefix='aluminum',
> fildvscf='aldv',
> amass(1)=26.98,
> outdir='./tempdir/',
> fildyn='al.dyn',
> ! electron_phonon='interpolated',
> ! el_ph_sigma=0.005,
> ! el_ph_nsigma=10,
> recover=.true.
> ! trans=.false.,
> ldisp=.true.
> max_seconds=6,
> nq1=4, nq2=4, nq3=4
> /
>
>
> 3. Now I want to collect all data using no images:
>
> mpirun -np 2 ph.x < al.elph.in
>
> with the same input file as given in 2.
>
> I'll get the error "Possibly too few bands at point ..." once the code
> wants to recalculate the wave functions for the q points which were
> calculated only on the second image, i.e., for q points 6, 7, and 8.
>
> If I check the charge_density.dat files in the subfolders of the q points
> in the _ph0 directory I find that they're empty. Thus, I copied the q
> subfolders of the second image by hand to the folder of the first image
> using:
>
> cp -r _ph1/aluminum.q_* _ph0/
>
> If I now restart without images, using the input of 2., it works...
> Everything is fine...
>
>
> 4. Now I can also calculate the el-ph parameters using the input:
>
> Electron-phonon coefficients for Al
> &inputph
> tr2_ph=1.0d-10,
> prefix='aluminum',
> fildvscf='aldv',
> amass(1)=26.98,
> outdir='./tempdir/',
> fildyn='al.dyn',
> electron_phonon='interpolated',
> el_ph_sigma=0.005,
> el_ph_nsigma=10,
> ! recover=.true.
> trans=.false.,
> ldisp=.true.
> ! max_seconds=6,
> nq1=4, nq2=4, nq3=4
> /
>
>
> 5. Another problem I encounter is the following... Suppose the run time
> is not enough to finish the el-ph calculations, i.e., instead of the input
> in 4. I use:
>
> Electron-phonon coefficients for Al
> &inputph
> tr2_ph=1.0d-10,
> prefix='aluminum',
> fildvscf='aldv',
> amass(1)=26.98,
> outdir='./tempdir/',
> fildyn='al.dyn',
> electron_phonon='interpolated',
> el_ph_sigma=0.005,
> el_ph_nsigma=10,
> ! recover=.true.
> trans=.false.,
> ldisp=.true.
> max_seconds=6,
> nq1=4, nq2=4, nq3=4
> /
>
> The code will stop at a certain point (in my case the 4th q point). If I
> now restart the calculation using:
>
> Electron-phonon coefficients for Al
> &inputph
> tr2_ph=1.0d-10,
> prefix='aluminum',
> fildvscf='aldv',
> amass(1)=26.98,
> outdir='./tempdir/',
> fildyn='al.dyn',
> electron_phonon='interpolated',
> el_ph_sigma=0.005,
> el_ph_nsigma=10,
> recover=.true.
> trans=.false.,
> ldisp=.true.
> ! max_seconds=6,
> nq1=4, nq2=4, nq3=4
> /
>
> I get (again) the error message "Possibly too few bands at point ..."
> once the code wants to calculate the wave functions for the 4th q point
> (the one at which it stopped before)... All other points are fine...
>
>
> I think that the whole problem is related to the storing of the wave
> functions and the charge density. Maybe I'm doing something really wrong,
> but I don't see any obvious error in the input... Also I don't see any
> input variable for ph which influences the saving of wave functions...
>
> Regards
>
> Thomas
>
> --
> Dr. rer. nat. Thomas Brumme
> Max Planck Institute for the Structure and Dynamics of Matter
> Luruper Chaussee 149
> 22761 Hamburg
>
> Tel: +49 (0)40 8998 6557
>
> email: Thomas.Brumme at mpsd.mpg.de
>
> _______________________________________________
> Q-e-developers mailing list
> Q-e-developers at qe-forge.org
> http://qe-forge.org/mailman/listinfo/q-e-developers
>
>
>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>
>
>
--
Dr. rer. nat. Thomas Brumme
Max Planck Institute for the Structure and Dynamics of Matter
Luruper Chaussee 149
22761 Hamburg
Tel: +49 (0)40 8998 6557
email: Thomas.Brumme at mpsd.mpg.de