[Q-e-developers] bug in phonon code?

Giovanni Pizzi giovanni.pizzi at epfl.ch
Tue Jan 15 00:50:46 CET 2013


Dear QE developers,
I believe that I managed to narrow down the problem of my mail of last month.
A detailed description follows below.
The short summary is that I believe that there is a subtle case that may occur when restarting (see below), and when it occurs, wrong data is stored in the dynamical matrix.
To fix it, probably someone who is expert of the internals of the phonon code is needed… (Andrea Dal Corso?)


Description of the problem:
* the summary of my mail of last month is the following: there was a phonon calculation of a cubic system, but for which no symmetries were recognized by QE. A 2x2x2 full q-grid was therefore calculated; however for one of the q points we were getting wrong results (some negative and some very large frequencies), and in particular very different from those expected from symmetry arguments from the equivalent q points
* After my last mail, I tried to rerun the calculation with same machine, executable, and number of nodes, but only for the 'wrong' q-point. The results were this time correct (also from comparison from the other q points).
* I thus compared the dynamical matrices of the two runs from the two matdyn files. It turned out that (apart small numerical differences) all differences were only for atom 37 (don't consider the units of the color map; to get an idea of the magnitude of this problem, take into account that, in the unit of the matdyn file, there were some matrix elements of approx. 1. instead of values of approx 0.01).
* I then checked the restarts in the 'wrong' run. Indeed, there was a restart for that specific q point, at representation #111 (note that 111=37*3). The end of the first run was:

     Representation #111 mode # 111

     Self-consistent Calculation

      iter #   1 total cpu time : 81737.1 secs   av.it<http://av.it>.:   8.9
      thresh= 0.100E-01 alpha_mix =  0.700 |ddv_scf|^2 =  0.305E-06

      iter #   2 total cpu time : 81800.9 secs   av.it<http://av.it>.:  21.4
      thresh= 0.552E-04 alpha_mix =  0.700 |ddv_scf|^2 =  0.127E-05

      iter #   3 total cpu time : 81861.0 secs   av.it<http://av.it>.:  19.2
      thresh= 0.113E-03 alpha_mix =  0.700 |ddv_scf|^2 =  0.975E-07

[…]


      iter #  17 total cpu time : 82715.5 secs   av.it<http://av.it>.:  19.6
      thresh= 0.589E-08 alpha_mix =  0.700 |ddv_scf|^2 =  0.118E-14

      iter #  18 total cpu time : 82776.1 secs   av.it<http://av.it>.:  19.6
      thresh= 0.343E-08 alpha_mix =  0.700 |ddv_scf|^2 =  0.304E-15

      iter #  19 total cpu time : 82837.2 secs   av.it<http://av.it>.:  19.9
      thresh= 0.174E-08 alpha_mix =  0.700 |ddv_scf|^2 =  0.596E-16

     Maximum CPU time exceeded

     max_seconds     =   82800.00
     elapsed seconds =   82814.09


while at the beginning of the restart we got (I just report the relevant parts of the output)
[…]
     Representation   108      1 modes -  Done

     Representation   109      1 modes -  Done

     Representation   110      1 modes -  Done

     Representation   111      1 modes - To be done

     Representation   112      1 modes - To be done

     Representation   113      1 modes - To be done

[…]

     Representation #111 mode # 111

     Self-consistent Calculation

     End of self-consistent calculation

     Convergence has been achieved

i.e., not even a step was performed. My guess is that the convergence was achieved in the previous run, but the code did not have the time to realize it before stopping. At the restart, it immediately realized that convergence was already achieved and it stopped, but it apparently stored wrong data.
I thus think that one should check this part of the restart to understand why this can be the cause of a problem.

As a further hint that this may be the problem, I looked in the database of all our calculations for other phonon calculations with a similar behavior, i.e., the end of the self-consistent calculation without even a single step.
It turned out that there was another calculation with this property, and also that one had a similar behavior: wrong frequencies at the q-point for which the restart happened (see the resulting interpolated band plot 'band_plot_wrong.pdf': since one point got very negative values + very large values, the interpolation goes crazy).

I hope that this is of some help for solving the problem… I wait for comments!

Best,

Giovanni




[cid:2860e64d-21ac-4d9b-b69a-9434b3d70f00 at intranet.epfl.ch]

--
Giovanni Pizzi
Post-doctoral Research Scientist
EPFL STI IMX THEOS
MXC 319 (Bâtiment MXC)
Station 12
CH-1015 Lausanne (Switzerland)
Phone: +41 21 69 31159



On 21 Dec 2012, at 2:22 PM, Stefano de Gironcoli wrote:

Dear Giovanni and Andrea,

    ok.. i dont know what the problem could be but let´s see if it is
possible to pin it down ...

    when there is no symmetry the code should do only trivial summation
and so if k points are equivalent by symmetry they should be equivalent
contribution so it is strange...
     one possibility is that there is a problem in recovering the
calculation from interrupted run...
     this should /might change if you repeat the calculation with
different number of  processors or aon a different machine... have you
done that ?
    your cell is not the smallest possibly one.. i guess a calculation
with just 5 atom per cell and a k point mesh doubled in each direction
should be equivalent to the one you have been doing...
    but should be such that maybe the entire calculation fit in one
single run ....
    have you tried ? is the result the same or it respect the expected
symmetry ?

stefano

On 12/20/2012 05:29 PM, Giovanni Pizzi wrote:
Dear Stefano,
thanks a lot for your reply.

As a follow up on our previous email (later below you can find our answers):
* the first of the two problems (i.e. the code stopping with Error in
routine set_irr_sym (1422): wrong representation) seems to have been
corrected in the most recent SVN, that doesn't stop anymore. Maybe
someone can confirm that modifications to the code in this direction
have been done.
* the problem of different phonon frequencies for the x, y, and z
directions remains. Just to be sure that it wasn't a problem with the
cluster (say I/O, communication, etc.) we have submitted again the
calculation for the 'tricky' q point (q #5), both with 5.0.1 and the svn
version. We will report back when the calculation finishes...

Below follow our answers.

On 12/19/2012 11:04 AM, Stefano de Gironcoli wrote:
Dear Giovanni, dear Andrea

     I'm looking into your calculations...
     in particular 01_vcrelax and 02_vcrelax aida.out
     inspecting the structure with xcrysdens it does not look to me
that the structure has cubic symmetry... or at least Ti atoms appear
not to be aligned in neighboring cells...
Indeed. The Ti atoms are not all aligned in our system. Nonetheless, the
system still has cubic symmetry. To have a rough idea of why this is
happening, it is because 4 of the Ti atoms displace toward a given Ba
atom w.r.t. a perfect cubic perovskite structure, remaining however at
the edge of a tetrahedron (the other four Ti atoms displace in opposite
directions, forming tetrahedra around Ba atoms in neighboring cells).
     in the spglib code you use to recognize the symmetry is there a
threshold in the acceptance of the symmetry operations ?
Yes, there is. The symmetry of our initial system (input of 01_) is
exactly cubic (I-43M, group 217) down to a precision of 10^-12.
After relaxation, the symmetry remains the same (cubic, group 217) with
a precision down to 10^-6; for tighter thresholds, the symmetry is
recognized as rhombohedral (R3M, group 160). This is because pw.x finds
only a subset of the symmetries of the cubic group, and only those are
exactly preserved during the vc-relax. Anyhow, also in the rhombohedral
symmetry the three axes x, y, z can be interchanged and thus we would
expect the same physical properties along x, y, z. (We have also checked
explicitly the coordinates of the output cell of calculation 02_, and
the system obeys the symmetry x<->y<->z exactly, for all digits printed
out in the output file).

     did you relaxed the structure after translating it ? if there is
some symmetry breaking that want to  develop now that you removed the
symmetry you should allow for it.
No, we actually didn't (also because without symmetries the system will
relax to the rhombohedral phase, stable at T=0).
Anyway, we believe that this is not the main issue of the problem we are
facing: if the cubic phase is unstable, we may get negative phonon
frequencies, e.g. at Gamma, but this is not a problem.
However, in any case, even if QE doesn't recognize the symmetries, we
know that the system has cubic or at least rhombohedral symmetry [we
just shifted the system from one with cubic or rhombohedral
coordinates], and thus we expect that the phonon frequencies that we
calculate for a k point along k_x should have the same frequencies of
those calculated along k_y or k_z, but this is not happening.
     are you sure the system insulating... ? if you mistreat a metal as
an insulator results are unpredictable.
We re-checked now and the system is indeed insulating (gap almost 2 eV).

Best,
Andrea and Giovanni
stefano


Quoting Giovanni Pizzi <giovanni.pizzi at epfl.ch<mailto:giovanni.pizzi at epfl.ch>>:

Dear developers,
we are experiencing a problem with the phonon code. We think that
this may be a serious bug.
Here a short description of the problem; below, a more detailed
description follows. Attached, all relevant inputs and outputs of
the 5 calculations of interest.

Short description:
* We have a 40-atom cell with cubic cell (and cubic space group).
pw.x recognizes 6 symmetries (folder 01_... for the vc-relax, and
02_... for the restart from that vc-relax). However, then when ph.x
starts (folder 03_...), it crashes with "Error in routine
set_irr_sym (1422): wrong representation".
* As a workaround, we slightly translate the system (folder 04_...).
Then, pw.x just finds one symmetry (the identity), and subsequently
ph.x starts and computes the phonons. We ask for the phonons on a
2x2x2 q-mesh, and since no symmetries are found, all 8 qpoints are
evaluated (folder 05_...).
* Since the system has cubic symmetry (even if pw/ph don't detect
it), we expect that the frequencies found on equivalent q-points
should be the same. This should be the case for the following q
points (see folder 05_...):
        N         xq(1)         xq(2)         xq(3)
        2   0.000000000   0.000000000  -0.061601673
        3   0.000000000  -0.061601673   0.000000000
        5  -0.061601673   0.000000000   0.000000000

Instead, while the frequencies of qpoints 2 and 3 are similar within
1 cm^-1, the frequencies at qpoint 5 are quite off (there is a
negative frequency -928.641695 [cm-1], and also a new maximum-energy
frequency of 4496.358033 [cm-1], while q points 2 and 3 have
frequencies in the range 70-740 [cm-1].

Could anyone help us in finding where the bug lies?

Best,
Andrea Cepellotti and Giovanni Pizzi
THEOS (Theory and Simulation of Materials), EPFL, Switzerland


P.S.: longer description of what we did with some more details:
the attached zip has 5 folders, starting with an incremental number
01_, 02_, ...
* The first calculation (01_) is a vc-relax of the cell.
* The second calculation (02_) is a vc-relax, restarted from the
coordinates of 01_ (the volume changed significantly during the
vc-relax and we had to restart the calculation with the
newly-calculated G-vectors)
Both these cells have a cubic cell (defined with ibrav=0) and a
cubic space group: I-43m (217), checked using the 'spglib' code.
* we start a phonon calculation (03_) from the results of 02_: the
calculation crashes as described above.
* we take the relaxed coordinates of job 02_ and translate them by
the vector (0.01, 0.02, 0.03) angstrom. With these new coordinates,
we perform an scf calculation (job 04_).
Note that also this system is correctly recognized by the spglib
code to have the same cubic space group of before: I-43m (217)
* we perform a phonon calculation (job 05_) from the scf of job 04_.
Due to queue limitations on our supercomputer, the calculation was
run at steps of 23 hours (and stopped using the max_seconds flag)
and then restarted multiple times. The output file of the first run
is called 'aida.out', the subsequent restarts are called
'aida-recover-N.out' with N ranging from 1 to 12.

_______________________________________________
Q-e-developers mailing list
Q-e-developers at qe-forge.org<mailto:Q-e-developers at qe-forge.org>
http://qe-forge.org/mailman/listinfo/q-e-developers


_______________________________________________
Q-e-developers mailing list
Q-e-developers at qe-forge.org<mailto:Q-e-developers at qe-forge.org>
http://qe-forge.org/mailman/listinfo/q-e-developers

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.quantum-espresso.org/pipermail/developers/attachments/20130114/4a73243a/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: difference_in_dynmatrices_459_776.png
Type: image/png
Size: 41026 bytes
Desc: difference_in_dynmatrices_459_776.png
URL: <http://lists.quantum-espresso.org/pipermail/developers/attachments/20130114/4a73243a/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: band_plot_wrong.pdf
Type: application/pdf
Size: 40444 bytes
Desc: band_plot_wrong.pdf
URL: <http://lists.quantum-espresso.org/pipermail/developers/attachments/20130114/4a73243a/attachment.pdf>


More information about the developers mailing list