[Pw_forum] problem with DFT+U

Sergi Vela sergi.vela at gmail.com
Fri Dec 2 09:47:55 CET 2016


Yes, I agree that it is an issue related to the compilation. I said that in
my second email, of the 23rd of November. Still, I think it is worth having
this problem reported and "solved" (at least in practice) on the forum. I
ran into this error at an HPC center where the code was compiled by
specialists, so other users may experience it as well.

2016-12-01 18:43 GMT+01:00 Paolo Giannozzi <p.giannozzi at gmail.com>:

> "underflows"? They should never be a problem, unless you instruct the
> compiler (by activating some obscure flag) to catch them.
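For what it is worth, here is a minimal C sketch of what "catching" an
underflow means (my own toy example, glibc/Linux assumed, not code from QE):
by default the multiplication below just produces a subnormal number, but if
floating-point exception trapping is enabled, e.g. via glibc's
feenableexcept, the same operation raises SIGFPE and kills the process.

    #define _GNU_SOURCE
    #include <fenv.h>    /* feenableexcept, FE_UNDERFLOW (glibc extension) */
    #include <stdio.h>

    int main(void)
    {
        volatile double tiny = 1e-308;

        /* Default IEEE behaviour: gradual underflow, result is subnormal. */
        printf("untrapped: %g\n", tiny * 1e-10);

        /* Trap underflow exceptions (what an "obscure flag" may do). */
        feenableexcept(FE_UNDERFLOW);

        /* The same operation now delivers SIGFPE instead of a subnormal. */
        printf("trapped:   %g\n", tiny * 1e-10);
        return 0;
    }

Compiled with something like "gcc underflow.c -lm", the first printf prints a
tiny subnormal value; the second line is never reached because the trapped
underflow aborts the program.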
>
> Paolo
>
> On Thu, Dec 1, 2016 at 4:48 PM, Sergi Vela <sergi.vela at gmail.com> wrote:
>
>> Dear Paolo,
>>
>> I have some more details on the problem with DFT+U. The problem arises
>> from underflows somewhere in the QE code, hence the MPI_Bcast message
>> described in previous emails. A systematic crash occurs for the attached
>> input, at least in versions 5.1.1, 5.2, 5.4 and 6.0.
>>
>> According to the support team of HPC-GRNET, the problem is not related
>> to MPI (whether IntelMPI or OpenMPI; various versions of both were
>> tried), and it is not related to the BLAS libraries (MKL, OpenBLAS). For
>> the Intel compilers, the flag "-fp-model precise" seems to be necessary
>> (at least for 5.2 and 5.4). The GNU compilers, in turn, work: they also
>> notice the underflow (a message appears in the job file after
>> completion), but they seem to handle it.
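As a side note, one plausible mechanism behind the compiler and flag
dependence (this is my assumption, not something the support team confirmed)
is the flush-to-zero (FTZ) mode that aggressive floating-point models may
set in the SSE control register: with FTZ on, a result that would underflow
is flushed to exactly zero instead of becoming a subnormal, which can change
subsequent results. A small C sketch of the difference (x86-64 assumed):

    #include <stdio.h>
    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (SSE intrinsics) */

    int main(void)
    {
        volatile double tiny = 1e-308;

        /* IEEE default: gradual underflow to a subnormal (~1e-318). */
        printf("FTZ off: %g\n", tiny * 1e-10);

        /* Flush-to-zero: the same product becomes exactly 0. */
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        printf("FTZ on:  %g\n", tiny * 1e-10);
        return 0;
    }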
>>
>> The attached input is just an example. Many other jobs on different
>> systems have failed, whereas other closely related inputs have run
>> without any problem. I have the impression that the underflow does not
>> always occur or, at least, is not always enough to crash the job.
>>
>> Right now I am extensively using version 5.1.1 compiled with the GNU 4.9
>> compiler, and it seems to work well.
>>
>> That's all the info I can give you about the problem. I hope it may
>> eventually help.
>>
>> Bests,
>> Sergi
>>
>>
>>
>> 2016-11-23 16:13 GMT+01:00 Sergi Vela <sergi.vela at gmail.com>:
>>
>>> Dear Paolo,
>>>
>>> Unfortunately, there is not much to report so far. Many "relax" jobs
>>> for a system of ca. 500 atoms (including Fe) fail, giving the same
>>> message Davide reported a long time ago:
>>> _________________
>>>
>>> Fatal error in PMPI_Bcast: Other MPI error, error stack:
>>> PMPI_Bcast(2434)........: MPI_Bcast(buf=0x8b25e30, count=7220,
>>> MPI_DOUBLE_PRECISION, root=0, comm=0x84000007) failed
>>> MPIR_Bcast_impl(1807)...:
>>> MPIR_Bcast(1835)........:
>>> I_MPIR_Bcast_intra(2016): Failure during collective
>>> MPIR_Bcast_intra(1665)..: Failure during collective
>>> _________________
>>>
>>> It only occurs on some architectures. The same inputs work for me on
>>> two other machines, so it seems to be related to the compilation. The
>>> support team of the HPC center I am working at is trying to identify
>>> the problem. It also seems to occur randomly, in the sense that for
>>> some DFT+U calculations of the same type (same cutoffs,
>>> pseudopotentials, system) there is no problem at all.
>>>
>>> I'll try to be more helpful next time, and I'll keep you updated.
>>>
>>> Bests,
>>> Sergi
>>>
>>> 2016-11-23 15:21 GMT+01:00 Paolo Giannozzi <p.giannozzi at gmail.com>:
>>>
>>>> Thank you, but unless an example demonstrating the problem is
>>>> provided, or at least some information on where this message comes
>>>> from is supplied, there is close to nothing that can be done.
>>>>
>>>> Paolo
>>>>
>>>> On Wed, Nov 23, 2016 at 10:05 AM, Sergi Vela <sergi.vela at gmail.com>
>>>> wrote:
>>>>
>>>>> Dear Colleagues,
>>>>>
>>>>> Just to report that I am having exactly the same problem with DFT+U.
>>>>> The same message appears randomly, and only when I use the Hubbard
>>>>> term. I was able to test versions 5.2 and 6.0, and it occurs in both.
>>>>>
>>>>> All my best,
>>>>> Sergi
>>>>>
>>>>> 2015-07-16 18:43 GMT+02:00 Paolo Giannozzi <p.giannozzi at gmail.com>:
>>>>>
>>>>>> There are many well-known problems with DFT+U, but none that is
>>>>>> known to crash jobs with an obscure message.
>>>>>>
>>>>>> Rank 21 [Thu Jul 16 15:51:04 2015] [c4-2c0s15n2] Fatal error in
>>>>>>> PMPI_Bcast: Message truncated, error stack:
>>>>>>> PMPI_Bcast(1615)..................: MPI_Bcast(buf=0x75265e0,
>>>>>>> count=160, MPI_DOUBLE_PRECISION, root=0, comm=0xc4000000) failed
>>>>>>>
>>>>>>
>>>>>> This signals a mismatch between what is sent and what is received in
>>>>>> a broadcast operation. It may be due to an obvious bug, which,
>>>>>> however, should show up at the first iteration, not after XX. Apart
>>>>>> from compiler or MPI library bugs, another cause is the one described
>>>>>> in Sec. 8.3 of the developer manual: different processes following
>>>>>> different execution paths. From time to time, cases like this are
>>>>>> found (the latest occurrence was in the band parallelization of exact
>>>>>> exchange) and easily fixed. Unfortunately, finding them (that is,
>>>>>> finding where this happens) typically requires painstaking parallel
>>>>>> debugging.
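To make this failure mode concrete, here is a minimal MPI sketch in C (my
own toy example, not code from QE): if the ranks take different execution
paths and end up calling the collective with different counts, the library
reports exactly this kind of "Message truncated" / failed MPI_Bcast error.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[160];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Divergent execution paths: rank 0 decides to broadcast 160
           doubles, while every other rank expects only 80.  In a real
           code the divergence would come from a branch that evaluates
           differently on different processes. */
        int count = (rank == 0) ? 160 : 80;

        /* Mismatched collective: typically fails with "Message truncated"
           or an error stack much like the ones quoted above. */
        MPI_Bcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and run on two or more processes, this reproduces the
general shape of the error, even though the root cause in QE still has to
be located.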
>>>>>>
>>>>>> Paolo
>>>>>> --
>>>>>> Paolo Giannozzi, Dept. Chemistry&Physics&Environment,
>>>>>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>>>>>> Phone +39-0432-558216, fax +39-0432-558222
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
>>>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>>>> Phone +39-0432-558216, fax +39-0432-558222
>>>>
>>>>
>>>
>>>
>>
>
>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>
>