[Pw_forum] problem with DFT+U

Sergi Vela sergi.vela at gmail.com
Thu Dec 1 16:48:12 CET 2016


Dear Paolo,

I have some more details on the problem with DFT+U. The problem arises from
underflows somewhere in the QE code, which lead to the MPI_Bcast error
described in previous emails. The attached input crashes systematically in,
at least, versions 5.1.1, 5.2, 5.4 and 6.0.

According to the support team of HPC-GRNET, the problem is not related to
MPI (it occurs with both IntelMPI and OpenMPI, in various versions of each),
nor to the BLAS libraries (MKL, OpenBLAS). With Intel compilers, the flag
"-fp-model precise" seems to be necessary (at least for 5.2 and 5.4).
Executables built with GNU compilers, in turn, work. They also report the
underflow (a message appears in the job file after completion), but they
seem able to handle it.
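For readers less familiar with the term, here is a minimal Python sketch of
what "underflow" means for double-precision numbers. It is not QE code and
says nothing about where the underflow happens inside QE; the helper name
`classify` is hypothetical and only serves the illustration.

```python
import sys

def classify(x: float) -> str:
    """Hypothetical helper: label a double as 'zero', 'subnormal' or 'normal'."""
    if x == 0.0:
        return "zero"
    if abs(x) < sys.float_info.min:  # below the smallest normal double (~2.2e-308)
        return "subnormal"
    return "normal"

normal_product = 1e-100 * 1e-100     # about 1e-200, still a normal double
gradual_underflow = 1e-160 * 1e-160  # about 1e-320, only representable as a subnormal
full_underflow = 1e-200 * 1e-200     # 1e-400 is out of range, so the result is 0.0

for value in (normal_product, gradual_underflow, full_underflow):
    print(value, classify(value))
```

How subnormal results are treated (kept, or flushed to zero) can depend on
the compiler and its floating-point options, which would be one way for the
same source to behave differently under different builds.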

The attached input is just an example. Many other jobs on different systems
have failed, whereas closely related inputs have run without any problem. I
have the impression that the underflow does not always occur or, at least,
is not always severe enough to crash the job.
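This intermittent behaviour would at least be compatible with the mechanism
Paolo mentions in the 2015 message quoted below (different processes
following different execution paths), although nothing in this thread
confirms it. The following Python sketch only simulates two "ranks" with
ordinary lists; it is not QE or MPI code, and the names `rank_sum` and
`needs_another_step` as well as the threshold are made up for illustration.

```python
def rank_sum(values):
    """Plain left-to-right sum, standing in for a per-process computation."""
    total = 0.0
    for v in values:
        total += v
    return total

def needs_another_step(estimate, threshold=0.5):
    """Hypothetical test deciding whether more work (and more collectives) follows."""
    return estimate > threshold

# The same three terms, accumulated in a different order on each "rank".
rank0_estimate = rank_sum([1e17, 1.0, -1e17])  # 1e17 + 1.0 rounds back to 1e17, sum is 0.0
rank1_estimate = rank_sum([1e17, -1e17, 1.0])  # 1e17 - 1e17 is exactly 0.0, sum is 1.0

# Unsafe: each rank decides locally, so the two ranks can disagree.
local_decisions = [needs_another_step(rank0_estimate),
                   needs_another_step(rank1_estimate)]

# Safer: rank 0 decides and every rank adopts that decision
# (in a real MPI code this would be a broadcast of the logical flag).
root_decision = needs_another_step(rank0_estimate)
shared_decisions = [root_decision, root_decision]

print(local_decisions, shared_decisions)
```

If one rank takes an extra step that contains a broadcast and another rank
does not, the collective calls no longer match up, which is the kind of
situation that can end in an MPI_Bcast failure.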

Right now I'm extensively using version 5.1.1 compiled with the GNU/4.9
compiler, and it seems to work well.

That's all the info I can give you about the problem. I hope it may
eventually help.

Bests,
Sergi



2016-11-23 16:13 GMT+01:00 Sergi Vela <sergi.vela at gmail.com>:

> Dear Paolo,
>
> Unfortunately, there's not much to report so far. Many "relax" jobs for a
> system of ca. 500 atoms (including Fe) fail with the same message Davide
> reported a long time ago:
> _________________
>
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(2434)........: MPI_Bcast(buf=0x8b25e30, count=7220,
> MPI_DOUBLE_PRECISION, root=0, comm=0x84000007) failed
> MPIR_Bcast_impl(1807)...:
> MPIR_Bcast(1835)........:
> I_MPIR_Bcast_intra(2016): Failure during collective
> MPIR_Bcast_intra(1665)..: Failure during collective
> _________________
>
> It only occurs on some architectures. The same inputs work for me on 2
> other machines, so it seems to be related to the compilation. The support
> team of the HPC center I'm working at is trying to identify the problem.
> It also seems to occur randomly, in the sense that some DFT+U calculations
> of the same type (same cutoffs, pp's, system) run with no problem at all.
>
> I'll try to be more helpful next time, and I'll keep you updated.
>
> Bests,
> Sergi
>
> 2016-11-23 15:21 GMT+01:00 Paolo Giannozzi <p.giannozzi at gmail.com>:
>
>> Thank you, but unless an example demonstrating the problem is provided,
>> or at least some information on where this message comes from is supplied,
>> there is close to nothing that can be done.
>>
>> Paolo
>>
>> On Wed, Nov 23, 2016 at 10:05 AM, Sergi Vela <sergi.vela at gmail.com>
>> wrote:
>>
>>> Dear Colleagues,
>>>
>>> Just to report that I'm having exactly the same problem with DFT+U. The
>>> same message appears randomly, and only when I use the Hubbard term. I
>>> tested versions 5.2 and 6.0, and it occurs in both.
>>>
>>> All my best,
>>> Sergi
>>>
>>> 2015-07-16 18:43 GMT+02:00 Paolo Giannozzi <p.giannozzi at gmail.com>:
>>>
>>>> There are many well-known problems of DFT+U, but none that is known to
>>>> crash jobs with an obscure message.
>>>>
>>>>> Rank 21 [Thu Jul 16 15:51:04 2015] [c4-2c0s15n2] Fatal error in
>>>>> PMPI_Bcast: Message truncated, error stack:
>>>>> PMPI_Bcast(1615)..................: MPI_Bcast(buf=0x75265e0,
>>>>> count=160, MPI_DOUBLE_PRECISION, root=0, comm=0xc4000000) failed
>>>>>
>>>>
>>>> This signals a mismatch between what is sent and what is received in a
>>>> broadcast operation. This may be due to an obvious bug, which however
>>>> should show up at the first iteration, not after XX. Apart from compiler
>>>> or MPI library bugs, another possible reason is the one described in
>>>> sec. 8.3 of the developer manual: different processes following different
>>>> execution paths. From time to time, cases like this are found (the latest
>>>> occurrence was in the band parallelization of exact exchange) and easily
>>>> fixed. Unfortunately, finding them (that is, where this happens) typically
>>>> requires painstaking parallel debugging.
>>>>
>>>> Paolo
>>>> --
>>>> Paolo Giannozzi, Dept. Chemistry&Physics&Environment,
>>>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>>>> Phone +39-0432-558216, fax +39-0432-558222
>>>>
>>>> _______________________________________________
>>>> Pw_forum mailing list
>>>> Pw_forum at pwscf.org
>>>> http://pwscf.org/mailman/listinfo/pw_forum
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>> Phone +39-0432-558216, fax +39-0432-558222
>>
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Not_Working.input
Type: application/octet-stream
Size: 27650 bytes
Desc: not available
URL: <http://lists.quantum-espresso.org/pipermail/users/attachments/20161201/335bce8d/attachment.obj>

