[Pw_forum] Possible bug in QE 5.3.0 band group parallelization
Taylor Barnes
tbarnes at lbl.gov
Wed Jan 27 08:35:59 CET 2016
Hi Paolo,
Thanks very much for this. Just to clarify:
Version 5.2.0, no band parallelization - works
Version 5.2.0, with band parallelization - works
Version 5.3.0, new call to mp_start_diag, no band parallelization - works
Version 5.3.0, old call to mp_start_diag, no band parallelization - works
Version 5.3.0, new call to mp_start_diag, with band parallelization - fails
with "Duplicate ranks in rank array" errors
Version 5.3.0, old call to mp_start_diag, with band parallelization - fails
with "problems computing cholesky" errors
All of these tests are performed on a Cray XC.
Best,
Taylor
On Tue, Jan 26, 2016 at 3:23 AM, Paolo Giannozzi <p.giannozzi at gmail.com>
wrote:
> Recent changes to the way band parallelization is performed seem to be
> incompatible with Scalapack. The problem is related to the obscure hacks
> needed to convince Scalapack to work in a subgroup of processors. If you
> revert to the previous way of setting linear-algebra parallelization,
> things should work (or not work) as before, so the latter problem you
> mention may have other origins. You should verify if you manage to run
> - with the new version, old call to mp_start_diag, no band parallelization
> - with an old version, with or without band parallelization
> BEWARE: all versions < 5.3 use an incorrect definition of B3LYP, leading
> to small but non-negligible discrepancies with the results of other codes
>
> Paolo
>
> On Tue, Jan 26, 2016 at 12:53 AM, Taylor Barnes <tbarnes at lbl.gov> wrote:
>
>> Dear All,
>>
>> I have found that calculations involving band group parallelism that
>> worked correctly using QE 5.2.0 produce errors in version 5.3.0 (see below
>> for an example input file). In particular, when I run a PBE0 calculation
>> with either nbgrp or ndiag set to 1, everything runs correctly; however,
>> when I run a calculation with both nbgrp and ndiag set greater than 1, the
>> calculation immediately fails with the following error messages:
>>
>> Rank 48 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n2] Fatal error in
>> PMPI_Group_incl: Invalid rank, error stack:
>> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
>> ranks=0x53a3c80, new_group=0x7fffffff6794) failed
>> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
>> 12, has value 0 which is also the value at index 0
>> Rank 93 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n3] Fatal error in
>> PMPI_Group_incl: Invalid rank, error stack:
>> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
>> ranks=0x538fdf0, new_group=0x7fffffff6794) failed
>> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
>> 12, has value 0 which is also the value at index 0
>> etc...
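>>
>> For what it's worth, the MPI standard requires the rank list passed to
>> MPI_Group_incl to contain distinct, valid ranks, so the message above is
>> the library rejecting a rank array with repeated entries. A minimal
>> standalone reproducer of that failure mode (just an illustration, not QE
>> code) would be:
>>
>>   program duplicate_ranks
>>     use mpi
>>     implicit none
>>     integer :: ierr, world_group, new_group
>>     integer :: ranks(2)
>>     call MPI_Init(ierr)
>>     call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)
>>     ranks = (/ 0, 0 /)   ! duplicate entry: invalid for MPI_Group_incl
>>     ! with the default MPI_ERRORS_ARE_FATAL handler this aborts with a
>>     ! "Duplicate ranks in rank array" error, as in the output above
>>     call MPI_Group_incl(world_group, 2, ranks, new_group, ierr)
>>     call MPI_Finalize(ierr)
>>   end program duplicate_ranks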
>>
>> The error is apparently related to a change in Modules/mp_global.f90
>> on line 80. Here, the line previously read:
>>
>> CALL mp_start_diag ( ndiag_, intra_BGRP_comm )
>>
>> In QE 5.3.0, this has been changed to:
>>
>> CALL mp_start_diag ( ndiag_, intra_POOL_comm )
>>
>> The call using intra_BGRP_comm still exists in version 5.3.0 of the
>> code, but is commented out, and the surrounding comments indicate that it
>> should be possible to switch back to the old parallelization by
>> commenting/uncommenting as desired. When I do this, I find that instead of
>> the error messages described above, I get the following error message:
>>
>> Error in routine cdiaghg(193):
>> problems computing cholesky
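>>
>> My (possibly naive) mental model of what both calls ultimately do is the
>> following sketch, not the actual mp_start_diag implementation: choose ndiag
>> processes out of whatever parent communicator is passed in, and build the
>> ScaLAPACK communicator from them with MPI_Group_incl / MPI_Comm_create. The
>> names below (parent_comm, n_diag, diag_comm) are placeholders for the
>> illustration only:
>>
>>   subroutine make_diag_comm( parent_comm, n_diag, diag_comm )
>>     use mpi
>>     implicit none
>>     integer, intent(in)  :: parent_comm, n_diag
>>     integer, intent(out) :: diag_comm
>>     integer :: parent_group, diag_group, ierr, i
>>     integer, allocatable :: ranks(:)
>>     ! take the first n_diag ranks of the parent communicator ...
>>     allocate( ranks(n_diag) )
>>     do i = 1, n_diag
>>        ranks(i) = i - 1
>>     end do
>>     ! ... and turn them into a sub-communicator for the dense diagonalization
>>     call MPI_Comm_group( parent_comm, parent_group, ierr )
>>     call MPI_Group_incl( parent_group, n_diag, ranks, diag_group, ierr )
>>     call MPI_Comm_create( parent_comm, diag_group, diag_comm, ierr )
>>     deallocate( ranks )
>>   end subroutine make_diag_comm
>>
>> If that picture is roughly right, the communicator passed in
>> (intra_BGRP_comm vs. intra_POOL_comm) determines which ranks end up in the
>> rank array, so it seems plausible that the two choices fail in different
>> ways when nbgrp > 1.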
>>
>> Am I missing something, or are these errors the result of a bug?
>>
>> Best Regards,
>>
>> Dr. Taylor Barnes,
>> Lawrence Berkeley National Laboratory
>>
>>
>> =================
>> Run Command:
>> =================
>>
>> srun -n 96 pw.x -nbgrp 4 -in input > input.out
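>>
>> For reference, one of the combinations that does run correctly for me, with
>> the dense diagonalization restricted to a single process, is the same
>> command with ndiag forced to 1:
>>
>> srun -n 96 pw.x -nbgrp 4 -ndiag 1 -in input > input.out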
>>
>>
>>
>> =================
>> Input File:
>> =================
>>
>> &control
>> prefix = 'water'
>> calculation = 'scf'
>> restart_mode = 'from_scratch'
>> wf_collect = .true.
>> disk_io = 'none'
>> tstress = .false.
>> tprnfor = .false.
>> outdir = './'
>> wfcdir = './'
>> pseudo_dir = '/global/homes/t/tabarnes/espresso/pseudo'
>> /
>> &system
>> ibrav = 1
>> celldm(1) = 15.249332837
>> nat = 48
>> ntyp = 2
>> ecutwfc = 130
>> input_dft = 'pbe0'
>> /
>> &electrons
>> diago_thr_init=5.0d-4
>> mixing_mode = 'plain'
>> mixing_beta = 0.7
>> mixing_ndim = 8
>> diagonalization = 'david'
>> diago_david_ndim = 4
>> diago_full_acc = .true.
>> electron_maxstep=3
>> scf_must_converge=.false.
>> /
>> ATOMIC_SPECIES
>> O 15.999 O.pbe-mt_fhi.UPF
>> H 1.008 H.pbe-mt_fhi.UPF
>> ATOMIC_POSITIONS alat
>> O 0.405369 0.567356 0.442192
>> H 0.471865 0.482160 0.381557
>> H 0.442867 0.572759 0.560178
>> O 0.584679 0.262476 0.215740
>> H 0.689058 0.204790 0.249459
>> H 0.503275 0.179176 0.173433
>> O 0.613936 0.468084 0.701359
>> H 0.720162 0.421081 0.658182
>> H 0.629377 0.503798 0.819016
>> O 0.692499 0.571474 0.008796
>> H 0.815865 0.562339 0.016182
>> H 0.640331 0.489132 0.085318
>> O 0.138542 0.767947 0.322270
>> H 0.052664 0.771819 0.411531
>> H 0.239736 0.710419 0.364788
>> O 0.127282 0.623278 0.765792
>> H 0.075781 0.693268 0.677441
>> H 0.243000 0.662182 0.787094
>> O 0.572799 0.844477 0.542529
>> H 0.556579 0.966998 0.533420
>> H 0.548297 0.791340 0.433292
>> O -0.007677 0.992860 0.095967
>> H 0.064148 1.011844 -0.003219
>> H 0.048026 0.913005 0.172625
>> O 0.035337 0.547318 0.085085
>> H 0.072732 0.625835 0.173379
>> H 0.089917 0.576762 -0.022194
>> O 0.666008 0.900155 0.183677
>> H 0.773299 0.937456 0.134145
>> H 0.609289 0.822407 0.105606
>> O 0.443447 0.737755 0.836152
>> H 0.526041 0.665651 0.893906
>> H 0.483300 0.762549 0.721464
>> O 0.934493 0.378765 0.627850
>> H 1.012721 0.449242 0.693201
>> H 0.955703 0.394823 0.506816
>> O 0.006386 0.270244 0.269327
>> H 0.021231 0.364797 0.190612
>> H 0.021863 0.163251 0.208755
>> O 0.936337 0.855942 0.611999
>> H 0.956610 0.972475 0.648965
>> H 0.815045 0.839173 0.592915
>> O 0.228881 0.037509 0.849634
>> H 0.263938 0.065862 0.734213
>> H 0.282576 -0.068680 0.884220
>> O 0.346187 0.176679 0.553828
>> H 0.247521 0.218347 0.491489
>> H 0.402671 0.271609 0.610010
>> K_POINTS automatic
>> 1 1 1 1 1 1
>>
>>
>>
>>
>
>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>
>