[Pw_forum] Possible bug in QE 5.3.0 band group parallelization

Taylor Barnes tbarnes at lbl.gov
Wed Jan 27 08:35:59 CET 2016


Hi Paolo,

   Thanks very much for this.  Just to clarify:

Version 5.2.0, no band parallelization - works
Version 5.2.0, with band parallelization - works
Version 5.3.0, new call to mp_start_diag, no band parallelization - works
Version 5.3.0, old call to mp_start_diag, no band parallelization - works
Version 5.3.0, new call to mp_start_diag, with band parallelization - fails with "Duplicate ranks in rank array" errors
Version 5.3.0, old call to mp_start_diag, with band parallelization - fails with "problems computing cholesky" errors

   All of these tests are performed on a Cray XC.

Best,
Taylor


On Tue, Jan 26, 2016 at 3:23 AM, Paolo Giannozzi <p.giannozzi at gmail.com>
wrote:

> Recent changes to the way band parallelization is performed seem to be
> incompatible with Scalapack. The problem is related to the obscure hacks
> needed to convince Scalapack to work in a subgroup of processors. If you
> revert to the previous way of setting linear-algebra parallelization,
> things should work (or not work) as before, so the latter problem you
> mention may have other origins. You should verify whether you can run
> - with the new version, old call to mp_start_diag, no band parallelization
> - with an old version, with or without band parallelization
> BEWARE: all versions < 5.3 use an incorrect definition of B3LYP, leading
> to small but non-negligible discrepancies with the results of other codes
>
> Paolo
>
> On Tue, Jan 26, 2016 at 12:53 AM, Taylor Barnes <tbarnes at lbl.gov> wrote:
>
>> Dear All,
>>
>>    I have found that calculations involving band group parallelism that
>> worked correctly using QE 5.2.0 produce errors in version 5.3.0 (see below
>> for an example input file).  In particular, when I run a PBE0 calculation
>> with either nbgrp or ndiag set to 1, everything runs correctly; however,
>> when I run a calculation with both nbgrp and ndiag set to values greater
>> than 1, the calculation immediately fails with the following error messages:
>>
>> Rank 48 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n2] Fatal error in
>> PMPI_Group_incl: Invalid rank, error stack:
>> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
>> ranks=0x53a3c80, new_group=0x7fffffff6794) failed
>> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
>> 12, has value 0 which is also the value at index 0
>> Rank 93 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n3] Fatal error in
>> PMPI_Group_incl: Invalid rank, error stack:
>> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
>> ranks=0x538fdf0, new_group=0x7fffffff6794) failed
>> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
>> 12, has value 0 which is also the value at index 0
>> etc...
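>>
>>    As a side note, the MPI error itself just means that the rank array
>> passed to MPI_Group_incl contains a repeated entry; MPI requires every
>> rank in that list to be distinct.  A minimal standalone Fortran
>> illustration of this (completely separate from the QE sources, and
>> needing at least two MPI processes) would be:
>>
>> program dup_ranks
>>   use mpi
>>   implicit none
>>   integer :: ierr, world_group, new_group
>>   integer :: ranks(3)
>>   ranks = (/ 0, 1, 0 /)   ! rank 0 appears twice -> invalid rank array
>>   call MPI_Init(ierr)
>>   call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)
>>   ! aborts with "Duplicate ranks in rank array", as in the traceback above
>>   call MPI_Group_incl(world_group, 3, ranks, new_group, ierr)
>>   call MPI_Finalize(ierr)
>> end program dup_ranks
>>
>>    So somewhere in the new band-group / diagonalization setup the same
>> rank apparently ends up in the rank list for the diagonalization group
>> more than once.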
>>
>>    The error is apparently related to a change in Modules/mp_global.f90
>> on line 80.  Here, the line previously read:
>>
>> CALL mp_start_diag  ( ndiag_, intra_BGRP_comm )
>>
>> In QE 5.3.0, this has been changed to:
>>
>> CALL mp_start_diag  ( ndiag_, intra_POOL_comm )
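>>
>>    (In 5.3.0 the surrounding block presumably looks roughly like the
>> sketch below; the comments here are paraphrased, not copied from the
>> source.)
>>
>> ! diagonalization group built from the whole pool (new default in 5.3.0)
>> CALL mp_start_diag  ( ndiag_, intra_POOL_comm )
>> ! old behaviour: comment the line above and uncomment the one below
>> ! CALL mp_start_diag  ( ndiag_, intra_BGRP_comm )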
>>
>>    The call using intra_BGRP_comm still exists in version 5.3.0 of the
>> code, but is commented out, and the surrounding comments indicate that it
>> should be possible to switch back to the old parallelization by
>> commenting/uncommenting as desired.  When I do this, instead of the error
>> messages described above I get the following error:
>>
>> Error in routine  cdiaghg(193):
>>   problems computing cholesky
>>
>>    Am I missing something, or are these errors the result of a bug?
>>
>> Best Regards,
>>
>> Dr. Taylor Barnes,
>> Lawrence Berkeley National Laboratory
>>
>>
>> =================
>> Run Command:
>> =================
>>
>> srun -n 96 pw.x -nbgrp 4 -in input > input.out
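>>
>> For reference, a run with the diagonalization group reduced to a single
>> process (the combination that works, as described above) would use the
>> standard -ndiag command-line option of pw.x, e.g.:
>>
>> srun -n 96 pw.x -nbgrp 4 -ndiag 1 -in input > input.out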
>>
>>
>>
>> =================
>> Input File:
>> =================
>>
>> &control
>> prefix = 'water'
>> calculation = 'scf'
>> restart_mode = 'from_scratch'
>> wf_collect = .true.
>> disk_io = 'none'
>> tstress = .false.
>> tprnfor = .false.
>> outdir = './'
>> wfcdir = './'
>> pseudo_dir = '/global/homes/t/tabarnes/espresso/pseudo'
>> /
>> &system
>> ibrav = 1
>> celldm(1) = 15.249332837
>> nat = 48
>> ntyp = 2
>> ecutwfc = 130
>> input_dft = 'pbe0'
>> /
>> &electrons
>> diago_thr_init=5.0d-4
>> mixing_mode = 'plain'
>> mixing_beta = 0.7
>> mixing_ndim = 8
>> diagonalization = 'david'
>> diago_david_ndim = 4
>> diago_full_acc = .true.
>> electron_maxstep=3
>> scf_must_converge=.false.
>> /
>> ATOMIC_SPECIES
>> O   15.999   O.pbe-mt_fhi.UPF
>> H    1.008   H.pbe-mt_fhi.UPF
>> ATOMIC_POSITIONS alat
>>  O   0.405369   0.567356   0.442192
>>  H   0.471865   0.482160   0.381557
>>  H   0.442867   0.572759   0.560178
>>  O   0.584679   0.262476   0.215740
>>  H   0.689058   0.204790   0.249459
>>  H   0.503275   0.179176   0.173433
>>  O   0.613936   0.468084   0.701359
>>  H   0.720162   0.421081   0.658182
>>  H   0.629377   0.503798   0.819016
>>  O   0.692499   0.571474   0.008796
>>  H   0.815865   0.562339   0.016182
>>  H   0.640331   0.489132   0.085318
>>  O   0.138542   0.767947   0.322270
>>  H   0.052664   0.771819   0.411531
>>  H   0.239736   0.710419   0.364788
>>  O   0.127282   0.623278   0.765792
>>  H   0.075781   0.693268   0.677441
>>  H   0.243000   0.662182   0.787094
>>  O   0.572799   0.844477   0.542529
>>  H   0.556579   0.966998   0.533420
>>  H   0.548297   0.791340   0.433292
>>  O  -0.007677   0.992860   0.095967
>>  H   0.064148   1.011844  -0.003219
>>  H   0.048026   0.913005   0.172625
>>  O   0.035337   0.547318   0.085085
>>  H   0.072732   0.625835   0.173379
>>  H   0.089917   0.576762  -0.022194
>>  O   0.666008   0.900155   0.183677
>>  H   0.773299   0.937456   0.134145
>>  H   0.609289   0.822407   0.105606
>>  O   0.443447   0.737755   0.836152
>>  H   0.526041   0.665651   0.893906
>>  H   0.483300   0.762549   0.721464
>>  O   0.934493   0.378765   0.627850
>>  H   1.012721   0.449242   0.693201
>>  H   0.955703   0.394823   0.506816
>>  O   0.006386   0.270244   0.269327
>>  H   0.021231   0.364797   0.190612
>>  H   0.021863   0.163251   0.208755
>>  O   0.936337   0.855942   0.611999
>>  H   0.956610   0.972475   0.648965
>>  H   0.815045   0.839173   0.592915
>>  O   0.228881   0.037509   0.849634
>>  H   0.263938   0.065862   0.734213
>>  H   0.282576  -0.068680   0.884220
>>  O   0.346187   0.176679   0.553828
>>  H   0.247521   0.218347   0.491489
>>  H   0.402671   0.271609   0.610010
>> K_POINTS automatic
>> 1 1 1 1 1 1
>>
>>
>>
>>
>>
>
>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>