[Pw_forum] Possible bug in QE 5.3.0 band group parallelization

Paolo Giannozzi p.giannozzi at gmail.com
Tue Jan 26 12:23:58 CET 2016


Recent changes to the way band parallelization is performed seem to be
incompatible with ScaLAPACK. The problem is related to the obscure hacks
needed to convince ScaLAPACK to work in a subgroup of processors. If you
revert to the previous way of setting up linear-algebra parallelization,
things should work (or not work) as before, so the second problem you
mention (the Cholesky failure in cdiaghg) may have a different origin.
You should verify whether you manage to run:
- with the new version, the old call to mp_start_diag, and no band
  parallelization;
- with an old version, with or without band parallelization.
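For the first case, the switch is the one you describe below in
Modules/mp_global.f90 (around line 80); a sketch of the relevant lines,
assuming the surrounding code is as in your report:

! 5.3.0 default: the linear-algebra group spans the whole pool
!CALL mp_start_diag  ( ndiag_, intra_POOL_comm )
! previous (5.2.0) behaviour: linear algebra within each band group
CALL mp_start_diag  ( ndiag_, intra_BGRP_comm )
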
BEWARE: all versions < 5.3 use an incorrect definition of B3LYP, leading to
small but non-negligible discrepancies with the results of other codes.

Paolo

On Tue, Jan 26, 2016 at 12:53 AM, Taylor Barnes <tbarnes at lbl.gov> wrote:

> Dear All,
>
>    I have found that calculations involving band group parallelism that
> worked correctly using QE 5.2.0 produce errors in version 5.3.0 (see below
> for an example input file).  In particular, when I run a PBE0 calculation
> with either nbgrp or ndiag set to 1, everything runs correctly; however,
> when I run a calculation with both nbgrp and ndiag set to values greater than 1, the
> calculation immediately fails with the following error messages:
>
> Rank 48 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n2] Fatal error in
> PMPI_Group_incl: Invalid rank, error stack:
> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
> ranks=0x53a3c80, new_group=0x7fffffff6794) failed
> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
> 12, has value 0 which is also the value at index 0
> Rank 93 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n3] Fatal error in
> PMPI_Group_incl: Invalid rank, error stack:
> PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
> ranks=0x538fdf0, new_group=0x7fffffff6794) failed
> MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
> 12, has value 0 which is also the value at index 0
> etc...
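>
> For reference, MPI forbids duplicate entries in the ranks array passed to
> MPI_Group_incl, and that is exactly the check failing here. A minimal
> standalone reproducer (a sketch, not code taken from QE) that trips the
> same error in an error-checking MPI library:
>
> program dup_ranks
>   use mpi
>   implicit none
>   integer :: ierr, world_group, new_group
>   ! rank 0 appears twice, mimicking the duplicate reported at index 12
>   integer :: ranks(2) = (/ 0, 0 /)
>   call MPI_Init(ierr)
>   call MPI_Comm_group(MPI_COMM_WORLD, world_group, ierr)
>   ! a checking MPI library aborts here: "Duplicate ranks in rank array"
>   call MPI_Group_incl(world_group, 2, ranks, new_group, ierr)
>   call MPI_Finalize(ierr)
> end program dup_ranks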
>
>    The error is apparently related to a change in Modules/mp_global.f90 on
> line 80.  Here, the line previously read:
>
> CALL mp_start_diag  ( ndiag_, intra_BGRP_comm )
>
> In QE 5.3.0, this has been changed to:
>
> CALL mp_start_diag  ( ndiag_, intra_POOL_comm )
>
>    The call using intra_BGRP_comm still exists in version 5.3.0 of the
> code, but is commented out, and the surrounding comments indicate that it
> should be possible to switch back to the old parallelization by
> commenting/uncommenting as desired.  When I do this, I find that instead of
> the error messages described above, I get the following error messages:
>
> Error in routine  cdiaghg(193):
>   problems computing cholesky
>
>    Am I missing something, or are these errors the result of a bug?
>
> Best Regards,
>
> Dr. Taylor Barnes,
> Lawrence Berkeley National Laboratory
>
>
> =================
> Run Command:
> =================
>
> srun -n 96 pw.x -nbgrp 4 -in input > input.out
>
>
>
> =================
> Input File:
> =================
>
> &control
> prefix = 'water'
> calculation = 'scf'
> restart_mode = 'from_scratch'
> wf_collect = .true.
> disk_io = 'none'
> tstress = .false.
> tprnfor = .false.
> outdir = './'
> wfcdir = './'
> pseudo_dir = '/global/homes/t/tabarnes/espresso/pseudo'
> /
> &system
> ibrav = 1
> celldm(1) = 15.249332837
> nat = 48
> ntyp = 2
> ecutwfc = 130
> input_dft = 'pbe0'
> /
> &electrons
> diago_thr_init=5.0d-4
> mixing_mode = 'plain'
> mixing_beta = 0.7
> mixing_ndim = 8
> diagonalization = 'david'
> diago_david_ndim = 4
> diago_full_acc = .true.
> electron_maxstep=3
> scf_must_converge=.false.
> /
> ATOMIC_SPECIES
> O   15.999   O.pbe-mt_fhi.UPF
> H    1.008   H.pbe-mt_fhi.UPF
> ATOMIC_POSITIONS alat
>  O   0.405369   0.567356   0.442192
>  H   0.471865   0.482160   0.381557
>  H   0.442867   0.572759   0.560178
>  O   0.584679   0.262476   0.215740
>  H   0.689058   0.204790   0.249459
>  H   0.503275   0.179176   0.173433
>  O   0.613936   0.468084   0.701359
>  H   0.720162   0.421081   0.658182
>  H   0.629377   0.503798   0.819016
>  O   0.692499   0.571474   0.008796
>  H   0.815865   0.562339   0.016182
>  H   0.640331   0.489132   0.085318
>  O   0.138542   0.767947   0.322270
>  H   0.052664   0.771819   0.411531
>  H   0.239736   0.710419   0.364788
>  O   0.127282   0.623278   0.765792
>  H   0.075781   0.693268   0.677441
>  H   0.243000   0.662182   0.787094
>  O   0.572799   0.844477   0.542529
>  H   0.556579   0.966998   0.533420
>  H   0.548297   0.791340   0.433292
>  O  -0.007677   0.992860   0.095967
>  H   0.064148   1.011844  -0.003219
>  H   0.048026   0.913005   0.172625
>  O   0.035337   0.547318   0.085085
>  H   0.072732   0.625835   0.173379
>  H   0.089917   0.576762  -0.022194
>  O   0.666008   0.900155   0.183677
>  H   0.773299   0.937456   0.134145
>  H   0.609289   0.822407   0.105606
>  O   0.443447   0.737755   0.836152
>  H   0.526041   0.665651   0.893906
>  H   0.483300   0.762549   0.721464
>  O   0.934493   0.378765   0.627850
>  H   1.012721   0.449242   0.693201
>  H   0.955703   0.394823   0.506816
>  O   0.006386   0.270244   0.269327
>  H   0.021231   0.364797   0.190612
>  H   0.021863   0.163251   0.208755
>  O   0.936337   0.855942   0.611999
>  H   0.956610   0.972475   0.648965
>  H   0.815045   0.839173   0.592915
>  O   0.228881   0.037509   0.849634
>  H   0.263938   0.065862   0.734213
>  H   0.282576  -0.068680   0.884220
>  O   0.346187   0.176679   0.553828
>  H   0.247521   0.218347   0.491489
>  H   0.402671   0.271609   0.610010
> K_POINTS automatic
> 1 1 1 1 1 1



-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222