[Pw_forum] Possible bug in QE 5.3.0 band group parallelization
Taylor Barnes
tbarnes at lbl.gov
Tue Jan 26 00:53:16 CET 2016
Dear All,
I have found that calculations involving band group parallelism that
worked correctly using QE 5.2.0 produce errors in version 5.3.0 (see below
for an example input file). In particular, when I run a PBE0 calculation
with either nbgrp or ndiag set to 1, everything runs correctly; however,
when I run a calculation with both nbgrp and ndiag set to values greater
than 1, the calculation fails immediately with the following error messages:
Rank 48 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n2] Fatal error in
PMPI_Group_incl: Invalid rank, error stack:
PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
ranks=0x53a3c80, new_group=0x7fffffff6794) failed
MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
12, has value 0 which is also the value at index 0
Rank 93 [Mon Jan 25 09:52:04 2016] [c0-0c0s14n3] Fatal error in
PMPI_Group_incl: Invalid rank, error stack:
PMPI_Group_incl(173).............: MPI_Group_incl(group=0x88000002, n=36,
ranks=0x538fdf0, new_group=0x7fffffff6794) failed
MPIR_Group_check_valid_ranks(259): Duplicate ranks in rank array at index
12, has value 0 which is also the value at index 0
etc...
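To make the MPI message concrete: MPI_Group_incl requires every entry of the rank array to be distinct, and the library rejects the call at the first repeated entry. The following is a minimal Python sketch of that kind of check. It is not QE or MPICH code. The helper name first_duplicate and the period-12 rank array are purely illustrative assumptions, chosen only because they reproduce the indices in the message above; the array that QE actually builds is not shown here.

```python
def first_duplicate(ranks):
    """Return (index, value, first_index) for the first repeated rank, or None."""
    seen = {}  # rank value -> index at which it first appeared
    for index, value in enumerate(ranks):
        if value in seen:
            return index, value, seen[value]
        seen[value] = index
    return None


# Hypothetical 36-entry rank array that wraps around every 12 entries.
# This pattern is an assumption for illustration, not data from the run.
ranks = [i % 12 for i in range(36)]

print(first_duplicate(ranks))            # (12, 0, 0)
print(first_duplicate(list(range(36))))  # None: all ranks distinct
```

Under this assumed pattern the first failure is "index 12 has value 0, also at index 0", which has the same shape as the reported error; whether the real array wraps in this way would have to be confirmed by printing it just before the MPI_Group_incl call.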
The error is apparently related to a change in Modules/mp_global.f90 on
line 80. Here, the line previously read:
CALL mp_start_diag ( ndiag_, intra_BGRP_comm )
In QE 5.3.0, this has been changed to:
CALL mp_start_diag ( ndiag_, intra_POOL_comm )
The call using intra_BGRP_comm still exists in version 5.3.0 of the
code, but it is commented out, and the surrounding comments indicate that it
should be possible to switch back to the old parallelization by swapping
which of the two calls is commented out. When I do this, the MPI errors
described above no longer appear, but the run instead fails with:
Error in routine cdiaghg(193):
problems computing cholesky
Am I missing something, or are these errors the result of a bug?
Best Regards,
Dr. Taylor Barnes,
Lawrence Berkeley National Laboratory
=================
Run Command:
=================
srun -n 96 pw.x -nbgrp 4 -in input > input.out
=================
Input File:
=================
&control
prefix = 'water'
calculation = 'scf'
restart_mode = 'from_scratch'
wf_collect = .true.
disk_io = 'none'
tstress = .false.
tprnfor = .false.
outdir = './'
wfcdir = './'
pseudo_dir = '/global/homes/t/tabarnes/espresso/pseudo'
/
&system
ibrav = 1
celldm(1) = 15.249332837
nat = 48
ntyp = 2
ecutwfc = 130
input_dft = 'pbe0'
/
&electrons
diago_thr_init=5.0d-4
mixing_mode = 'plain'
mixing_beta = 0.7
mixing_ndim = 8
diagonalization = 'david'
diago_david_ndim = 4
diago_full_acc = .true.
electron_maxstep=3
scf_must_converge=.false.
/
ATOMIC_SPECIES
O 15.999 O.pbe-mt_fhi.UPF
H 1.008 H.pbe-mt_fhi.UPF
ATOMIC_POSITIONS alat
O 0.405369 0.567356 0.442192
H 0.471865 0.482160 0.381557
H 0.442867 0.572759 0.560178
O 0.584679 0.262476 0.215740
H 0.689058 0.204790 0.249459
H 0.503275 0.179176 0.173433
O 0.613936 0.468084 0.701359
H 0.720162 0.421081 0.658182
H 0.629377 0.503798 0.819016
O 0.692499 0.571474 0.008796
H 0.815865 0.562339 0.016182
H 0.640331 0.489132 0.085318
O 0.138542 0.767947 0.322270
H 0.052664 0.771819 0.411531
H 0.239736 0.710419 0.364788
O 0.127282 0.623278 0.765792
H 0.075781 0.693268 0.677441
H 0.243000 0.662182 0.787094
O 0.572799 0.844477 0.542529
H 0.556579 0.966998 0.533420
H 0.548297 0.791340 0.433292
O -0.007677 0.992860 0.095967
H 0.064148 1.011844 -0.003219
H 0.048026 0.913005 0.172625
O 0.035337 0.547318 0.085085
H 0.072732 0.625835 0.173379
H 0.089917 0.576762 -0.022194
O 0.666008 0.900155 0.183677
H 0.773299 0.937456 0.134145
H 0.609289 0.822407 0.105606
O 0.443447 0.737755 0.836152
H 0.526041 0.665651 0.893906
H 0.483300 0.762549 0.721464
O 0.934493 0.378765 0.627850
H 1.012721 0.449242 0.693201
H 0.955703 0.394823 0.506816
O 0.006386 0.270244 0.269327
H 0.021231 0.364797 0.190612
H 0.021863 0.163251 0.208755
O 0.936337 0.855942 0.611999
H 0.956610 0.972475 0.648965
H 0.815045 0.839173 0.592915
O 0.228881 0.037509 0.849634
H 0.263938 0.065862 0.734213
H 0.282576 -0.068680 0.884220
O 0.346187 0.176679 0.553828
H 0.247521 0.218347 0.491489
H 0.402671 0.271609 0.610010
K_POINTS automatic
1 1 1 1 1 1