[QE-users] Test-suite failures with parallel QE
Ian Dunn
ian.dunn at asml.com
Mon Apr 7 17:14:28 CEST 2025
Hi all,
A few isolated tests in the test-suite are failing, and I'm also seeing a general OpenFabrics initialization error. I'd like to understand why these happen and whether they are "OK". With a serial build using gfortran 13.1.0, every test that isn't skipped passes. The failures only appear when I switch to a parallel build using openmpi/4.1.6. Can anyone point me toward a robust parallel build? Thanks in advance!
Some details on my configuration:
GCC/Gfortran 13.1.0
QE 7.4.1
Openmpi 4.1.6
Running make run-tests NPROCS=12
Red Hat Enterprise Linux 8
Using QE internal BLAS & LAPACK
Many of the tests print messages like the following, even when they pass:
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: pn5657
Local device: mlx5_0
--------------------------------------------------------------------------
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
[pn5657:3197139] 11 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[pn5657:3197139] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
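For anyone hitting the same warning, one common way to narrow it down is through Open MPI's MCA parameters. The lines below are a minimal sketch, not a confirmed fix for this cluster. They assume the openib BTL is not needed on the node (for instance because another transport carries the traffic), and they rely on Open MPI reading MCA parameters from environment variables named OMPI_MCA_<parameter>. The second variable is the one the log itself suggests.

```shell
# Assumption: the openib BTL can be excluded on this node without losing
# the transport that the jobs actually use.
export OMPI_MCA_btl='^openib'

# Diagnostic only: print every help/error message instead of the
# aggregated "11 more processes have sent help message" summary.
export OMPI_MCA_orte_base_help_aggregate=0

# Then rerun the suite as before, from the test-suite directory:
# make run-tests NPROCS=12
```

Because the variables are exported, the mpirun calls launched by make run-tests inherit them. If the warning disappears and the test results are unchanged, the message was most likely cosmetic; that is an assumption to verify, not something the log proves.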
Here are the tests that are failing:
1. pw_plugins - plugin-pw2casino_1.in (arg(s): 1): **FAILED**.
Different sets of data extracted from benchmark and test.
Data only in benchmark: p1.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Error in routine pw2casino (1):
pool/band/image parallelization not (yet) implemented
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
stopping ...
2. pw_vdw - xdm.in: **FAILED**.
ef1
ERROR: absolute error 5.62e-01 greater than 8.00e-02. (Test: 10.7872. Benchmark: 10.2253.)
ERROR: relative error 5.50e-02 greater than 2.00e-02. (Test: 10.7872. Benchmark: 10.2253.)
3. cp_al_edft - Al.uspp.in: **FAILED**.
t1
ERROR: absolute error 1.75e-02 greater than 6.00e-03. (Test: 159.46581. Benchmark: 159.44833.)
4. ph_1d - ch4.scf.in (arg(s): 1): **FAILED**.
n1
ERROR: absolute error 6.00e+00 greater than 5.00e+00. (Test: 32.0. Benchmark: 26.0.)
5. /hpc/data/idunn/qe/7.4.1/test-suite/..//test-suite/run-hp.sh 2 Fe.scf.in test.out.070425-2.inp=Fe.scf.in.args=2 test.err.070425-2.inp=Fe.scf.in.args=2
Running PW ...
mpirun -np 12 /hpc/data/idunn/qe/7.4.1/test-suite/..//bin/pw.x < Fe.scf.in > test.out.070425-2.inp=Fe.scf.in.args=2 2> test.err.070425-2.inp=Fe.scf.in.args=2
hp_metal_paw_magn - Fe.scf.in (arg(s): 2): **FAILED**.
n1
ERROR: absolute error 6.00e+00 greater than 5.00e+00. (Test: 31.0. Benchmark: 25.0.)
6. /hpc/data/idunn/qe/7.4.1/test-suite/..//test-suite/run-hp.sh 4 bn.hp.in test.out.070425-2.inp=bn.hp.in.args=4 test.err.070425-2.inp=bn.hp.in.args=4
Running HP ...
mpirun -np 12 /hpc/data/idunn/qe/7.4.1/test-suite/..//bin/hp.x < bn.hp.in > test.out.070425-2.inp=bn.hp.in.args=4 2> test.err.070425-2.inp=bn.hp.in.args=4
hp_soc_UV_paw_magn - bn.hp.in (arg(s): 4): **FAILED**.
v2
ERROR: absolute error 1.37e-02 greater than 1.50e-03. (Test: -0.1254. Benchmark: -0.1117.)
ERROR: relative error 1.23e-01 greater than 1.80e-04. (Test: -0.1254. Benchmark: -0.1117.)
v1
ERROR: absolute error 1.72e+00 greater than 1.50e-03. (Test: 6.4294. Benchmark: 4.7069.)
ERROR: relative error 3.66e-01 greater than 1.20e-04. (Test: 6.4294. Benchmark: 4.7069.)
u
ERROR: absolute error 1.72e+00 greater than 1.50e-03. (Test: 6.4294. Benchmark: 4.7069.)
ERROR: relative error 3.66e-01 greater than 1.20e-04. (Test: 6.4294. Benchmark: 4.7069.)
7. It seems all the KCW tests that need the kcw executable are failing with error messages like:
mpirun was unable to launch the specified application as it could not access
or execute an executable:
Executable: /hpc/data/sm-euv_rs/idunn/qe/7.4.1/test-suite/..//bin/kcw.x
Node: pn5657
while attempting to start process rank 0.
I'm not sure why kcw.x isn't in the bin folder.
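A quick way to see which executables the build actually produced is to check the bin directory directly. The helper below is a minimal sketch: the function name missing_executables and the QE_ROOT variable are hypothetical, introduced only for illustration, and the list of executables is taken from the commands in the logs above.

```shell
# Hypothetical helper: print each named executable that is missing, or
# present but not executable, in the given bin directory.
missing_executables() {
    bindir=$1
    shift
    for exe in "$@"; do
        if [ ! -f "$bindir/$exe" ] || [ ! -x "$bindir/$exe" ]; then
            echo "$exe"
        fi
    done
}

# QE_ROOT is an assumption: point it at the top of the QE source tree.
QE_ROOT=${QE_ROOT:-$PWD}
missing_executables "$QE_ROOT/bin" pw.x hp.x kcw.x
```

If kcw.x is reported while pw.x and hp.x are not, the executable was not produced by the build that was run. Running make with no arguments at the top of the QE tree prints the available targets, which should show whether kcw.x needs a target of its own.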
Best regards,
Ian Dunn (he/him)
ASML Wilton MDEV Analysis Architect