[QE-users] MPI problem when parallelizing with more than 12 cores
a.pramos
a.pramos at alumnos.upm.es
Wed Mar 15 09:44:41 CET 2023
Dear everyone,
I am running QE 7.1 on an AMD EPYC 7763 machine under Ubuntu 22.04 LTS. When
I launch an input with the following command:
mpirun -np 24 pw.x < NiOH3.in > NiOH3.out
the program fails after writing these lines in the output:
Estimated max dynamical RAM per process > 4.68 GB
Estimated total dynamical RAM > 112.25 GB
Check: negative core charge= -0.000002
Generating pointlists ...
new r_m : 0.0031 (alat units) 0.0555 (a.u.) for type 1
new r_m : 0.0031 (alat units) 0.0555 (a.u.) for type 2
new r_m : 0.0031 (alat units) 0.0555 (a.u.) for type 3
Initial potential from superposition of free atoms
starting charge 809.9994, renormalised to 864.0000
Starting wfcs are 621 randomized atomic wfcs
No CRASH file is generated, and the console output points to a
problem with MPI communication:
[ec4302-MZ01-CE1-00:56280] *** An error occurred in MPI_Comm_free
[ec4302-MZ01-CE1-00:56280] *** reported by process [507510785,1]
[ec4302-MZ01-CE1-00:56280] *** on communicator MPI_COMM_WORLD
[ec4302-MZ01-CE1-00:56280] *** MPI_ERR_COMM: invalid communicator
[ec4302-MZ01-CE1-00:56280] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[ec4302-MZ01-CE1-00:56280] *** and potentially your MPI job)
[ec4302-MZ01-CE1-00:56275] 4 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[ec4302-MZ01-CE1-00:56275] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
This only happens, however, when using 24 cores. With 12 cores the code
runs as usual, and the same parallelization with other software has
given no trouble, which suggests to me that the problem is not in MPI
itself.
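As a quick cross-check of that reasoning, one can try launching 24 ranks with plain Open MPI and a trivial payload, entirely outside pw.x. This snippet is my own sketch, not from the thread; `--oversubscribe` is an Open MPI flag that permits more ranks than detected slots, and `hostname` is just a stand-in program:

```shell
# Sanity check (a sketch, not from the original post): can Open MPI
# alone start 24 ranks on this node? If this fails too, the problem is
# in the MPI setup rather than in pw.x.
if command -v mpirun >/dev/null 2>&1; then
    result=$(mpirun -np 24 --oversubscribe hostname 2>&1) \
        && echo "24-rank launch OK" \
        || echo "24-rank launch failed: $result"
else
    echo "mpirun not found in PATH; install Open MPI first"
fi
```

If the bare 24-rank launch succeeds, the communicator error is more likely triggered by how pw.x splits MPI_COMM_WORLD (e.g. its -nk/-nd sub-communicators) than by the MPI installation itself.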
Best regards,
Álvaro