[QE-users] MPI problem when parallelizing with more than 12 cores

a.pramos a.pramos at alumnos.upm.es
Wed Mar 15 09:44:41 CET 2023


Dear everyone,

I am running QE 7.1 on an AMD EPYC 7763 machine under Ubuntu 22.04 LTS. When 
launching an input with the following command:
mpirun -np 24 pw.x < NiOH3.in > NiOH3.out
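For completeness, here is the same launch with the redirections spelled out, plus the equivalent form using pw.x's own -inp flag, which does not depend on mpirun forwarding stdin to the ranks (some launchers only deliver stdin to rank 0):

```shell
# launch as used, with the shell redirections made explicit
mpirun -np 24 pw.x < NiOH3.in > NiOH3.out

# equivalent launch via pw.x's input-file flag; no stdin forwarding needed
mpirun -np 24 pw.x -inp NiOH3.in > NiOH3.out
```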

The program fails after writing these lines in the output:
      Estimated max dynamical RAM per process >       4.68 GB

      Estimated total dynamical RAM >     112.25 GB

      Check: negative core charge=   -0.000002
      Generating pointlists ...
      new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    1
      new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    2
      new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    3

      Initial potential from superposition of free atoms

      starting charge     809.9994, renormalised to     864.0000
      Starting wfcs are  621 randomized atomic wfcs

No CRASH file is generated, and the console output points to a problem 
with MPI communication:
[ec4302-MZ01-CE1-00:56280] *** An error occurred in MPI_Comm_free
[ec4302-MZ01-CE1-00:56280] *** reported by process [507510785,1]
[ec4302-MZ01-CE1-00:56280] *** on communicator MPI_COMM_WORLD
[ec4302-MZ01-CE1-00:56280] *** MPI_ERR_COMM: invalid communicator
[ec4302-MZ01-CE1-00:56280] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[ec4302-MZ01-CE1-00:56280] ***    and potentially your MPI job)
[ec4302-MZ01-CE1-00:56275] 4 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[ec4302-MZ01-CE1-00:56275] Set MCA parameter "orte_base_help_aggregate" 
to 0 to see all help / error messages

This only happens, however, with 24 cores. With 12 cores the code runs 
as usual, and the same parallelization with other software has never 
given any trouble, which suggests to me that the problem is not in MPI 
itself.

Best regards,
Álvaro

