[QE-users] MPI problem when parallelizing with more than 12 cores

a.pramos a.pramos at alumnos.upm.es
Fri Mar 31 08:03:47 CEST 2023


Thank you, Paolo

I've now compiled with ScaLAPACK, but the problem has changed into 
something else. When launching the code with the command mpirun -np 32 
pw.x -nd 1 < NiOHsupercell1.in > NiOHsupercell1.out, I get the 
following error:
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 9 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[ec4302-MZ01-CE1-00:111052] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2198
[ec4302-MZ01-CE1-00:111052] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2198
[ec4302-MZ01-CE1-00:111052] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2198
[ec4302-MZ01-CE1-00:111052] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2198
[ec4302-MZ01-CE1-00:111052] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2198
[ec4302-MZ01-CE1-00:111052] 5 more processes have sent help message 
help-mpi-api.txt / mpi-abort
[ec4302-MZ01-CE1-00:111052] Set MCA parameter "orte_base_help_aggregate" 
to 0 to see all help / error messages
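
Following the hint at the end of the log, I suppose I could rerun with 
help aggregation disabled to see every message (assuming the usual 
Open MPI --mca syntax applies to my installation):

    mpirun --mca orte_base_help_aggregate 0 -np 32 pw.x -nd 1 < NiOHsupercell1.in > NiOHsupercell1.out

in case the extra output is useful.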

For some reason, 22 is the maximum number of MPI processes I can use, 
and this limit does not show up when running parallel jobs with other 
software. Changing the -nd value does not change anything.
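
To rule out the launcher itself, a trivial check outside of QE (just 
an illustration, nothing QE-specific) would be something like:

    mpirun -np 32 hostname

which should simply print the host name 32 times if the launcher can 
start that many processes.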
During configuration of QE, this was the command I used:
./configure MPIF90=mpif90 CC=gcc --enable-parallel --with-scalapack 
SCALAPACK_LIBS="-L/home/ec4302/scalapack-2.2.0 -lscalapack" 
BLAS_LIBS="-L/usr/lib/x86_64-linux-gnu/blas -lblas" LAPACK_LIBS="-L/usr/lib
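
To double-check what configure actually picked up, I assume one can 
inspect the generated make.inc (this is just the standard QE build 
layout, nothing specific to my setup):

    grep -E 'DFLAGS|BLAS_LIBS|LAPACK_LIBS|SCALAPACK_LIBS' make.inc

and verify that -D__SCALAPACK appears in DFLAGS and that the library 
paths are the intended ones.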

I don't know much about PMIx, but it seems to be used only with 
Open MPI, which isn't the case here
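
To be sure which MPI implementation is actually behind mpirun, I guess 
the generic checks would be:

    which mpirun
    mpirun --version

(an Open MPI launcher identifies itself as such in the version string).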

Álvaro

On 2023-03-22 16:38, Paolo Giannozzi wrote:
> After two independent reports of a similar problem, several tests and
> a considerable amount of head-scratching, I came to the conclusion
> that the problem is no longer present in the development version (to
> be released as v.7.2 no later than this week). For more explanations,
> see issue https://gitlab.com/QEF/q-e/-/issues/572, the related
> comments and the fix provided by Miroslav Iliaš
> 
> Paolo
> 
> On 15/03/2023 09:44, a.pramos wrote:
>> Dear everyone,
>> 
>> I am running QE 7.1 on an AMD EPYC 7763 build with Ubuntu 22.04 LTS. 
>> When launching an input with the following command:
>> mpirun -np 24 pw.x < NiOH3.in > NiOH3.out
>> 
>> The program fails after writing these lines in the output:
>>       Estimated max dynamical RAM per process >       4.68 GB
>> 
>>       Estimated total dynamical RAM >     112.25 GB
>> 
>>       Check: negative core charge=   -0.000002
>>       Generating pointlists ...
>>       new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    1
>>       new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    2
>>       new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    3
>> 
>>       Initial potential from superposition of free atoms
>> 
>>       starting charge     809.9994, renormalised to     864.0000
>>       Starting wfcs are  621 randomized atomic wfcs
>> 
>> No CRASH file is generated and the lines in the console mention a 
>> problem with MPI communication:
>> [ec4302-MZ01-CE1-00:56280] *** An error occurred in MPI_Comm_free
>> [ec4302-MZ01-CE1-00:56280] *** reported by process [507510785,1]
>> [ec4302-MZ01-CE1-00:56280] *** on communicator MPI_COMM_WORLD
>> [ec4302-MZ01-CE1-00:56280] *** MPI_ERR_COMM: invalid communicator
>> [ec4302-MZ01-CE1-00:56280] *** MPI_ERRORS_ARE_FATAL (processes in this 
>> communicator will now abort,
>> [ec4302-MZ01-CE1-00:56280] ***    and potentially your MPI job)
>> [ec4302-MZ01-CE1-00:56275] 4 more processes have sent help message 
>> help-mpi-errors.txt / mpi_errors_are_fatal
>> [ec4302-MZ01-CE1-00:56275] Set MCA parameter 
>> "orte_base_help_aggregate" to 0 to see all help / error messages
>> 
>> This only happens, however, when using 24 cores. Using 12 makes the 
>> code run as usual, and the same parallelization with other software 
>> has not given any trouble, which suggests to me that this is not a 
>> problem with MPI.
>> 
>> Best regards,
>> Álvaro