[QE-users] MPI problem when parallelizing with more than 12 cores

Paolo Giannozzi paolo.giannozzi at uniud.it
Wed Mar 22 09:38:16 CET 2023


After two independent reports of a similar problem, several tests, and a 
considerable amount of head-scratching, I came to the conclusion that 
the problem is no longer present in the development version (to be 
released as v.7.2 no later than this week). For more details, see issue 
https://gitlab.com/QEF/q-e/-/issues/572, the related comments, and the 
fix provided by Miroslav Iliaš.
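
For those who need the fix before the release, a minimal sketch of one 
way to fetch and build the development version (assuming git, a Fortran 
compiler, and an MPI library are already installed; "make pw" builds 
only the pw.x executable):

     git clone https://gitlab.com/QEF/q-e.git
     cd q-e
     ./configure    # detects compilers and the MPI library
     make pw        # builds bin/pw.x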

Paolo

On 15/03/2023 09:44, a.pramos wrote:
> Dear everyone,
> 
> I am running QE 7.1 on an AMD EPYC 7763 machine with Ubuntu 22.04 LTS. 
> When launching an input with the following command:
> mpirun -np 24 pw.x < NiOH3.in > NiOH3.out
> 
> The program fails after writing these lines in the output:
>       Estimated max dynamical RAM per process >       4.68 GB
> 
>       Estimated total dynamical RAM >     112.25 GB
> 
>       Check: negative core charge=   -0.000002
>       Generating pointlists ...
>       new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    1
>       new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    2
>       new r_m :   0.0031 (alat units)  0.0555 (a.u.) for type    3
> 
>       Initial potential from superposition of free atoms
> 
>       starting charge     809.9994, renormalised to     864.0000
>       Starting wfcs are  621 randomized atomic wfcs
> 
> No CRASH file is generated, and the console output points to a problem 
> with MPI communication:
> [ec4302-MZ01-CE1-00:56280] *** An error occurred in MPI_Comm_free
> [ec4302-MZ01-CE1-00:56280] *** reported by process [507510785,1]
> [ec4302-MZ01-CE1-00:56280] *** on communicator MPI_COMM_WORLD
> [ec4302-MZ01-CE1-00:56280] *** MPI_ERR_COMM: invalid communicator
> [ec4302-MZ01-CE1-00:56280] *** MPI_ERRORS_ARE_FATAL (processes in this 
> communicator will now abort,
> [ec4302-MZ01-CE1-00:56280] ***    and potentially your MPI job)
> [ec4302-MZ01-CE1-00:56275] 4 more processes have sent help message 
> help-mpi-errors.txt / mpi_errors_are_fatal
> [ec4302-MZ01-CE1-00:56275] Set MCA parameter "orte_base_help_aggregate" 
> to 0 to see all help / error messages
> 
> This only happens, however, when using 24 cores. With 12 cores the code 
> runs as usual, and the same parallelization has not caused any trouble 
> with other software, which suggests to me that the problem is not in 
> MPI itself.
> 
> Best regards,
> Álvaro
> _______________________________________________
> The Quantum ESPRESSO community stands by the Ukrainian
> people and expresses its concerns about the devastating
> effects that the Russian military offensive has on their
> country and on the free and peaceful scientific, cultural,
> and economic cooperation amongst peoples
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (http://www.max-centre.eu/)
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users
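
A side note on the quoted invocation: pw.x also accepts the input file 
through the "-i" option, which avoids relying on the MPI launcher to 
propagate standard input to the processes, and Open MPI's aggregated 
help messages can be disabled to see the error reported by every failing 
rank. A sketch, assuming the same input file and core count as in the 
quoted report:

     mpirun --mca orte_base_help_aggregate 0 -np 24 \
         pw.x -i NiOH3.in > NiOH3.out

The "--mca orte_base_help_aggregate 0" setting comes straight from the 
hint printed in the quoted error output.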

-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216

