[QE-users] Out of Memory on GPU cluster with SOC

Paolo Giannozzi paolo.giannozzi at uniud.it
Thu Sep 11 21:11:00 CEST 2025


Spin-orbit calculations should take approx. 4 times the memory of 
unpolarized calculations. Without the input file it is hard to say more.
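
In more detail, a rough bookkeeping of the dominant term, the
wavefunctions (a sketch, not an exact count):

   M_SOC / M_unpolarized ~ 2 (two spinor components per band)
                         x 2 (nbnd ~ nelec instead of nelec/2)
                         = 4

If the unpolarized run also exploited the Gamma-only trick, which as far
as I know is not available for noncolinear calculations, the actual
ratio can be even larger.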

Paolo

On 9/10/2025 10:11 AM, Christian Kern wrote:
> 
> Dear all!
> 
> I am running a fairly large system (11796 electrons, 784 atoms including
> Au, W, and S, unit cell volume ~24200 Angstrom^3) with QE 7.4 on the GPU
> partition of the Leonardo cluster (4 A100 GPUs coupled with 32 cores per
> node). This calculation runs easily (scf convergence in ~4 hours on 12
> nodes), even for larger cells (I tried up to ~1200 atoms), with
> PseudoDojo PAW pseudopotentials and 30 Ry ecutwfc at the Gamma point.
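> 
> For reference, a minimal sketch of this input (placeholder file names
> and paths; the full cell and the 784 atomic positions are omitted):
> 
>    &CONTROL
>       calculation = 'scf'
>       pseudo_dir  = './pseudo/'      ! placeholder path
>    /
>    &SYSTEM
>       ibrav = 0, nat = 784, ntyp = 3
>       ecutwfc = 30.0                 ! Ry
>    /
>    &ELECTRONS
>       diagonalization = 'david'
>    /
>    ATOMIC_SPECIES
>       Au 196.97 Au.paw.upf           ! placeholder PseudoDojo PAW files
>       W  183.84 W.paw.upf
>       S   32.06 S.paw.upf
>    CELL_PARAMETERS angstrom
>       ... (cell of ~24200 Angstrom^3)
>    ATOMIC_POSITIONS angstrom
>       ... (784 positions)
>    K_POINTS gamma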
> 
> However, now I want to run the same calculation with spin-orbit
> coupling (nspin=4, i.e. a noncolinear calculation; the exact switches
> are sketched below), and QE always fails with memory allocation
> errors. This happens regardless of the number of nodes that I use, and
> the estimated max. dynamic memory reported in the output file is below
> the available memory per MPI process/GPU (64 GB). Running on 64 nodes
> (= 256 MPI processes/GPUs), I can get this estimated RAM requirement
> down to ~12 GB per GPU. Despite that, I get an error that ~16 GB cannot
> be allocated in the first step. Surprisingly, those ~16 GB are always
> the same, regardless of the number of nodes and of the estimated memory
> consumption. I am using the Davidson algorithm and norm-conserving
> pseudopotentials with a 60 Ry cutoff here, but I have also tested fully
> relativistic PAW pseudopotentials, lower cutoffs, less vacuum, and the
> conjugate-gradient algorithm. None of this helped. What could be
> causing this memory allocation problem? In smaller systems I have no
> trouble with SOC calculations on GPUs...
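> 
> Concretely, the spin-orbit switches added to &SYSTEM are:
> 
>    &SYSTEM
>       ...
>       noncolin = .true.   ! two-component (spinor) wavefunctions
>       lspinorb = .true.   ! requires fully relativistic pseudopotentials
>    /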
> 
> As far as I understand, the maximum number of MPI processes/GPUs for a
> Gamma-point scf calculation is given by the third dimension of the FFT
> mesh, and task-group parallelization is not available on GPUs? Then the
> only way to further reduce the memory demand is the "-ndiag" option. In
> my case this prolongs the execution time before the job dies, but I am
> still running out of memory, although now over ridiculously small
> amounts (~100 MB). I tried up to "-ndiag 64"...
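> 
> For completeness, the kind of launch line I use (a sketch; the actual
> scheduler options are abbreviated):
> 
>    # 64 nodes x 4 GPUs/node = 256 MPI ranks, one per GPU
>    srun -N 64 --ntasks-per-node=4 pw.x -ndiag 64 -i scf.in > scf.out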
> 
> Looking forward to your suggestions,
> Christian Kern

-- 
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216


