[QE-users] Out of Memory on GPU cluster with SOC

Christian Kern christian.kern at uni-graz.at
Wed Sep 10 10:11:29 CEST 2025


Dear all!

I am running a fairly large system (11796 electrons, 784 atoms including 
Au, W and S, unit cell volume ~24200 Angstrom^3) with QE 7.4 on the GPU 
partition of the Leonardo cluster (4 A100 GPUs with 32 CPU cores per 
node). This calculation runs easily (scf convergence in ~4 hours on 12 
nodes), even for larger cells (I tried up to ~1200 atoms), using 
pseudodojo PAW potentials, 30 Ry ecutwfc and the Gamma point only.
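
For reference, a sketch of the scalar-relativistic input I use (prefix, 
pseudo_dir, ecutrho and the species/positions blocks are placeholders 
here, not my actual values):

  &CONTROL
    calculation = 'scf'
    prefix      = 'AuWS'        ! placeholder
    pseudo_dir  = './pseudo'    ! placeholder
  /
  &SYSTEM
    ibrav   = 0
    nat     = 784
    ntyp    = 3
    ecutwfc = 30.0
    ecutrho = 240.0             ! placeholder, chosen for the PAW data sets
  /
  &ELECTRONS
    diagonalization = 'david'
  /
  ATOMIC_SPECIES
    ... (pseudodojo PAW files for Au, W, S)
  CELL_PARAMETERS angstrom
    ...
  ATOMIC_POSITIONS angstrom
    ...
  K_POINTS gamma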

However, now I want to run the same calculation with spin-orbit coupling 
(noncolin=.true., lspinorb=.true., etc.) and QE always fails with memory 
allocation errors. This happens regardless of the number of nodes I use, 
even though the estimated max. dynamic memory reported in the output 
file is below the available memory per MPI process/GPU (64 GB). Running 
on 64 nodes (= 256 MPI processes/GPUs), I can bring this estimated RAM 
requirement down to ~12 GB per GPU. Despite that, I get an error in the 
first step saying that ~16 GB cannot be allocated. Surprisingly, those 
~16 GB are always the same number, regardless of the number of nodes and 
of the estimated memory consumption. I am using the Davidson algorithm 
and NC pseudopotentials with a 60 Ry cutoff here, but I have also tested 
fully relativistic PAW pseudopotentials, lower cutoffs, less vacuum and 
the conjugate-gradient algorithm. None of this helped. What could the 
problem with the memory allocation be here? For smaller systems I have 
no issues with SOC calculations on GPUs...
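
For the SOC runs, the relevant changes in &SYSTEM are essentially these 
(again just a sketch, the rest of the input is as above):

  &SYSTEM
    ibrav    = 0
    nat      = 784
    ntyp     = 3
    ecutwfc  = 60.0             ! NC pseudopotentials
    noncolin = .true.           ! two-component spinors (nspin=4 internally)
    lspinorb = .true.           ! fully relativistic PPs / spin-orbit coupling
  /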

As far as I understand, the maximum number of MPI processes/GPUs for a 
Gamma-point scf calculation is given by the third dimension of the FFT 
mesh, and task-group parallelization is not available on GPUs? In that 
case, the only way to further reduce the memory demand is "-ndiag". In 
my case this prolongs the execution time before the job dies, but I 
still run out of memory, now over ridiculously small amounts (~100 MB). 
I tried up to "-ndiag 64"...
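
For completeness, this is roughly how I submit the jobs (account, 
partition and binding details omitted; the pw.x command line is the 
relevant part, and pw.soc.in stands for my actual input file):

  #!/bin/bash
  #SBATCH --nodes=64
  #SBATCH --ntasks-per-node=4    # one MPI rank per GPU
  #SBATCH --cpus-per-task=8
  #SBATCH --gres=gpu:4

  srun pw.x -ndiag 64 -i pw.soc.in > pw.soc.out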

Looking forward to your suggestions,
Christian Kern

