[QE-users] Out of Memory on GPU cluster with SOC
Christian Kern
christian.kern at uni-graz.at
Wed Sep 10 10:11:29 CEST 2025
Dear all!
I am running a fairly large system (11796 electrons, 784 atoms including
Au, W, and S, unit-cell volume ~24200 Angstrom^3) with QE 7.4 on the GPU
partition of the Leonardo cluster (4 A100 GPUs coupled to 32 cores per
node). This calculation runs without problems (scf convergence in ~4 hours
on 12 nodes), even for larger cells (I tried up to ~1200 atoms), using
PseudoDojo PAW potentials and a 30 Ry ecutwfc at the Gamma point only.
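Schematically, the working (scalar-relativistic) setup looks like this
(only the flags relevant here; the full input of course also contains the
cell, atomic species/positions, pseudopotential file names, etc.):

  &SYSTEM
    ecutwfc = 30                ! PseudoDojo PAW
    ...
  /
  &ELECTRONS
    diagonalization = 'david'   ! default Davidson
  /
  K_POINTS gamma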
However, I now want to run the same calculation with spin-orbit coupling
(noncolin=.true., lspinorb=.true., etc.), and QE always fails with memory
allocation errors. This happens regardless of the number of nodes I use,
even though the estimated max. dynamic memory reported in the output file
is below the available memory per MPI process/GPU (64 GB). Running on 64
nodes (= 256 MPI processes/GPUs), I can bring this estimate down to ~12 GB
per GPU. Despite that, I get an error in the first step saying that ~16 GB
cannot be allocated. Surprisingly, those ~16 GB are always the same number,
regardless of the number of nodes I use and of the estimated memory
consumption. Here I am using the Davidson algorithm and NC pseudopotentials
with a 60 Ry cutoff, but I have also tested relativistic PAW
pseudopotentials, lower cutoffs, less vacuum, and the conjugate-gradient
algorithm; none of this helped. What could be causing this memory
allocation problem? For smaller systems I have no issues with SOC
calculations on GPUs...
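For the SOC runs, the changes relative to the sketch above are essentially
the following (again schematic; pseudopotential file names etc. omitted):

  &SYSTEM
    noncolin = .true.
    lspinorb = .true.
    ecutwfc  = 60               ! relativistic NC pseudopotentials
    ...
  /
  &ELECTRONS
    diagonalization = 'david'   ! also tried 'cg'
  /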
As far as I understand, the maximum number of MPI processes/GPUs for a
Gamma-point scf calculation is given by the third dimension of the FFT
mesh, and task-group parallelization is not available on GPUs? In that
case, the only remaining way to reduce the memory demand is "-ndiag". For
me, this prolongs the execution time before the job dies, but I still run
out of memory, now over ridiculously small amounts (~100 MB). I tried up
to "-ndiag 64"...
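Concretely, the launch line for the 64-node attempt is essentially the
following (modules, binding, and GPU-MPI mapping omitted; the input file
name is just a placeholder):

  # 64 nodes x 4 GPUs = 256 MPI tasks, one per GPU
  srun -n 256 pw.x -ndiag 64 -input scf.in > scf.out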
Looking forward to your suggestions,
Christian Kern