[QE-developers] Understanding parallelized distribution of real space data
James Telzrow
jst94 at case.edu
Sun Aug 25 21:10:15 CEST 2024
Hello,
I’m trying to understand how real-space arrays are distributed across
processors in parallel pw.x calculations, and how the per-process arrays
can be mapped back into the global, full-size array. I’m running a simple
SCF calculation, specifying a 112 by 112 by 112 point FFT grid.
I can see by checking the values of my_nr3p, my_nr2p, my_i0r3p and my_i0r2p
within fft_type_descriptor that the YZ plane is being split into 10 blocks.
Processors one and two are assigned blocks containing 12 elements along the
z-axis and all 112 elements along the y-axis, with z-axis offsets of 0 and
12, respectively. Processors three through ten are assigned blocks
containing 11 elements along the z-axis and 112 elements along the y-axis,
with z-axis offsets of 24, 35, 46, et cetera.
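For reference, the small standalone sketch below reproduces the block sizes
and offsets I observe when 112 z planes are divided among 10 processors, with
the remainder planes going to the lowest-ranked processors. The variable names
mirror the fft_type_descriptor fields, but I am not claiming this is literally
how FFTXlib computes them:

  program plane_distribution
    implicit none
    ! Standalone sketch (not QE code): distribute nr3 z planes over nproc
    ! processors, giving the remainder to the lowest-ranked processors.
    integer, parameter :: nr3 = 112, nproc = 10
    integer :: me, nr3p, i0r3p
    do me = 0, nproc - 1
       nr3p = nr3 / nproc
       if (me < mod(nr3, nproc)) nr3p = nr3p + 1
       ! offset = total number of planes owned by lower-ranked processors
       i0r3p = me * (nr3 / nproc) + min(me, mod(nr3, nproc))
       print '(a,i2,a,i3,a,i3)', 'proc ', me + 1, ': my_nr3p = ', nr3p, &
            ', my_i0r3p = ', i0r3p
    end do
  end program plane_distribution

This prints sizes of 12, 12, 11, 11, ... with offsets 0, 12, 24, 35, ...,
matching what I see in the descriptor.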
Initially, I thought that reorganizing these per-processor arrays into the
112 by 112 by 112 global array would be simple: processor one has every
element with a z-index in the range [1, 12], processor three has every
element with a z-index in the range [25, 35], et cetera. However, on each
processor, I see that the nnr property within fft_type_descriptor has a
value of 150,528. Additionally, I see that this value is actually used to
allocate memory for per-processor real space data, such as within the
create_scf_type subroutine. On processors one and two, I do expect nnr to
have this value of 150,528 = 12 * 112 * 112. However, on processors three
through ten, I would instead expect to see 11 * 112 * 112 = 137,984.
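For concreteness, this is the local-to-global index mapping I had been
assuming (x fastest, then y, then the locally owned z planes); the layout,
and what the trailing nnr - my_nr3p * 112 * 112 elements are used for, is
exactly what I am unsure about:

  subroutine local_to_global(ir, nr1, nr2, my_i0r3p, ix, iy, iz)
    ! Sketch of the mapping I had assumed: the first my_nr3p*nr2*nr1 elements
    ! of the local array hold the owned planes with x fastest, then y, then z.
    ! Argument names mirror fft_type_descriptor fields; the layout itself is
    ! my assumption, not something I have verified in the source.
    implicit none
    integer, intent(in)  :: ir          ! local index, 1 .. my_nr3p*nr2*nr1
    integer, intent(in)  :: nr1, nr2    ! global grid dimensions along x and y
    integer, intent(in)  :: my_i0r3p    ! z-plane offset of this processor
    integer, intent(out) :: ix, iy, iz  ! 1-based indices in the global grid
    integer :: idx
    idx = ir - 1
    ix  = mod(idx, nr1) + 1
    iy  = mod(idx / nr1, nr2) + 1
    iz  = idx / (nr1 * nr2) + 1 + my_i0r3p
  end subroutine local_to_global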
Given the extra data in these arrays, how should they be arranged in order
to obtain the global 112 by 112 by 112 array?
Thank you,
James Telzrow