<div dir="ltr">Hi All,<div><br></div><div>I'm currently working to write a many body dispersion vdW energy and gradients module into QE, and I have it working in serial but it's quite slow so I figured parallelizing it was the way to go. Unfortunately, I'm running into an issue where MPI_GATHER isn't syncing the full vector. I've outlined the key parts of my code below, and I'd love some feedback or ideas of next things to try.</div><div><br></div><div>The time-intensive part of the computation is the construction of many dipole-dipole tensors. Fortunately, these can be precomputed and stored in a lookup table, because there are many degeneracies. Naturally, there are many indexing problems associated with this so, to keep track of everything, I've constructed a derived data type like so:</div><div><br></div><div><div>    type dtlrdh_lut</div><div>        sequence</div><div>        integer p</div><div>        integer q</div><div>        integer ind</div><div>        real(dp), dimension(3, 3) :: TLR</div><div>        real(dp), dimension(3, 3, 3, 3) :: dTLRdh</div><div>    end type dtlrdh_lut</div></div><div><br></div><div>note, TLR is the long range dipole tensor, dTLRdh is its derivative with respect to cell parameters, p and q are atom parameters, and ind is the unique identifier. I defined my variables in my subroutine like so:</div><div><br></div><div><div>        type(dtlrdr_lut), dimension(:), allocatable :: my_dtlrdr, collected_dtlrdr</div><div>        type(dtlrdh_lut), dimension(:), allocatable :: my_dtlrdh, collected_dtlrdh</div></div><div><div>        integer :: dh_dtype, dr_dtype, dh_types(5), dr_types(6), dh_blocks(5), dr_blocks(6)</div><div>        INTEGER(kind=MPI_ADDRESS_KIND) :: dh_offsets(5), dr_offsets(6)</div><div>        integer :: numtasks, rank, ierr, dh_displs(nproc_image), dr_displs(nproc_image)</div><div>        integer :: n, status(mpi_status_size)</div></div><div><br></div><div>I split up the work between processes in another method and then count the number of elements of the lookup table need to be computed and allocated the local lookup tables on this specific node like so:</div><div><div> my_num_pairs = 0</div><div>    do i = 1, num_pairs, 1</div><div>        if(unique_pairs(i)%cpu.eq.me_image) then</div><div>            my_num_pairs = my_num_pairs + 1</div><div>            end if</div><div>        end do</div><div>    if(.not.allocated(my_dtlrdh))    allocate(my_dtlrdh(my_num_pairs))</div><div>I also allocate and zero out the lookup table that everything will be combined back into with the following code: if(me_image.eq.root_image) then if(.not.allocated(collected_dtlrdh)) allocate(collected_dtlrdh(num_pairs))</div><div><br></div><div>        do i=1,my_num_pairs,1</div><div>            collected_dtlrdh(i)%p = 0</div><div>            collected_dtlrdh(i)%q = 0</div><div>            collected_dtlrdh(i)%ind = 0</div><div>            collected_dtlrdh(i)%TLR = 0.0_DP</div><div>            collected_dtlrdh(i)%dTLRdh = 0.0_DP</div><div>            end do</div><div>        end if</div><div>I then fill in the lookup table, but I'll skip that code. It's long and not relevant. With this done, it's time to start the MPI process to gather all back on the root process.</div><div><br></div><div>    call mpi_get_address(my_dtlrdh(1)%p,               dh_offsets(1), ierr)</div><div>    call mpi_get_address(my_dtlrdh(1)%q,               dh_offsets(2), ierr)</div><div>    call mpi_get_address(my_dtlrdh(1)%ind,             dh_offsets(3), ierr)</div><div>    call mpi_get_address(my_dtlrdh(1)%TLR(1,1),        dh_offsets(4), ierr)</div><div>    call mpi_get_address(my_dtlrdh(1)%dTLRdh(1,1,1,1), dh_offsets(5), ierr)</div><div>    do i = 2, size(dh_offsets)</div><div>      dh_offsets(i) = dh_offsets(i) - dh_offsets(1)</div><div>    end do</div><div>    dh_offsets(1) = 0</div><div>    dh_types = (/MPI_INTEGER, MPI_INTEGER, MPI_INTEGER, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION/)</div><div>    dh_blocks = (/1, 1, 1, 3*3, 3*3*3*3/)</div><div>    call mpi_type_struct(5, dh_blocks, dh_offsets, dh_types, dh_dtype, ierr)</div><div>    call mpi_type_commit(dh_dtype, ierr)</div><div>I then call gather via:</div><div><br></div><div>    call mpi_gather(my_dtlrdh, my_num_pairs, dh_dtype, &</div><div>                     collected_dtlrdh, my_num_pairs, dh_dtype, &</div><div>                     root_image, intra_image_comm, ierr)</div><div>After I gather, I can then print out what everything should look like:</div><div><br></div><div>    do i = 1, num_pairs, 1</div><div>        write(stdout, *) my_dtlrdh(i)%p, collected_dtlrdh(i)%p, my_dtlrdh(i)%q, collected_dtlrdh(i)%q</div><div>        end do</div><div>and this is where I see really strange information. The first few elements that are printed out look fine:</div><div><br></div><div>       1           1           3           3</div><div>       1           1           6           6</div><div>       1           1           9           9</div><div>But the tail end of my vector looks like where I only send 174 elements instead of the full 194:</div><div><br></div><div>      17           0          24           0</div><div>      18           0          19           0</div><div>      18           0          20           0</div><div>      18           0          21           0</div><div>      18           0          22           0</div><div>Given that there are 194 pairs, and both num_pairs and my_num_pairs equal 194, I'm confused. I went ahead and started to play around in desperation, and found that if I use this gather call instead of the one above, I get the full vector:</div><div><br></div><div>    num_pairs = 2*num_pairs+40</div><div>    call mpi_gather(my_dtlrdh, num_pairs, dh_dtype, &</div><div>                     collected_dtlrdh, num_pairs, dh_dtype, &</div><div>                     root_image, intra_image_comm, ierr)</div><div>which I found by just guess and check. While this may work for this small test system, it hardly seems like a scalable solution.</div><div><br></div><div>Some things I have tried: Previously, I only had 87 nonzero elements after the gather... Part of the problem here was that I was using mpi_real in the dh_types instead of mpi_double_precision. The fact that I'm still missing 20 elements implies to me that I have another wrong mpi type. Is mpi_double_precision the proper mpi_type for real(dp)? Is there something else wrong here? And more generally, looking through the interface file in mp.f90, it appears that there's no interfaces for an arbitrary type. Is this true, or is there a recommended way to use this with only the in-built interfaced methods?</div><div><br></div><div>Anyway, I'm totally at a loss... Any ideas? And please, let me know if you guys need any more information from me.</div></div><div><br></div><div>Best,</div><div>Thomas</div><div><br></div></div>