[Q-e-developers] Using MPI_Gather in QE with Derived Datatype

Thomas Markovich thomasmarkovich at gmail.com
Thu Mar 19 13:07:52 CET 2015


Hi All,

I'm currently writing a many-body dispersion vdW energy and gradients module
for QE. I have it working in serial, but it's quite slow, so I figured
parallelizing it was the way to go. Unfortunately, I'm running into an issue
where MPI_GATHER isn't gathering the full vector. I've outlined the key parts
of my code below, and I'd love some feedback or ideas on what to try next.

The time-intensive part of the computation is the construction of many
dipole-dipole tensors. Fortunately, these can be precomputed and stored in
a lookup table, because there are many degeneracies. Naturally, there are
many indexing problems associated with this so, to keep track of
everything, I've constructed a derived data type like so:

    type dtlrdh_lut
        sequence
        integer p
        integer q
        integer ind
        real(dp), dimension(3, 3) :: TLR
        real(dp), dimension(3, 3, 3, 3) :: dTLRdh
    end type dtlrdh_lut

Note: TLR is the long-range dipole tensor, dTLRdh is its derivative with
respect to the cell parameters, p and q identify the atom pair, and ind is a
unique identifier. I defined my variables in my subroutine like so:

        type(dtlrdr_lut), dimension(:), allocatable :: my_dtlrdr, collected_dtlrdr
        type(dtlrdh_lut), dimension(:), allocatable :: my_dtlrdh, collected_dtlrdh
        integer :: dh_dtype, dr_dtype, dh_types(5), dr_types(6), dh_blocks(5), dr_blocks(6)
        INTEGER(kind=MPI_ADDRESS_KIND) :: dh_offsets(5), dr_offsets(6)
        integer :: numtasks, rank, ierr, dh_displs(nproc_image), dr_displs(nproc_image)
        integer :: n, status(mpi_status_size)

I split up the work between processes in another routine, then count how many
elements of the lookup table need to be computed on this particular process
and allocate the local lookup table accordingly:
    my_num_pairs = 0
    do i = 1, num_pairs, 1
        if(unique_pairs(i)%cpu.eq.me_image) then
            my_num_pairs = my_num_pairs + 1
        end if
    end do
    if(.not.allocated(my_dtlrdh)) allocate(my_dtlrdh(my_num_pairs))

I also allocate and zero out the lookup table that everything will be combined
back into, with the following code:

    if(me_image.eq.root_image) then
        if(.not.allocated(collected_dtlrdh)) allocate(collected_dtlrdh(num_pairs))
        do i = 1, my_num_pairs, 1
            collected_dtlrdh(i)%p = 0
            collected_dtlrdh(i)%q = 0
            collected_dtlrdh(i)%ind = 0
            collected_dtlrdh(i)%TLR = 0.0_DP
            collected_dtlrdh(i)%dTLRdh = 0.0_DP
        end do
    end if

I then fill in the lookup table; I'll skip that code since it's long and not
relevant here. With that done, it's time to gather everything back onto the
root process.

    call mpi_get_address(my_dtlrdh(1)%p,               dh_offsets(1), ierr)
    call mpi_get_address(my_dtlrdh(1)%q,               dh_offsets(2), ierr)
    call mpi_get_address(my_dtlrdh(1)%ind,             dh_offsets(3), ierr)
    call mpi_get_address(my_dtlrdh(1)%TLR(1,1),        dh_offsets(4), ierr)
    call mpi_get_address(my_dtlrdh(1)%dTLRdh(1,1,1,1), dh_offsets(5), ierr)
    do i = 2, size(dh_offsets)
      dh_offsets(i) = dh_offsets(i) - dh_offsets(1)
    end do
    dh_offsets(1) = 0
    dh_types = (/ MPI_INTEGER, MPI_INTEGER, MPI_INTEGER, &
                  MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION /)
    dh_blocks = (/1, 1, 1, 3*3, 3*3*3*3/)
    call mpi_type_struct(5, dh_blocks, dh_offsets, dh_types, dh_dtype, ierr)
    call mpi_type_commit(dh_dtype, ierr)
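
For what it's worth, I've also been wondering whether I should switch to the
non-deprecated mpi_type_create_struct and explicitly resize the type so its
MPI extent matches the real spacing between array elements (in case padding
makes the default extent wrong). This is only an untested sketch of what I
have in mind, using the address-difference trick; it assumes my_dtlrdh has at
least two elements:

    ! Untested sketch: build the type with mpi_type_create_struct, then force
    ! its extent to the actual spacing between consecutive array elements.
    ! (These declarations would go with the others at the top of the subroutine.)
    integer :: dh_tmp_dtype
    integer(kind=MPI_ADDRESS_KIND) :: base_addr, next_addr, lb, extent

    call mpi_type_create_struct(5, dh_blocks, dh_offsets, dh_types, dh_tmp_dtype, ierr)
    call mpi_get_address(my_dtlrdh(1), base_addr, ierr)
    call mpi_get_address(my_dtlrdh(2), next_addr, ierr)
    lb = 0
    extent = next_addr - base_addr
    call mpi_type_create_resized(dh_tmp_dtype, lb, extent, dh_dtype, ierr)
    call mpi_type_commit(dh_dtype, ierr)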
I then call gather via:

    call mpi_gather(my_dtlrdh, my_num_pairs, dh_dtype, &
                     collected_dtlrdh, my_num_pairs, dh_dtype, &
                     root_image, intra_image_comm, ierr)
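
Side note: since my_num_pairs will generally differ between ranks, I suspect
I'll eventually need mpi_gatherv with per-rank counts and the dh_displs array
I declared above. My rough, untested idea of that (dh_counts is a new array
I'd have to add) looks like:

    ! Untested sketch: gather the per-rank counts, build displacements on the
    ! root, then gather the variable-length pieces with mpi_gatherv.
    integer :: dh_counts(nproc_image)

    call mpi_gather(my_num_pairs, 1, MPI_INTEGER, dh_counts, 1, MPI_INTEGER, &
                    root_image, intra_image_comm, ierr)
    if(me_image.eq.root_image) then
        dh_displs(1) = 0
        do i = 2, nproc_image, 1
            dh_displs(i) = dh_displs(i-1) + dh_counts(i-1)
        end do
    end if
    call mpi_gatherv(my_dtlrdh, my_num_pairs, dh_dtype, &
                     collected_dtlrdh, dh_counts, dh_displs, dh_dtype, &
                     root_image, intra_image_comm, ierr)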
After I gather, I can then print out what everything should look like:

    do i = 1, num_pairs, 1
        write(stdout, *) my_dtlrdh(i)%p, collected_dtlrdh(i)%p, &
                         my_dtlrdh(i)%q, collected_dtlrdh(i)%q
    end do
and this is where I see really strange information. The first few elements
that are printed out look fine:

       1           1           3           3
       1           1           6           6
       1           1           9           9
But the tail end of my vector looks as though I only sent 174 elements instead
of the full 194:

      17           0          24           0
      18           0          19           0
      18           0          20           0
      18           0          21           0
      18           0          22           0
Given that there are 194 pairs, and both num_pairs and my_num_pairs equal
194, I'm confused. I went ahead and started to play around in desperation,
and found that if I use this gather call instead of the one above, I get
the full vector:

    num_pairs = 2*num_pairs+40
    call mpi_gather(my_dtlrdh, num_pairs, dh_dtype, &
                     collected_dtlrdh, num_pairs, dh_dtype, &
                     root_image, intra_image_comm, ierr)
which I found by pure guess-and-check. While this may work for this small test
system, it hardly seems like a scalable solution.
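
One diagnostic I'm planning to run (sketch below, not tried yet, and related to
the resizing idea above) is to compare the extent MPI reports for my committed
datatype against the actual spacing between consecutive array elements, since a
too-small extent would make the gathered elements land on top of each other:

    ! Untested sketch: print the committed type's extent next to the actual
    ! stride between consecutive elements of the lookup-table array.
    integer(kind=MPI_ADDRESS_KIND) :: lb, type_extent, addr1, addr2

    call mpi_type_get_extent(dh_dtype, lb, type_extent, ierr)
    call mpi_get_address(my_dtlrdh(1), addr1, ierr)
    call mpi_get_address(my_dtlrdh(2), addr2, ierr)
    write(stdout, *) 'MPI extent =', type_extent, '  array stride =', addr2 - addr1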

Some things I have tried: Previously, I only had 87 nonzero elements after
the gather... Part of the problem here was that I was using mpi_real in the
dh_types instead of mpi_double_precision. The fact that I'm still missing
20 elements implies to me that I have another wrong mpi type. Is
mpi_double_precision the proper mpi_type for real(dp)? Is there something
else wrong here? And more generally, looking through the interface file in
mp.f90, it appears that there are no interfaces for arbitrary derived types. Is
that true, or is there a recommended way to do this with only the built-in
interfaced methods?

Anyway, I'm totally at a loss... Any ideas? And please, let me know if you
guys need any more information from me.

Best,
Thomas