[Q-e-developers] Improvements to the EXX Implementation

Tue Apr 5 06:05:34 CEST 2016

Dear All,

   I wanted to inform everyone about some improvements that we have been
making at LBNL to the implementation of exact exchange in QE.  These
improvements have been made as part of NERSC's Exascale Science
Applications Program, which is an effort to update codes for execution on
next-generation architectures such as NERSC's upcoming Cori Phase II.  The
following is a brief overview of these changes, which we are currently in
the process of testing and debugging.  Depending on our progress, we intend
to submit these changes as an addition to either QE 5.4 or 6.0.

1. Parallelization Over Band Pairs
   We have extended the parallelization of subroutine vexx_k such that both
of the loops over bands (i.e., "LOOP_ON_PSI_BANDS" and "IBND_LOOP_K") are
parallelized with respect to band groups.  This improves load balancing,
and also enables parallelization using larger numbers of band groups than
was previously possible.
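
   As a rough illustration of why distributing band pairs helps, here is a
minimal Python sketch. It is not code from QE: the function names, the
round-robin assignment, and the assumption that only the inner loop was
split before are all hypothetical and chosen only to make the load-balancing
argument concrete.

```python
from itertools import product


def pairs_for_group(n_outer, n_inner, n_groups, group_id):
    """Band pairs handled by one band group (hypothetical round-robin rule)."""
    all_pairs = product(range(n_outer), range(n_inner))
    return [pair for idx, pair in enumerate(all_pairs)
            if idx % n_groups == group_id]


def max_load_single_loop(n_outer, n_inner, n_groups):
    """Busiest group's pair count when only the inner band loop is split."""
    inner_per_group = -(-n_inner // n_groups)  # ceiling division
    return n_outer * inner_per_group


def max_load_pairs(n_outer, n_inner, n_groups):
    """Busiest group's pair count when all band pairs are split."""
    return -(-(n_outer * n_inner) // n_groups)  # ceiling division


# 4 x 4 bands on 6 band groups: splitting one loop leaves 2 groups idle
# and gives the busiest group 4 pairs; splitting pairs gives it 3.
single = max_load_single_loop(4, 4, 6)
paired = max_load_pairs(4, 4, 6)
```

   In this toy case the number of useful band groups is no longer capped by
the number of bands in a single loop, which is the effect described above.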

2. Improved OMP Support
   We have added OMP threading to numerous vector operations within
exx.f90.  In addition, we have given special priority to enhancing the
threaded performance of the FFTs.

3. Implementation of Different and Interchangeable Data Layouts for Local
and EXX Portions of the Calculation
   One observation that we have made is that for calculations that utilize
many band groups, the local portion of the calculation (i.e., everything
outside of exx.f90) often represents a non-negligible (or even dominant)
contribution to the total cost of the calculation.  This is largely because
the local portion of the calculation is duplicated on each band group.  We
have implemented changes to the code that allow the local portion of the
code to be parallelized in a manner that is independent of the number of
band groups, thus avoiding duplication of work.
   This is the single most significant modification that we have made, both
in terms of the efficiency gain for QE and the amount of coding work
required.  For several test calculations, we find that this change yields a
speedup of more than a factor of two.
   In terms of code development, the primary challenge of our approach is
that when the EXX part of the calculation is performed (such as when vexx
is called), we must change the data structure from the one that is used by
the local portion of the code to a different data structure that is used by
the EXX portion of the code.  This change of data structure requires a
great deal of bookkeeping in order to update arrays like igk, ig_l2g, psi,
and hpsi.  As a result, we are still in the process of making our updated
code compatible with gamma-point-only calculations and with calculations
that employ multiple k-points.
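
   To give a sense of the bookkeeping involved, here is a minimal
single-process Python sketch of moving an array between two block
distributions by means of local-to-global index maps. It is not the QE
implementation: the helper names (block_l2g, scatter, relayout) and the
contiguous block distribution are assumptions made for illustration, and a
real code would exchange the data with MPI rather than assemble a global
array.

```python
def block_l2g(n_global, n_parts, part):
    """Global indices owned by `part` under a contiguous block distribution."""
    base, extra = divmod(n_global, n_parts)
    start = part * base + min(part, extra)
    size = base + (1 if part < extra else 0)
    return list(range(start, start + size))


def scatter(global_vec, n_parts):
    """Split a global array into per-process chunks."""
    n_global = len(global_vec)
    return [[global_vec[g] for g in block_l2g(n_global, n_parts, p)]
            for p in range(n_parts)]


def relayout(chunks_src, n_global, n_parts_dst):
    """Move data from one block layout to another via the index maps."""
    n_parts_src = len(chunks_src)
    global_vec = [None] * n_global
    for p, chunk in enumerate(chunks_src):
        for local_i, g in enumerate(block_l2g(n_global, n_parts_src, p)):
            global_vec[g] = chunk[local_i]
    return scatter(global_vec, n_parts_dst)


# Toy case: 10 coefficients spread over 4 processes in the "local" layout,
# then over the 2 processes of one band group in the "EXX" layout.
psi = list(range(10))
local_layout = scatter(psi, 4)
exx_layout = relayout(local_layout, len(psi), 2)
round_trip = relayout(exx_layout, len(psi), 4)
```

   Every array that depends on the distribution (here only the toy psi)
needs the same treatment each time the layout changes, which is where most
of the bookkeeping comes from.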

Sincerely,
Dr. Taylor Barnes
Postdoctoral Scholar,
Lawrence Berkeley National Laboratory