[Q-e-developers] Improvements to the EXX Implementation
Carlo Cavazzoni
c.cavazzoni at cineca.it
Wed Apr 6 17:36:01 CEST 2016
Dear Taylor,
dear all,
since we at Cineca are also interested in optimizing the EXX computation,
I would like to know which input set you have used or, alternatively,
to agree on a new, non-trivial input set to share among all those
interested in profiling and benchmarking EXX computations.
I am not an expert in EXX myself, so I would prefer to receive
a meaningful, non-trivial input set from someone who understands
this extension.
best,
carlo
On 05/04/2016 06:05, Taylor Barnes wrote:
> Dear All,
>
> I wanted to inform everyone about some improvements that we have
> been making at LBNL to the implementation of exact exchange in QE.
> These improvements have been made as part of NERSC's Exascale Science
> Applications Program, which is an effort to update codes for execution
> on next-generation architectures such as NERSC's upcoming Cori Phase
> II. The following is a brief overview of these changes, which we are
> currently in the process of testing and debugging. Depending on our
> progress, we intend to submit these changes as an addition to either
> QE 5.4 or 6.0.
>
> 1. Parallelization Over Band Pairs
> We have extended the parallelization of subroutine vexx_k such that
> both of the loops over bands (i.e., "LOOP_ON_PSI_BANDS" and
> "IBND_LOOP_K") are parallelized with respect to band groups. This
> improves load balancing, and also enables parallelization using larger
> numbers of band groups than was previously possible.
>
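As an illustration of why distributing band pairs helps, here is a minimal Python sketch. It is not the QE implementation: the function names, the round-robin assignment, and the band counts are assumptions chosen only to show the load-balancing effect described in item 1.

```python
from itertools import product


def pair_counts_one_loop(n_psi_bands, n_occ_bands, n_groups):
    """Pairs handled per band group when only the outer band loop is split.

    Hypothetical scheme: outer bands are dealt round-robin to the groups,
    and each group runs the full inner loop for the bands it owns.
    """
    return [
        len(range(g, n_psi_bands, n_groups)) * n_occ_bands
        for g in range(n_groups)
    ]


def distribute_pairs(n_psi_bands, n_occ_bands, n_groups):
    """Deal every (outer band, inner band) pair round-robin to the groups."""
    pairs = list(product(range(n_psi_bands), range(n_occ_bands)))
    return [pairs[g::n_groups] for g in range(n_groups)]


if __name__ == "__main__":
    # Assumed toy sizes: 4 outer bands, 4 inner bands, 8 band groups.
    print(pair_counts_one_loop(4, 4, 8))
    print([len(chunk) for chunk in distribute_pairs(4, 4, 8)])
```

With these toy sizes, splitting only the outer loop leaves half of the band groups without work, whereas splitting the pairs gives every group the same number of pairs and allows more band groups than there are bands in either loop.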
> 2. Improved OMP Support
> We have added OMP threading to numerous vector operations within
> exx.f90. In addition, we have given special priority to enhancing the
> threaded performance of the FFTs.
>
> 3. Implementation of Different and Interchangeable Data Layouts for
> Local and EXX Portions of the Calculation
> One observation that we have made is that for calculations that
> utilize many band groups, the local portion of the calculation (i.e.,
> everything outside of exx.f90) often represents a non-negligible (or
> even dominant) contribution to the total cost of the calculation.
> This is largely because the local portion of the calculation is
> duplicated on each band group. We have implemented changes to the
> code that allow the local portion of the code to be parallelized in a
> manner that is independent of the number of band groups, thus avoiding
> duplication of work.
> This is the single most significant modification that we have made,
> both in terms of the efficiency gain for QE and in terms of the
> amount of coding work required. For several test calculations we are
> finding that this change results in more than a factor of two speedup.
> In terms of code development, the primary challenge of our approach
> is that when the EXX part of the calculation is performed (such as
> when vexx is called), we must change the data structure from the one
> that is used by the local portion of the code to a different data
> structure that is used by the EXX portion of the code. This change of
> data structure requires a great deal of bookkeeping in order to update
> arrays like igk, ig_l2g, psi, and hpsi. As a result, we are still in the
> process of making our updated code compatible with gamma-point only
> calculations and with calculations that employ multiple k-points.
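The cost argument in item 3 can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not a measurement or the actual QE data layout: it supposes the local work divides perfectly among the processes that share it and ignores the cost of switching between the two data structures.

```python
def local_work_per_process(total_local_work, n_procs, n_band_groups, duplicated):
    """Toy estimate of the local (non-EXX) work each process performs.

    duplicated=True  : every band group repeats the whole local part, so
                       only the processes inside one group share it.
    duplicated=False : the local part is spread over all processes,
                       independently of the number of band groups.
    """
    if n_procs % n_band_groups != 0:
        raise ValueError("n_procs must be a multiple of n_band_groups")
    if duplicated:
        procs_sharing = n_procs // n_band_groups
    else:
        procs_sharing = n_procs
    return total_local_work / procs_sharing


if __name__ == "__main__":
    # Assumed toy numbers: 1024 units of local work, 64 processes, 8 band groups.
    print(local_work_per_process(1024.0, 64, 8, duplicated=True))
    print(local_work_per_process(1024.0, 64, 8, duplicated=False))
```

In this idealized model the per-process local work under duplication grows in proportion to the number of band groups, which is consistent with the observation that the local portion can become dominant when many band groups are used.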
>
> Sincerely,
> Dr. Taylor Barnes
> Postdoctoral Scholar,
> Lawrence Berkeley National Laboratory
>
>
--
Ph.D. Carlo Cavazzoni
SuperComputing Applications and Innovation Department
CINECA - Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)
Tel: +39 051 6171411 Fax: +39 051 6132198
www.cineca.it