[Q-e-developers] Improvements to the EXX Implementation

Wed Apr 6 17:36:01 CEST 2016

Dear Taylor,
dear all,

since we at Cineca are also interested in optimizing the EXX computation,
I would like to know which input set you used or, alternatively,
to agree on a new, non-trivial input set to share among all those
interested in profiling and benchmarking EXX computations.
I am not an EXX expert myself, so I would prefer to receive
a meaningful, non-trivial input set from someone who understands
this extension.

best,
carlo

On 05/04/2016 06:05, Taylor Barnes wrote:
> Dear All,
>
>    I wanted to inform everyone about some improvements that we have 
> been making at LBNL to the implementation of exact exchange in QE.  
> These improvements have been made as part of NERSC's Exascale Science 
> Applications Program, which is an effort to update codes for execution 
> on next-generation architectures such as NERSC's upcoming Cori Phase 
> II.  The following is a brief overview of these changes, which we are 
> currently in the process of testing and debugging.  Depending on our 
> progress, we intend to submit these changes as an addition to either 
> QE 5.4 or 6.0.
>
> 1. Parallelization Over Band Pairs
>    We have extended the parallelization of subroutine vexx_k such that 
> both of the loops over bands (i.e., "LOOP_ON_PSI_BANDS" and 
> "IBND_LOOP_K") are parallelized with respect to band groups.  This 
> improves load balancing, and also enables parallelization using larger 
> numbers of band groups than was previously possible.
>
> 2. Improved OMP Support
>    We have added OMP threading to numerous vector operations within 
> exx.f90.  In addition, we have given special priority to enhancing the 
> threaded performance of the FFTs.
>
> 3. Implementation of Different and Interchangeable Data Layouts for 
> Local and EXX Portions of the Calculation
>    One observation that we have made is that for calculations that 
> utilize many band groups, the local portion of the calculation (i.e., 
> everything outside of exx.f90) often represents a non-negligible (or 
> even dominant) contribution to the total cost of the calculation.  
> This is largely because the local portion of the calculation is 
> duplicated on each band group.  We have implemented changes to the 
> code that allow the local portion of the code to be parallelized in a 
> manner that is independent of the number of band groups, thus avoiding 
> duplication of work.
>    This is the single most significant modification that we have made, 
> both in terms of increasing the efficiency of QE and in terms of the 
> amount of coding work required.  For several test calculations we are 
> finding that this change results in more than a factor of two speedup.
>    In terms of code development, the primary challenge of our approach 
> is that when the EXX part of the calculation is performed (such as 
> when vexx is called), we must change the data structure from the one 
> that is used by the local portion of the code to a different data 
> structure that is used by the EXX portion of the code.  This change of 
> data structure requires a great deal of bookkeeping in order to update 
> arrays like igk, ig_l2g, psi, and hpsi.  As a result, we are still in the 
> process of making our updated code compatible with gamma-point only 
> calculations and with calculations that employ multiple k-points.
>
> Sincerely,
> Dr. Taylor Barnes
> Postdoctoral Scholar,
> Lawrence Berkeley National Laboratory
>
>
> _______________________________________________
> Q-e-developers mailing list
> Q-e-developers at qe-forge.org
> http://qe-forge.org/mailman/listinfo/q-e-developers

-- 
Ph.D. Carlo Cavazzoni
SuperComputing Applications and Innovation Department
CINECA - Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)
Tel: +39 051 6171411  Fax: +39 051 6132198
www.cineca.it
