<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">Dear Taylor,<br>

      dear all,<br>

      <br>

      since we at CINECA are also interested in optimizing the EXX computation,<br>
      I would like to know which input set you used or, alternatively,<br>
      to agree on a new, non-trivial input set to share among everyone<br>
      interested in profiling and benchmarking EXX computations.<br>
      I am not an EXX expert myself, so I would prefer to receive<br>
      a meaningful, non-trivial input set from someone who understands<br>
      this extension.<br>

      <br>

      best,<br>

      carlo<br>

      <br>

      On 05/04/2016 06:05, Taylor Barnes wrote:<br>

    </div>

    <blockquote

cite="mid:CAPD_qRmr070PMobmVXwHeoDQEd+8GoKUFjHUF5=1hgruLTUQPQ@mail.gmail.com"

      type="cite">

      <div dir="ltr">Dear All,<br>

        <br>

           I wanted to inform everyone about some improvements that we

        have been making at LBNL to the implementation of exact exchange

        in QE.  These improvements have been made as part of NERSC's

        Exascale Science Applications Program, which is an effort to

        update codes for execution on next-generation architectures such

        as NERSC's upcoming Cori Phase II.  The following is a brief

        overview of these changes, which we are currently in the process

        of testing and debugging.  Depending on our progress, we intend

        to submit these changes as an addition to either QE 5.4 or 6.0.<br>

        <br>

        1. Parallelization Over Band Pairs<br>

           We have extended the parallelization of subroutine vexx_k

        such that both of the loops over bands (i.e.,

        "LOOP_ON_PSI_BANDS" and "IBND_LOOP_K") are parallelized with

        respect to band groups.  This improves load balancing, and also

        enables parallelization using larger numbers of band groups than

        was previously possible.<br>
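        <br>
        As a minimal sketch of why distributing both band loops helps, the
        following compares two assignments of band pairs to band groups.
        The function names, the round-robin assignment, and the choice of
        which loop was distributed before are assumptions made only for
        illustration; they are not taken from vexx_k itself.<br>
        <pre wrap="">
```python
from itertools import product


def pairs_outer_only(n_outer, n_inner, n_groups):
    """Hypothetical old scheme: whole outer-band iterations go to groups round-robin."""
    load = [0] * n_groups
    for i in range(n_outer):
        # Every outer band brings all of its inner-band pairs with it.
        load[i % n_groups] += n_inner
    return load


def pairs_both_loops(n_outer, n_inner, n_groups):
    """Hypothetical new scheme: individual (outer, inner) pairs go to groups round-robin."""
    load = [0] * n_groups
    for idx, _pair in enumerate(product(range(n_outer), range(n_inner))):
        load[idx % n_groups] += 1
    return load
```
</pre>
        With 4 outer bands, 6 inner bands and 8 band groups, the first
        function leaves four groups idle, while the second gives every
        group 3 pairs.<br>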

        <br>

        2. Improved OMP Support<br>

           We have added OMP threading to numerous vector operations

        within exx.f90.  In addition, we have given special priority to

        enhancing the threaded performance of the FFTs.<br>

        <br>

        3. Implementation of Different and Interchangeable Data Layouts

        for Local and EXX Portions of the Calculation<br>

           One observation that we have made is that for calculations

        that utilize many band groups, the local portion of the

        calculation (i.e., everything outside of exx.f90) often

        represents a non-negligible (or even dominant) contribution to

        the total cost of the calculation.  This is largely because the

        local portion of the calculation is duplicated on each band

        group.  We have implemented changes to the code that allow the

        local portion of the code to be parallelized in a manner that is

        independent of the number of band groups, thus avoiding

        duplication of work.<br>

           This is the single most significant modification that we have

        made, both in terms of the efficiency gain for QE and the amount
        of coding work required.  For several test

        calculations we are finding that this change results in more

        than a factor of two speedup.<br>

           In terms of code development, the primary challenge of our

        approach is that when the EXX part of the calculation is

        performed (such as when vexx is called), we must change the data

        structure from the one that is used by the local portion of the

        code to a different data structure that is used by the EXX

        portion of the code.  This change of data structure requires a

        great deal of bookkeeping in order to update arrays like igk,

        ig_l2g, psi, and hpsi.  As a result, we are still in the process of

        making our updated code compatible with gamma-point only

        calculations and with calculations that employ multiple

        k-points.<br>
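        <br>
        The cost of duplicating the local portion on each band group, as
        described in item 3, can be illustrated with an idealized cost
        model. This is only a sketch: it assumes perfect scaling and no
        communication cost, and the function name and parameters are
        hypothetical rather than anything present in QE.<br>
        <pre wrap="">
```python
def wall_time(t_local, t_exx, n_procs, n_band_groups, duplicate_local):
    """Idealized wall time; assumes perfect scaling and zero communication cost."""
    # The EXX work is spread over all processes in both schemes.
    exx = t_exx / n_procs
    if duplicate_local:
        # Each band group repeats the local work on its own share of the processes.
        local = t_local / (n_procs / n_band_groups)
    else:
        # The local work is spread over all processes, independent of band groups.
        local = t_local / n_procs
    return local + exx
```
</pre>
        For made-up inputs such as t_local=100, t_exx=1000, n_procs=64 and
        n_band_groups=16, the duplicated local term is 25 time units
        against about 1.56 when it is spread over all processes, so the
        local portion becomes the larger share of the total as the number of
        band groups grows.<br>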

        <br>

        <span><span>Sincerely,<br>

          </span></span><span><span>Dr. Taylor Barnes<br>

          </span></span>

        <div><span><span>Postdoctoral Scholar,<br>

            </span></span></div>

        <span><span>Lawrence Berkeley National Laboratory</span></span><br>

      </div>


    </blockquote>

    <br>

    <br>

    <pre class="moz-signature" cols="72">-- 

Ph.D. Carlo Cavazzoni

SuperComputing Applications and Innovation Department

CINECA - Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)

Tel: +39 051 6171411  Fax: +39 051 6132198

<a class="moz-txt-link-abbreviated" href="http://www.cineca.it">www.cineca.it</a></pre>

  </body>

</html>