<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle18
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.EmailStyle19
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:#1F497D;}
span.EmailStyle20
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:774012843;
mso-list-type:hybrid;
mso-list-template-ids:-1544893080 594059298 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-text:"\(%1\)";
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal"><span style="color:#1F497D">Hi all,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal">My name is John Kalamatianos and I work at AMD Research with the Path Forward program. I am the technical lead for the work package that targets optimizing CPU chiplets (cores and caches) for high performance and lower energy cost per instruction
when executing Exascale workloads. Since Exascale workloads can be accelerated by GPUs as well, my work is focusing on the applications that do not scale or are not accelerated well on GPUs (so the CPUs can be a good alternative) or the ones that have irregular
code and memory access patterns and wouldn’t be good candidates for running on the GPUs.<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">In this context, I would like to know the following from the MiniDFT developers:<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo2">
<![if !supportLists]><span style="mso-list:Ignore">(1)<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>Do you think it makes sense to consider MiniDFT as an application that can run on the CPU in an Exascale machine?<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo2">
<![if !supportLists]><span style="mso-list:Ignore">(2)<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>Do you have any preferred compiler optimizations you would want to consider when running MiniDFT on the CPU? The O3 flag enables somewhat different optimizations across compilers but for the most part is the flag that enables aggressive
(and relatively stable) optimizations. <o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo2">
<![if !supportLists]><span style="mso-list:Ignore">(3)<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>Do you have a preferred linkage option when building MiniDFT to run on the CPU? (Static or dynamic). In the absence of a preference I would choose dynamic linking since that is a common method used in large SW installations.<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo2">
<![if !supportLists]><span style="mso-list:Ignore">(4)<span style="font:7.0pt "Times New Roman"">
</span></span><![endif]>In order to evaluate micro-architectural improvements that benefit the proxy apps, my team uses cycle accurate simulations. Those are time consuming so we simulate a small number of HW (up to 4) contexts. My understanding is that the
work assigned to a single thread in proxy apps (including MiniDFT) depends on the number of threads enabled by the programming model (e.g. OMP_NUM_THREADS, MPI ranks, etc.). In order to ensure that the workload is representative when we simulate 4 threads,
we would like to use inputs for MiniDFT that approximate the same work/thread ratio similar to that seen in a real system running the same proxy application. For that reason, we would like to know what input/problem sizes we should use when simulating a system
running the proxy app with low number of threads (1, 2 o 4).<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Please forgive me if this not the proper forum for asking these questions but I could not find any direct contacts to engage with,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thank you,<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">John Kalamatianos<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>