[Q-e-developers] Questions to MiniDFT developers

Kalamatianos, John john.kalamatianos at amd.com
Thu Oct 26 20:01:33 CEST 2017


Hi all,

My name is John Kalamatianos and I work at AMD Research with the Path Forward program. I am the technical lead for the work package that targets optimizing CPU chiplets (cores and caches) for high performance and lower energy cost per instruction when executing Exascale workloads. Since Exascale workloads can also be accelerated by GPUs, my work focuses on applications that do not scale or are not accelerated well on GPUs (so CPUs can be a good alternative), or that have irregular code and memory access patterns and would not be good candidates for running on GPUs.

In this context, I would like to know the following from the MiniDFT developers:

(1)    Do you think it makes sense to consider MiniDFT as an application that can run on the CPU in an Exascale machine?
(2)    Do you have any preferred compiler optimizations you would want to consider when running MiniDFT on the CPU? The O3 flag enables somewhat different optimizations across compilers, but for the most part it is the flag that enables aggressive (and relatively stable) optimizations.
(3)    Do you have a preferred linkage option (static or dynamic) when building MiniDFT to run on the CPU? In the absence of a preference I would choose dynamic linking, since that is a common method used in large SW installations.
(4)    In order to evaluate micro-architectural improvements that benefit the proxy apps, my team uses cycle-accurate simulations. Those are time consuming, so we simulate a small number of HW contexts (up to 4). My understanding is that the work assigned to a single thread in proxy apps (including MiniDFT) depends on the number of threads enabled by the programming model (e.g. OMP_NUM_THREADS, MPI ranks, etc.). To ensure that the workload is representative when we simulate 4 threads, we would like to use inputs for MiniDFT that approximate the work/thread ratio seen in a real system running the same proxy application. For that reason, we would like to know what input/problem sizes we should use when simulating a system running the proxy app with a low number of threads (1, 2 or 4).

Please forgive me if this is not the proper forum for asking these questions, but I could not find any direct contacts to engage with.

Thank you,

John Kalamatianos
