[Pw_forum] QE on Xeon Phi : Execution Issue

Fabio Affinito f.affinito at cineca.it
Wed Jul 23 15:25:36 CEST 2014


Dear Nisha,

The page that you mentioned refers to running QE with the Automatic Offload mode of Intel MKL. In my experience, this mode is not efficient, except in the few cases where the matrices involved in QE are huge.
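
For reference, Automatic Offload requires no change to the QE binary: it is driven entirely by environment variables set around the host run. A minimal sketch (the threshold values are illustrative, not tuned):

  export MKL_MIC_ENABLE=1                          # turn Automatic Offload on
  export OFFLOAD_REPORT=2                          # print per-call offload statistics
  export MKL_MIC_THRESHOLDS_ZGEMM=2048,2048,2048   # offload only very large ZGEMMs
  mpirun -np 4 ./pw.x -in input.in

If the ZGEMM sizes in a run never exceed these thresholds, MKL simply keeps all the work on the host.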

At present, there is an effort aimed at releasing a version of QE that takes better advantage of offloading to the MIC cards. I hope that this version will be released soon after the summer.

An alternative to offload is native mode. You can compile QE natively for the MIC architecture simply by adding the -mmic flag to the Intel compiler (and linking the MIC build of MKL).
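
For what it is worth, a minimal sketch of such a native build (the configure options and the MKL link line are an assumption based on a generic QE 5.x setup with Intel MPI, not a tested recipe):

  cd espresso-5.0.2
  ./configure MPIF90=mpiifort FC=ifort CC=icc \
              FFLAGS="-mmic -O3" CFLAGS="-mmic -O3" \
              BLAS_LIBS="-L$MKLROOT/lib/mic -lmkl_intel_lp64 -lmkl_sequential -lmkl_core" \
              LAPACK_LIBS="-L$MKLROOT/lib/mic -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
  make pw
  # the resulting pw.x runs only on the coprocessor: copy it (and the MIC MKL
  # libraries) to the card, or export them via NFS, and launch it there

Note that a native pw.x is executed on the card itself (e.g. over ssh to mic0 or with an MPI launcher configured for the MIC hosts), not with the host-side mpirun line from your script.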

Best,

Fabio


----- Original Message -----
> From: "Nisha Agrawal" <itlinkstonisha at gmail.com>
> To: "PWSCF Forum" <pw_forum at pwscf.org>
> Sent: Wednesday, 23 July 2014 14:02:28
> Subject: [Pw_forum] QE on Xeon Phi : Execution Issue
> 
> 
> 
> Hi,
> 
> 
> I set up the Quantum ESPRESSO Intel Xeon Phi version using the
> instructions provided in the following link
> 
> https://software.intel.com/en-us/articles/quantum-espresso-for-intel-xeon-phi-coprocessor
> 
> 
> However, when I run it, the work is not getting offloaded to the Intel Xeon
> Phi. The following is the script I am using to run the QE MIC version.
> Please let me know if I missed something that needs to be set or am
> doing something wrong.
> 
> 
> -------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> source /home/opt/ICS-2013.1.039-intel64/bin/compilervars.sh intel64
> source /home/opt/ICS-2013.1.039-intel64/mkl/bin/mklvars.sh intel64
> source /home/opt/ICS-2013.1.039-intel64/impi/4.1.2.040/bin64/mpivars.sh
> 
> 
> 
> 
> export MKL_MIC_ENABLE=1
> export MKL_DYNAMIC=false
> export MKL_MIC_DISABLE_HOST_FALLBACK=1
> export MIC_LD_LIBRARY_PATH=$MKLROOT/lib/mic:$MIC_LD_LIBRARY_PATH
> 
> 
> export OFFLOAD_DEVICES=0
> 
> 
> 
> 
> export I_MPI_FALLBACK_DEVICE=disable
> export I_MPI_PIN=disable
> export I_MPI_DEBUG=5
> 
> 
> 
> 
> export MKL_MIC_ZGEMM_AA_M_MIN=500
> export MKL_MIC_ZGEMM_AA_N_MIN=500
> export MKL_MIC_ZGEMM_AA_K_MIN=500
> export MKL_MIC_THRESHOLDS_ZGEMM=500,500,500
> 
> 
> 
> 
> export OFFLOAD_REPORT=2
> mpirun -np 8 -perhost 4 ./espresso-5.0.2/bin/pw.x -in ./BN.in 2>&1 |
> tee test.log
> 
> 
> 
> -------------------------------------------------------------------------------
> 
> 
> 
> 
> "Apologizing does not mean that you are wrong and the other one is
> right...
> It simply means that you value the relationship much more than your
> ego.."
> 
> 
> On Mon, Jul 14, 2014 at 8:16 PM, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
> 
> 
> 
> On Mon, Jul 14, 2014 at 9:34 AM, Eduardo Menendez <eariel99 at gmail.com> wrote:
> > Thank you Axel. Your advice raises another doubt. Can we get the maximum
> > performance from a highly clocked CPU?
> > I used to consider that the fastest CPUs were too fast for the memory
> > access, resulting in bottlenecks. Of course it depends on cache size.
> 
> your concern is justified, but the situation is more complex these
> days. highly clocked CPUs have fewer cores and thus each core receives a larger
> share of the available memory bandwidth and the highest clocked
> inter-CPU and memory bus is only available for a subset of the CPUs.
> now you have an optimization problem that has to consider the strong
> scaling (or lack thereof) of the code in question as an additional
> input parameter.
> 
> to give an example: we purchased at the same time dual socket nodes
> that had the same mainboard, but either 2x 3.5GHz quad-core or 2x
> 2.8GHz hex-core. the 3.5GHz was the fastest clock available at the
> time. for classical MD, i get better performance out of the 12-core
> nodes, for plane-wave DFT i get about the same performance out of
> both, for CP2k i get better performance with the 8-core (in fact, CP2k
> runs fastest on the 12-core when using only 8 cores). now, the cost of
> the 2.8GHz CPUs is significantly lower, so that is why we procured the
> majority of the cluster with those. but we do have applications that
> scale less well than CP2k or are serial, but require high per-core memory
> bandwidth, so we got a few of the 3.5GHz ones, too (and since they are
> already expensive, we filled them with as much RAM as possible without
> underclocking the memory bus; in turn we put "only" 1GB/core into the
> 12-core nodes).
> 
> so it all boils down to finding the right balance and adjusting it to
> the application mix that you are running. last time i checked the
> intel spec sheets, it looked as if the best deal was to be had for
> CPUs with the second largest number of CPU cores and as high a clock
> as required to have the full memory bus speed. that will also keep the
> heat in check, as the highest clocked CPUs usually have a much higher
> TDP (>50% more) and that is just a much larger demand on cooling and
> power and will incur additional indirect costs as well.
> 
> HTH,
> axel.
> 
> 
> 
> > 
> >> Stick with the cpu. For QE you should be best off with intel. Also you are
> >> likely to get the best price/performance ratio with CPUs that have less
> >> than the maximum number of cpu cores and a higher clock instead.
> > 
> > 
> > Eduardo Menendez Proupin
> > Departamento de Fisica, Facultad de Ciencias, Universidad de Chile
> > URL: http://www.gnm.cl/emenendez
> > 
> > “Science may be described as the art of systematic
> > oversimplification.” Karl
> > Popper
> > 
> > 
> 
> > _______________________________________________
> > Pw_forum mailing list
> > Pw_forum at pwscf.org
> > http://pwscf.org/mailman/listinfo/pw_forum
> 
> 
> 
> --
> Dr. Axel Kohlmeyer akohlmey at gmail.com http://goo.gl/1wk0
> College of Science & Technology, Temple University, Philadelphia PA,
> USA
> International Centre for Theoretical Physics, Trieste. Italy.
> 
> 
> 

-- 
"When you are solving a problem, don’t worry. Now, after you have solved the problem, then that’s the time to worry." Richard Feynman

Fabio Affinito, PhD
SuperComputing Applications and Innovation Department
CINECA - via Magnanelli, 6/3, 40033 Casalecchio di Reno (Bologna) - ITALY
Tel: +39 051 6171794  Fax: +39 051 6132198



