[Pw_forum] QE on Xeon Phi : Execution Issue

Thu Jul 24 08:23:38 CEST 2014

Dear Fabio Affinito

Thank you so much for the information.

"Apologizing does not mean that you are wrong and the other one is right...
It simply means that you value the relationship much more than your ego.."

On Wed, Jul 23, 2014 at 5:32 PM, Nisha Agrawal <itlinkstonisha at gmail.com>
wrote:

> Hi,
>
> I set up the Intel Xeon Phi version of Quantum ESPRESSO following the
> instructions at this link:
>
>
> https://software.intel.com/en-us/articles/quantum-espresso-for-intel-xeon-phi-coprocessor
>
>
> However, when I run it, the work is not being offloaded to the Intel Xeon Phi.
> Below is the script I am using to run the QE MIC version. Please let me know
> if I have missed a required setting or am doing something wrong.
>
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------
> source /home/opt/ICS-2013.1.039-intel64/bin/compilervars.sh intel64
> source /home/opt/ICS-2013.1.039-intel64/mkl/bin/mklvars.sh intel64
> source /home/opt/ICS-2013.1.039-intel64/impi/4.1.2.040/bin64/mpivars.sh
>
> export MKL_MIC_ENABLE=1
> export MKL_DYNAMIC=false
> export MKL_MIC_DISABLE_HOST_FALLBACK=1
> export MIC_LD_LIBRARY_PATH=$MKLROOT/lib/mic:$MIC_LD_LIBRARY_PATH
>
> export OFFLOAD_DEVICES=0
>
> export I_MPI_FALLBACK_DEVICE=disable
> export I_MPI_PIN=disable
> export I_MPI_DEBUG=5
>
>
> export MKL_MIC_ZGEMM_AA_M_MIN=500
> export MKL_MIC_ZGEMM_AA_N_MIN=500
> export MKL_MIC_ZGEMM_AA_K_MIN=500
> export MKL_MIC_THRESHOLDS_ZGEMM=500,500,500
>
>
> export OFFLOAD_REPORT=2
> mpirun  -np 8 -perhost 4  ./espresso-5.0.2/bin/pw.x   -in  ./BN.in 2>&1 |
> tee test.log
>
> ---------------------------------------------------------------------
>
> -------------------------------------------------------------------------------
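Since the script above sets OFFLOAD_REPORT=2 and pipes everything into test.log, one quick way to see whether anything reached the coprocessor is to count the report lines in that log. The sketch below is only an illustration: the helper name count_offload_lines is hypothetical, and the pattern assumes that report lines carry a "[MKL] [MIC" or "[Offload] [MIC" tag, which may differ between MKL and compiler versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of QE or MKL): count log lines that look
# like offload report entries.
# Assumption: such lines contain "[MKL] [MIC" or "[Offload] [MIC";
# adjust the pattern if your toolchain prints a different tag.
count_offload_lines() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # grep -c prints the number of matching lines (0 when there are none).
    grep -E -c '\[(MKL|Offload)\] \[MIC' "$log_file"
}

# Usage, after the run has finished:
#   count_offload_lines test.log
```

A count of 0 would be consistent with the symptom described here (nothing offloaded), while a non-zero count only shows that report lines were printed, not how much work was actually moved.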
>
>
>
>
> On Mon, Jul 14, 2014 at 8:16 PM, Axel Kohlmeyer <akohlmey at gmail.com>
> wrote:
>
>> On Mon, Jul 14, 2014 at 9:34 AM, Eduardo Menendez <eariel99 at gmail.com>
>> wrote:
>> > Thank you, Axel. Your advice raises another question. Can we get the
>> > maximum performance from a highly clocked CPU?
>> > I used to think that the fastest CPUs were too fast for memory
>> > access, resulting in bottlenecks. Of course, it depends on cache size.
>>
>> your concern is justified, but the situation is more complex these
>> days. highly clocked CPUs have fewer cores and thus receive a larger
>> share of the available memory bandwidth and the highest clocked
>> inter-CPU and memory bus is only available for a subset of the CPUs.
>> now you have an optimization problem that has to consider the strong
>> scaling (or lack thereof) of the code in question as an additional
>> input parameter.
>>
>> to give an example: we purchased at the same time dual socket nodes
>> that had the same mainboard, but either 2x 3.5GHz quad-core or 2x
>> 2.8GHz hex-core. the 3.5GHz was the fastest clock available at the
>> time. for classical MD, i get better performance out of the 12-core
>> nodes, for plane-wave DFT i get about the same performance out of
>> both, for CP2k i get better performance with the 8-core (in fact, CP2k
>> runs fastest on the 12-core with using only 8 cores). now, the cost of
>> the 2.8GHz CPUs is significantly lower, so that is why we procured the
>> majority of the cluster with those. but we do have applications that
>> scale less than CP2k or are serial, but require high per-core memory
>> bandwidth, so we got a few of the 3.5GHz ones, too (and since they are
>> already expensive we filled them with RAM as much as it doesn't result
>> in underclocking of the memory bus; and in turn we put "only" 1GB/core
>> into the 12-core nodes).
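The trade-off between the two node types in this example can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above (2x 3.5GHz quad-core versus 2x 2.8GHz hex-core). The helper name node_summary is hypothetical, "aggregate GHz" is a crude proxy rather than a performance prediction, and the per-core bandwidth share assumes both node types have the same total memory bandwidth, which need not hold if the memory bus speed differs between CPU models.

```shell
#!/bin/sh
# Hypothetical helper: summarize a node as total cores, aggregate clock,
# and the fraction of node memory bandwidth each core gets when all
# cores are busy (assuming equal total bandwidth across node types).
node_summary() {
    sockets="$1"
    cores_per_socket="$2"
    ghz="$3"
    LC_ALL=C awk -v s="$sockets" -v c="$cores_per_socket" -v f="$ghz" 'BEGIN {
        cores = s * c
        printf "%d cores, %.1f GHz aggregate, 1/%d of node bandwidth per core\n",
               cores, cores * f, cores
    }'
}

node_summary 2 4 3.5   # prints: 8 cores, 28.0 GHz aggregate, 1/8 of node bandwidth per core
node_summary 2 6 2.8   # prints: 12 cores, 33.6 GHz aggregate, 1/12 of node bandwidth per core
```

Under these assumptions the 12-core node has about 20% more aggregate clock, while each core on the 8-core node gets 1.5 times the bandwidth share, which fits the observation that compute-bound classical MD favors the 12-core nodes and more bandwidth-sensitive codes favor the 8-core ones.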
>>
>> so it all boils down to finding the right balance and adjusting it to
>> the application mix that you are running. last time i checked the
>> intel spec sheets, it looked as if the best deal was to be had for
>> CPUs with the second largest number of CPU cores and as high a clock
>> as required to have the full memory bus speed. that will also keep the
>> heat in check, as the highest clocked CPUs usually have a much higher
>> TDP (>50% more) and that is just a much larger demand on cooling and
>> power and will incur additional indirect costs as well.
>>
>> HTH,
>>     axel.
>>
>>
>> >
>> >> Stick with the cpu. For QE you should be best off with intel. Also you
>> >> are likely to get the best price/performance ratio with CPUs that have
>> >> less than the maximum number of cpu cores and a higher clock instead.
>> >
>> >
>> > Eduardo Menendez Proupin
>> > Departamento de Fisica, Facultad de Ciencias, Universidad de Chile
>> > URL: http://www.gnm.cl/emenendez
>> >
>> > "Science may be described as the art of systematic oversimplification."
>> > Karl Popper
>> >
>> >
>> > _______________________________________________
>> > Pw_forum mailing list
>> > Pw_forum at pwscf.org
>> > http://pwscf.org/mailman/listinfo/pw_forum
>>
>>
>>
>> --
>> Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
>> College of Science & Technology, Temple University, Philadelphia PA, USA
>> International Centre for Theoretical Physics, Trieste. Italy.
>>
>>
>
>