[Pw_forum] Re: Woodcrest vs Opteron performance in pwscf calc.
akohlmey at vitae.cmm.upenn.edu
Wed Aug 2 14:07:34 CEST 2006
On Wed, 2 Aug 2006, Alexander Shaposhnikov wrote:
AS> Thanks for the answer.
AS> I don't think this topic is relevant to the PW_forum goals.
actually, it is. some people here spend a lot of money
on new machines and figuring out what is the best deal
for a specific application needs a lot of testing. so
every contribution is important.
AS> time to move on to the new platform. But my experience with the prev.
AS> generation of Intel processors showed that Opteron is faster in most
AS> cases, especially when it comes to multi-threaded calculations.
really?? are you talking about OpenMP multi-threaded or explicit
multi-threaded? in my experience so far, OpenMP on an opteron system
was a serious letdown, and it was usually much better to use
MPI parallelism, even within the nodes.
AS> > It seems that woodcrest and dempsey are much faster than opteron. The
AS> > scalability of dempsey is the best, woodcrest is the worst. Despite the
AS> > amazing performance per core of woodcrest, it drops to the same level as
AS> > its predecessor, dempsey, when taking the machine as a unit to evaluate
AS> > its performance.
one thing to check when using intel MKL is, whether it is running in
multi-threaded mode and thus getting better results on a 'half-loaded'
machine. for that, you may want to re-run the jobs with the
environment variable OMP_NUM_THREADS set to 1. secondly, memory
contention is a problem, so it would be interesting to see the
performance, if you run 4 serial jobs at the same time.
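
a minimal sketch of that test, assuming python is available on the
machine: it starts N copies of a command at the same time with
OMP_NUM_THREADS forced to 1 and reports the wall time for the batch.
the helper name run_concurrent and the placeholder command are made up
for illustration; substitute your own serial pw.x invocation, and give
each copy its own working directory so the jobs don't overwrite each
other's files.

```python
import os
import subprocess
import sys
import time


def run_concurrent(cmd, n_jobs, omp_threads=1):
    """Start n_jobs copies of cmd at once; return (wall_seconds, exit_codes).

    run_concurrent is a hypothetical helper, not part of any package.
    """
    env = dict(os.environ)
    # Pin the OpenMP thread count so a threaded BLAS cannot use idle cores.
    env["OMP_NUM_THREADS"] = str(omp_threads)
    start = time.perf_counter()
    procs = [
        subprocess.Popen(cmd, env=env, stdout=subprocess.DEVNULL)
        for _ in range(n_jobs)
    ]
    codes = [p.wait() for p in procs]
    return time.perf_counter() - start, codes


if __name__ == "__main__":
    # Placeholder workload (an assumption); replace with your real serial job.
    cmd = [sys.executable, "-c", "print(sum(i * i for i in range(10**6)))"]
    base, _ = run_concurrent(cmd, 1)
    for n in (1, 2, 4):
        wall, codes = run_concurrent(cmd, n)
        # A ratio near 1.0 means the copies did not slow each other down.
        print(f"{n} job(s): {wall:.2f} s wall, {wall / base:.2f}x of 1 job, "
              f"exit codes {codes}")
```

if the wall time for 4 simultaneous jobs is much larger than for a
single job, the cores are competing for memory bandwidth rather than
for cpu time.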
AS> > But remember one thing: the number for opteron may not be fair. I
AS> > compiled the program using Intel fortran, Intel MPI 2.0. However, I
AS> > once used both Intel and PathScale to compile FFTW and its test cases
AS> > on an opteron machine, and I didn't find any impressive differences.
the intel compiler usually does a good job on opteron, especially
since floating point intensive jobs don't benefit much, if at
all, from the automated vectorization with SSE. i usually get
the best performance on opteron and P4 with '-O2 -tpp6 -unroll'
and no vectorization. that, however, is a different story
when it comes to BLAS/LAPACK: using ACML > 2.7 is essential to
get good performance on dual-core opteron machines.
there is a way to make (the gcc) ACML compatible with the intel
compiler (at least for packages that use only double precision).
it would be nice to see how using ACML would affect
the performance in this case.
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
If you make something idiot-proof, the universe creates a better idiot.