[Pw_forum] Re: Woodcrest vs Opteron performance in pwscf calc.

Konstantin Kudin konstantin_kudin at yahoo.com
Mon Aug 7 18:00:14 CEST 2006

 Hi Huiqun,

 Very interesting performance results! At this point one can offer a
simple explanation for the relative speeds. See my comments below.

> Here are numbers
> (1) woodcrest (2.66 GHz):
> 1 core : 3m57s (3m55.86s)
> 2 cores: 2m11s (2m10.44s)
> 4 cores: 1m23s (1m17.73s)
> (2) dempsey (3.2 GHz)
> 1 core : (6m26.90s)
> 2 cores: (3m16.47s)
> 4 cores: (1m39.74s)
> (3) opteron 280 (2.6 GHz)
> 1 core : 7m13s (7m09.71s)
> 2 cores: 3m56s (3m52.70s)
> 4 cores: 2m26s (2m16.72s)
> It seems that woodcrest and dempsey are much faster than opteron. The
> scalability of dempsey is the best, woodcrest is the worst. Despite the
> amazing performance per core of woodcrest, it drops to the same level
> as its predecessor, dempsey, when taking the machine as a unit to
> evaluate its performance.

 A while ago Nicola Marzari was doing extensive benchmarking, and I
joined the effort at some point as well. At the time, CP timings were
judged to be determined by BLAS performance. It now appears that the
same applies to PWSCF. The fastest BLAS library at the moment is
probably GotoBLAS.

 Dempsey and Opteron cores do 2 double-precision floating-point
operations per cycle in BLAS, while Woodcrest does 4. So effectively
you get these peak BLAS rates per core (operations per cycle x GHz):
 Woodcrest (4 x 2.66 = 10.6), Dempsey (2 x 3.2 = 6.4), Opteron (2 x 2.6 = 5.2).
That is exactly the order you get in terms of performance. Your Opteron
scaling is not too good, which suggests either that there is not enough
memory bandwidth or that you do not have NUMA turned on.
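
 The arithmetic above can be sketched in a few lines of Python. This is
only an illustration: the operations-per-cycle and clock values are the
ones quoted in this message, the single-core times are the parenthesized
ones from your table, and names such as peak_gflops are made up for the
example.

```python
# Peak BLAS rate per core = floating-point operations per cycle x clock (GHz).
# Values are taken from this message; times are the parenthesized
# single-core timings from the quoted table, converted to seconds.
cores = {
    "Woodcrest":   {"flops_per_cycle": 4, "ghz": 2.66, "t1_sec": 3 * 60 + 55.86},
    "Dempsey":     {"flops_per_cycle": 2, "ghz": 3.2,  "t1_sec": 6 * 60 + 26.90},
    "Opteron 280": {"flops_per_cycle": 2, "ghz": 2.6,  "t1_sec": 7 * 60 + 9.71},
}


def peak_gflops(cpu):
    """Theoretical peak rate of one core, in GFLOP/s."""
    return cpu["flops_per_cycle"] * cpu["ghz"]


# Rank the chips by theoretical peak (fastest first) and by measured
# single-core time (shortest first), then compare the two rankings.
by_peak = sorted(cores, key=lambda name: peak_gflops(cores[name]), reverse=True)
by_time = sorted(cores, key=lambda name: cores[name]["t1_sec"])

for name in by_peak:
    cpu = cores[name]
    print(f"{name:12s} peak {peak_gflops(cpu):5.2f} GFLOP/s, "
          f"1 core: {cpu['t1_sec']:.2f} s")

print("same ordering:", by_peak == by_time)
```

 The two rankings agree, which is all the simple model claims. It says
nothing about memory bandwidth, which matters once several cores are busy.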

 Now, the theoretical performance would translate into the real world
if the memory is fast enough. I think both Dempsey and Woodcrest use
the same chipset with 2 buses, so earlier memory contention issues with
multiple Intel chips are mostly gone for now. Still, you see that with
4 Woodcrest cores the speedups are worse than for Dempsey, which
suggests that perhaps the optimal purchase for QE would be
lower-frequency chips, such as 2.0 or 2.33 GHz, since four 2.66 GHz
cores are too fast for the memory.
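
 To put numbers on the scaling, here is a minimal sketch that computes
speedup and parallel efficiency from the quoted timings. It assumes the
parenthesized times are the ones to compare (they are the only ones given
for all three machines); the function names are just for illustration.

```python
# Parenthesized timings from the quoted table, in seconds, keyed by core count.
times = {
    "Woodcrest 2.66 GHz":  {1: 235.86, 2: 130.44, 4: 77.73},
    "Dempsey 3.2 GHz":     {1: 386.90, 2: 196.47, 4: 99.74},
    "Opteron 280 2.6 GHz": {1: 429.71, 2: 232.70, 4: 136.72},
}


def speedup(t, n):
    """Speedup on n cores relative to the 1-core run."""
    return t[1] / t[n]


def efficiency(t, n):
    """Parallel efficiency: speedup divided by the number of cores."""
    return speedup(t, n) / n


for name, t in times.items():
    for n in (2, 4):
        print(f"{name:20s} {n} cores: speedup {speedup(t, n):.2f}, "
              f"efficiency {efficiency(t, n):.0%}")
```

 On these numbers Dempsey stays close to ideal scaling at 4 cores, while
Woodcrest and the Opteron both come out at roughly 3x.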

