[Pw_forum] Seeking Advice on Small Hardware Platforms for PWscf Implementation
Axel Kohlmeyer
akohlmey at cmm.chem.upenn.edu
Thu Nov 22 06:13:09 CET 2007
On Mon, 19 Nov 2007, Paul M. Grant wrote:
PG> To All Forum Members:
hi paul,
due to (longish) travel i'm a bit late in this discussion,
but i still want to put in my two rupees (i got plenty of
those these days ;-) ).
PG> I'm planning on building a new Linux box (or boxes) to explore highly
PG> correlated systems (e-p coupling plus LDA+U), and am seeking the collective
PG> experience and advice of the PWscf community on a suitable, inexpensive (<
PG> 2000 USD, MB+CPUs+RAM, exclusive of power supplies, enclosures, and
PG> accessories) hardware platform. I emphasize that the principal purpose of
PG> this new box would be exploratory, or for development, not production.
with $2k you have a pretty tight budget, and in the computer
hardware business great bargains are extremely rare: you mostly
get what you pay for. however, you can pay in two currencies, money
and your time, so first you have to decide how much time you
want to spend on putting together the machine and adapting the codes.
assuming that you don't want to rewrite large parts of QE, i'd
suggest sticking with a PC (multi-core) platform. hacking consoles
with cell and GPUs is out of the question for two reasons:
a) they are single precision and b) they work best on (very)
small compute kernels. there is a recent paper by the
VMD/NAMD crowd on GPU computing with NVIDIA cards and also a
paper on adapting a classical MD code for cell. in both cases
a large speedup was only achieved with a restricted compute
kernel that was specifically (re-)written for the hardware in
question, and the speedup for more generic problems was smaller.
i can dig out the references if people are interested in the details.
as with most scientific papers these days, one has to read
carefully between the lines to see the (IMNSHO major) problems that
make those platforms of limited use for plane wave pseudopotential
codes and other multi-feature, general purpose codes. for small
analysis tools or utilities with a small compute kernel they are
a good way to improve performance, since in most cases you'll
have to adapt/rewrite those codes anyway. having to use single
precision in the vector/GPU units is even worse, as it limits
your accuracy severely: at best you could offload some
scf cycles to the vector/GPU module, but the final convergence
has to be achieved in double precision.
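to see how fast single precision falls apart, here is a minimal
fortran sketch (not from QE; just an illustration): summing many
small contributions, as any accumulation in an scf cycle does,
drifts visibly in single precision:

  program precision_demo
    implicit none
    integer :: i
    real(kind=4) :: s4
    real(kind=8) :: s8
    s4 = 0.0
    s8 = 0.0d0
    ! sum 1.0e-7 ten million times; the exact answer is 1.0
    do i = 1, 10000000
       s4 = s4 + 1.0e-7      ! single precision: roundoff piles up
       s8 = s8 + 1.0d-7      ! double precision
    end do
    print *, 'single precision sum:', s4  ! off by a few percent
    print *, 'double precision sum:', s8  ! matches 1.0 to ~9 digits or better
  end program precision_demo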
PG> I've built several past platforms, both Windows and Linux based, using
PG> server boards manufactured by Supermicro and have had generally good
PG> experience and service (the factory is only 15 miles from where I live).
PG> Currently, I use two machines with dual Xeon processors (single core, 32
PG> bits), one with 1 GHz cpus, 1 GB RAM, the other 2.4 GHz and 3.25 GB RAM,
PG> both with bus speeds of 133 MHz, the newest 3 years old. However,
PG> occasionally I run PWscf exercises on my little Thinkpad X41 tablet (single
PG> processor, 1.5 GHz, 1.5 GB), and the scf computation will run 3-5 times
PG> faster than on the other machines! I suspect this rather surprising result
PG> is because the Thinkpad has a 400 MHz bus clock speed.
for current machines, memory transfer rates _and_ latencies are
decisive for the speed of codes that operate on large data sets (such as QE).
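you can check this yourself with a simple triad loop over arrays
much larger than the cache (a hedged sketch; the numbers depend
entirely on your machine). on bandwidth-starved boxes the GB/s
figure, not the clock rate, is what limits pw.x-style workloads:

  program triad
    implicit none
    integer, parameter :: n = 10000000  ! 3 x 8 bytes x 1e7 = ~240 MB, far beyond any cache
    integer, parameter :: nrep = 10
    real(kind=8), allocatable :: a(:), b(:), c(:)
    real(kind=8) :: t0, t1
    integer :: i, rep
    allocate(a(n), b(n), c(n))
    b = 1.0d0
    c = 2.0d0
    call cpu_time(t0)
    do rep = 1, nrep
       do i = 1, n
          a(i) = b(i) + 3.0d0*c(i)      ! 2 flops per ~24 bytes of memory traffic
       end do
    end do
    call cpu_time(t1)
    print *, 'effective bandwidth (GB/s):', nrep*24.0d0*real(n,8)/((t1-t0)*1.0d9)
    print *, a(1)                       ! reference a() so the loop is not optimized away
  end program triad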
PG> One option I'm considering is using a "gaming" or server class motherboard
PG> with dual 2.33 GHz quad-core 64-bit processors, a 1333 MHz FSB, and 16 GB
PG> RAM. Having said this, I'm not sure PWscf (and the Fortran compilers
PG> available) can handle all this parallelism efficiently on a single
PG> motherboard. I've noticed when running pw.x, the CPU activity "flips"
PG> between processors every several seconds, instead of sharing each at 90-100%
PG> full time.
if you have a new enough kernel, there is a tool called "numactl" with
which you can adjust the placement of processes on cpus and memory.
some MPI packages, like OpenMPI, have flags to do this automatically
(although they still cannot handle the intel quad core properly
when using only half of the cores).
the overhead of having the jobs "jump around" is rather small (~5%).
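for example (a hedged sketch: flag names vary between numactl and
OpenMPI versions, so check the man pages, and scf.in/scf.out just
stand in for your own files):

  # pin a serial pw.x run to one cpu and its local memory
  numactl --physcpubind=0 --membind=0 ./pw.x < scf.in > scf.out

  # or let openmpi do the pinning itself (1.2.x-era flag)
  mpirun -np 4 --mca mpi_paffinity_alone 1 ./pw.x < scf.in > scf.out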
since memory bandwidth is important, going for the fastest frontside
bus is imperative, and a "gaming" board may be the ticket (if you
are willing to experiment, you can slightly overclock and gain a
few % extra). intel quad cores are very competitively priced, so
even if you don't use all cores (only two instead of four), you can
get a significant speed increase, since you double the cpu cache per
MPI task (n.b., the intel quad is actually a dual-dual cpu with
a shared cache for each dual-core part). clock rate is not that
important, and with your budget the amount of memory is going to
be the major decider. i.e., i would work out how much memory you
need, add 20%, and then buy the best board with an intel quad that
you can afford.
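to make that concrete: if your biggest planned pw.x run peaks at,
say, 10 GB (watch the process with top), then 10 GB + 20% = 12 GB,
so shop for a board/memory combination around 12-16 GB and put
whatever money is left into the cpu.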
PG> On the other hand, one could consider building a small MPI-connected cluster
PG> for about the same amount of money.
i don't think it would be worth it. you can only afford gigabit
ethernet, and i'd rather go for a dual-cpu "server board" than for
two desktop boards. the dual-quad will give you more flexibility
(you can run one big serial job and up to 8 MPI tasks). particularly
for development, this is a desirable setup. in production, things
are different...
PG> When IBM announced a couple of years ago the incredible performance
PG> details about the Cell processor that would go into Playstation 3, I
PG> thought, "Wow, maybe the future of computational physics rests with
PG> gamers." I'm sure most of you know this is actually beginning to
PG> happen, spurred on by the fact that the PS3 is "open architecture"
PG> and can run a Linux distro. Moreover, there apparently are "open
PG> software" numerical analysis tools available from IBM. At least
PG> four US universities are experimenting with off-the-shelf PS3
PG> clusters, perhaps one of the more interesting is at UMass,
PG> http://gravity.phy.umassd.edu/ps3.html. In the last week or so,
PG> Sony lowered the entry level price of the PS3 to 400 USD. So, a
PG> cluster of four with a cheap switch could be purchased for about the
PG> same price as the single motherboard configuration I mentioned
PG> above.
this is primarily (like IBM's blue gene) an exercise in excellent
marketing. they really learned from p.t. barnum ;-).
not that this is bad hardware, but codes have to be redesigned
to work well on it. that is straightforward for a (simple)
matrix multiplication, but a PITA for a general purpose code
that is written mostly by people who know very little about
clock cycles, TLBs, cache lines, pipelines, speculative execution,
etc. etc., and who care mostly about getting the physics right (which
is a hard enough job already).
PG> My teenager, a gamer, tells me the PS3 has problems. He says it's
PG> unreliable and overheats and only has 256 MB RAM on board (he owns a Wii,
PG> which outsells the PS3 in the US by a factor of three).
PG>
PG> Has anybody tried porting PWscf to a PS3?
you can (probably) port it using only the powerpc "frontend" cpu core,
but using the rest (and the PS3 does not even make all of its special
purpose cores available) will be hard. i understand that people want
to tap into this, but they don't factor in the amount of time needed
to get the machine _and_ the applications working. this has been tried
before, and so far the machines (and codes) only became usable by the
time they were already obsolete
(the same is true for special purpose MD cpus).
so, now i've probably destroyed some people's hopes (or offended
them), but one has to be realistic about what people can do, and
the reality is that we are already struggling to inflict enough
programming knowledge on researchers to get their work done with
ancient languages like fortran, where there are tons of experience
and teaching material around. expecting a low-level understanding
of machine architecture is something for a very limited number of
geeks, and they usually don't understand the physics well enough.
just look at the problems most people run into when compiling/running
simple MPI parallel executables (which is quite trivial, all things
considered).
best regards and good luck,
axel.
PG>
PG> Any and all advice is welcome.
PG>
PG> Paul M. Grant, PhD
PG> Principal, W2AGZ Technologies
PG> Visiting Scholar, Applied Physics, Stanford University
PG> EPRI Science Fellow (Retired)
PG> IBM Research Staff Member Emeritus
PG> w2agz at pacbell.net
PG> http://www.w2agz.com
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.