[Pw_forum] questions about intel CPU and vc-relax using bfgs cell optimization

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Fri Jun 27 01:18:33 CEST 2008


On Thu, 26 Jun 2008, vega lew wrote:

VL> Dear Axel,
VL> 

VL> First, I wanna express my acknowledgement for your kindly 
VL> responding. From it I learned a lot.

if you want to do me a personal favor, please don't 
use "wanna". i'm quite old fashioned and this kind 
of shortcut writing always makes me cringe (same 
goes for "plz" and "U" and so on). if i spend time
explaining something, i appreciate somebody taking 
the time to write normal language. i understand
that many people don't think anything of it, 
and given my own refusal to write capital letters
and the many, many typos when i'm tired this may
sound a bit odd, but i cannot help it. thanks.

[...]

VL> And I see 4 processes on each node with the 'top' command. So I think 
VL> each core has a process. Is there anything I am misunderstanding?

i cannot tell. there are too many ways to mess up a cluster
installation. neither 'top' nor 'sar' is a good measure
of performance. i would first run some MPI benchmarks/tests 
to make sure everything is set up correctly.
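
a minimal sanity check, assuming OpenMPI and placeholder host/file
names (adapt these to your setup), could look like this:

  # confirm that one MPI task really lands on each core of each node
  mpirun -np 20 -hostfile mynodes hostname | sort | uniq -c

  # then check point-to-point bandwidth/latency between two nodes,
  # e.g. with the OSU micro-benchmarks, if you have them installed
  mpirun -np 2 -host node01,node02 ./osu_bw

if the measured bandwidth is far below what your network should
deliver, fix that first before worrying about pw.x performance.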

VL> I think I should also use the mpi command like this: 'mpiexec -n 20 
VL> pw.x -npool 5 < inputfile > outputfile'

VL> I have tried it, but there is no obvious improvement; the CPU usage 
VL> reported by the 'sar' command is still about 60%, with 7% for the 
VL> system and the rest idle.

there is something wrong. hard to tell whether it is in the way you
have set up the machine, compiled your executables, or run the
jobs. i just did a series of runs using your input on one of our 
clusters, and from those you can see roughly what kind of performance 
and scaling you can expect. you'll find a summary at the end of this 
mail. perhaps it is of use to other people looking for optimal ways
of running QE jobs. the main lesson: it helps to read the documentation
and experiment a little bit.

[...]

VL> QE contains FFTW? Where is it? Should it be detected by the QE configure?

it is in clib and it is used when FFTW is _not_ detected.

VL> if I compile FFTW with gcc, or don't compile FFTW at all, the QE 
VL> configure can't detect it and doesn't report FFTW during the 
VL> configure process, even though I have put the directories into the 
VL> environment variable. So I think only the fftw compiled with ifort 
VL> can be found by my QE.

no. fftw is c code and QE uses its own c wrappers so using ifort
has no impact at all.

VL> Do you think intel MKL is needed? When I configured QE without 
VL> intel MKL, the QE configure could also find the BLAS and LAPACK 
VL> under its own folder.

the bundled BLAS and LAPACK are only a fallback of last resort. using 
MKL will speed up calculations quite a bit.
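
just as a sketch of how that could be wired up at configure time (the
library names and paths below are placeholders; they depend on your MKL
and FFTW versions, so check the MKL documentation and the QE install
notes for the exact link line):

  ./configure BLAS_LIBS="-L/opt/intel/mkl/lib/em64t -lmkl_em64t -lguide -lpthread" \
              LAPACK_LIBS="-L/opt/intel/mkl/lib/em64t -lmkl_em64t" \
              FFT_LIBS="-L/opt/fftw/lib -lfftw"

(use -lfftw3 instead of -lfftw if you built FFTW version 3.)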

VL> You mentioned OMP_NUM_THREADS again. I'm sorry I know little about it.

you have to read the MKL documentation.

VL> Should I use the export command like 'export OMP_NUM_THREADS=1'?
VL> If the command is enough, could you please tell me when I should 
VL> type it? before configuring the QE? or before using the mpiexec 
VL> command?

that will only set it in the local shell. setting OMP_NUM_THREADS has
nothing to do with compiling QE, only with running it, so it has to be
set _always_ when running in parallel, _except_ for some very special
cases (gamma point only, many states, serial executable) where the
multi-threaded MKL may actually be faster than running MPI-parallel.
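
to make the 'when' concrete, here is a minimal job script sketch
(everything besides the pw.x line quoted above depends on your queueing
system and MPI, so treat it as an illustration only):

  # set at run time, right before launching the parallel job;
  # nothing needs to change when configuring or compiling QE
  export OMP_NUM_THREADS=1
  mpiexec -n 20 pw.x -npool 5 < inputfile > outputfile

keep in mind that a plain 'export' is only guaranteed to affect the
shell you type it in; whether it reaches the remote MPI tasks is a
separate question (see below).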

whether that is enough or not, i cannot tell. that depends on your
MPI library. you have to read the MPI library documentation on how to
export local environment variables to the remote tasks. i'm using 
OpenMPI as my mpi library and i start my parallel jobs with

mpirun -x OMP_NUM_THREADS=1 ...

to set the environment for all MPI tasks.
all MPI packages are different, so you have to read the 
documentation of your MPI package (or use OpenMPI ;-) ).
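
purely as a sketch for people on other MPI packages (flag names vary
between launchers and versions, so check your own mpiexec documentation):
MPICH-style launchers and intel MPI usually provide something along the
lines of

mpiexec -n 20 -genv OMP_NUM_THREADS 1 pw.x -npool 5 < inputfile > outputfile

and many batch systems can also be told to export selected environment
variables to the job for you.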


VL> And could you give me some hints about optimizing the Anatase 
VL> lattice using the BFGS scheme? Why does 'cell_dofree = 'xyz'' in the 
VL> &CELL section have no effect in preventing the lattice angles from 
VL> changing, and instead result in a 'not orthogonal operation' error?

VL> Do you think I should never use BFGS to optimize the Anatase lattice? 

well, you already got paolo's recommendation, but i would also like to
point out that this feature is flagged as "experimental"...

in any case, here is the protocol of my test runs of your system.

enjoy,
   axel.


the following tests were run on dual processor AMD opteron 248
nodes with a 2.2GHz clock and 2GB RAM. The machine has a myrinet 2000 
rev04 interconnect and a GigaBit Ethernet with a (pretty crappy) SMC 
gigabit switch and Broadcom BCM5704 ethernet controllers (tg3 driver). 
the machine runs scientific linux 4.x (equivalent to RHEL 4.x).

i've changed the input to do only one SCF calculation. 
that run converges with the given settings in 17 SCF cycles.
timings are taken from the PWSCF line at the end of the job.
the performance percentages below are the wall time relative to the 
serial run (wall time is all we care about, after all, right?).
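
(in case you want to extract that line yourself: something like

grep 'PWSCF' outputfile | tail -1

does the job, where 'outputfile' is whatever you redirected pw.x into;
the last match is the timing summary, the first is just the program
banner.)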

first the serial performance of the parallel binary
(a non-parallel compiled binary should be ~10% faster, 
 but i don't have one at hand right now).
   13m31.08s CPU time,    13m37.49s wall time

now running in parallel across the two cpus within one node 
    7m57.19s CPU time,     8m 2.49s wall time ( ~59%)
=>  not perfect scaling, but quite nice

now running across 10 nodes (i.e. 20 cpus), equivalent to
5 quad-core nodes, using the gigabit network:
   12m23.00s CPU time,    35m59.80s wall time (~265%) 
=> OUCH! gigabit and g-space parallelization don't mix well.
   not only does it take much more cpu time, there is also
   a lot of waiting involved (2/3 of the time).

now using the myrinet instead of gigabit with the same settings:
    5m37.57s CPU time,     5m59.54s wall time ( ~44%)
=> _much_ better. so we didn't waste our money on the fast 
   network, but keep in mind that the speedup only looks large 
   in comparison to the gigabit run, not to the serial one.

perhaps we are pushing too hard. let's try fewer cpus, i.e.
only two nodes (4 cpus total).
    5m37.41s CPU time,     5m48.58s wall time ( ~43%)
=> indeed. the same speedup with only two nodes.

let's see if it can go faster with twice as many nodes (8 cpus).
    4m19.55s CPU time,     4m27.07s wall time ( ~32%)
=> a bit faster but it looks like we're getting close to the 
   end of the line...

let's see if it can go faster with a few more nodes (12 cpus).
    5m13.68s CPU time,     5m29.59s wall time ( ~40%)
=> no. we're out of luck.

so g-space parallelization stops scaling somewhere between 8 and 
12 cpus for this system, even on a fast network. 

now let's move on to using -npools. again we start with 10 nodes 
with 2 cpus each. we have 32 k-points in total, so we can use 20 pools. 
using gigabit first.
     2m32.99s CPU time,     2m48.98s wall time ( ~20%)
=> NICE. almost 5 times faster than the serial run, and the gigabit
   does not hurt us anymore. cool. 

can we go any faster? trying 16 nodes / 32 cpus with 32 pools
   2m 8.31s CPU time,     2m25.25s wall time   ( ~18%)
=> a-ha... matching the number of pools to the number of k-points
   makes this optimal.

how about the fast network?
   2m18.56s CPU time,     2m21.71s wall time   ( ~17%)
=> a tiny bit faster, but not much.

can we go even faster? now trying 32 nodes with 32 pools,
i.e. two cpus sharing the g-space work within each pool, gigabit:
    0m50.14s CPU time,     1m 0.91s wall time  (~7.5%)
=> perfect.

and the same with myrinet:
      53.08s CPU time,   55.82s wall time      (~6.8%)
=> even better. too bad our cluster does not have more empty 
   nodes, or else the job might run even faster...
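
for completeness, a sketch of the kind of command lines such runs boil
down to (hostfile, input and output names are placeholders, and the
launcher options are OpenMPI-style; adapt them to your own MPI):

  # pure g-space parallelization, e.g. 20 cpus:
  mpirun -np 20 -hostfile nodes.txt pw.x < anatase.scf.in > scf.20cpu.out

  # k-point pools, e.g. 32 cpus split into 32 pools (one k-point per pool):
  mpirun -np 32 -hostfile nodes.txt pw.x -npool 32 < anatase.scf.in > scf.32pool.out

  # 64 cpus in 32 pools, i.e. 2 cpus per pool sharing the g-space work:
  mpirun -np 64 -hostfile nodes.txt pw.x -npool 32 < anatase.scf.in > scf.64cpu.out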



VL> 
VL> thank you again for your detailed response.
VL> 
VL> Vega Lew
VL> PH.D Candidate in Chemical Engineering
VL> College of Chemistry and Chemical Engineering
VL> Nanjing University of Technology, 210009, Nanjing, Jiangsu, China
VL> 

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.


