[Pw_forum] timing vs QE version vs pools

Giovanni Cantele Giovanni.Cantele at na.infn.it
Tue Mar 10 16:47:24 CET 2009

>> - it seems that QE 3.2.3 always performs a little bit better than
>> 4.0.4,
>> any hint on what (if any) is wrong in what I'm doing?
> assuming that the two versions are compiled with the same options
> and libraries: please check where the difference comes from, in the
> cpu time report at the end of each calculation

Let's consider one of the tests, namely 16 cpus, 64 k-points, 1 pool (in 
the case of 4.0.4 I used -ndiag 1 to get rid of any effect/difference 
coming from the diagonalization parallelism).

These are the relevant (that is, showing the largest differences) results:
routine           3.2.3 time   4.0.4 time   4.0.4 - 3.2.3 (s)

total CPU time    8m49.79s     9m26.57s       36.78
init_run           44.22s       42.89s        -1.33
electrons         485.50s      522.73s        37.23
c_bands           426.96s      464.64s        37.68
cegterg           422.70s      458.23s        35.53
h_psi             286.17s      299.63s        13.46
diaghg             59.16s       64.07s         4.91
cft3s             240.42s      245.56s         5.14
fft_scatter       142.73s      124.84s       -17.89

so, it seems that the main difference is just in the 
diagonalization-related routines, right?
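
A minimal Python sketch of the comparison above, for anyone who wants to repeat it on other runs. It assumes the times are typed in by hand from the CPU time report at the end of each output; the helper names (to_seconds, differences) are illustrative only and are not part of QE.

```python
import re

def to_seconds(text):
    """Convert a time such as '8m49.79s' or '44.22s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?\s*(?:([\d.]+)s)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised time: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0.0)

# (routine, 3.2.3 time, 4.0.4 time), copied from the results quoted above
timings = [
    ("total CPU time", "8m49.79s", "9m26.57s"),
    ("init_run", "44.22s", "42.89s"),
    ("electrons", "485.50s", "522.73s"),
    ("c_bands", "426.96s", "464.64s"),
    ("cegterg", "422.70s", "458.23s"),
    ("h_psi", "286.17s", "299.63s"),
    ("diaghg", "59.16s", "64.07s"),
    ("cft3s", "240.42s", "245.56s"),
    ("fft_scatter", "142.73s", "124.84s"),
]

def differences(rows):
    """Return {routine: new - old} in seconds, rounded to 0.01 s."""
    return {name: round(to_seconds(new) - to_seconds(old), 2)
            for name, old, new in rows}

diffs = differences(timings)

# Print the routines ordered by the size of the difference, largest first.
for name, delta in sorted(diffs.items(), key=lambda item: -abs(item[1])):
    print(f"{name:15s} {delta:+8.2f} s")
```

Sorting by absolute difference makes it easier to see which routines account for most of the gap. Note that the routines are nested (for instance cegterg is called inside c_bands), so the differences should not be summed.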

>> - it seems that -ndiag 1 (serial algorithm for the iterative
>> solution of
>> the eigenvalue problem) always performs a little bit better than the
>> default (code) choice. I attribute this to the fact that only for VERY
>> LARGE number of electrons this may give a difference, is that right?
> VERY LARGE maybe not, but you will gain (or lose) very little unless
> you have let's say several hundreds electronic states

I'll run more tests asap; if I lose only a little, that is OK. In the 
above test, turning parallel diagonalization "off" gave 9m26.57s CPU time,
against 10m51.52s (ortho sub-group =    4*   4 procs), which is about 85s 
faster, namely ~ 13%. Btw, in this case using 3.2.3 gave 8m49.79s (see 
above), which is a further ~ 6% gain.
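
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the three totals are the CPU times quoted in this message, and the variable and function names are illustrative.

```python
# Total CPU times quoted above, converted to seconds.
parallel_404 = 10 * 60 + 51.52   # 4.0.4, ortho sub-group 4*4
serial_404 = 9 * 60 + 26.57      # 4.0.4, -ndiag 1
serial_323 = 8 * 60 + 49.79      # 3.2.3

def relative_gain(slow, fast):
    """Percentage of the slower time that is saved by the faster run."""
    return 100.0 * (slow - fast) / slow

saved_ndiag = parallel_404 - serial_404
gain_ndiag = relative_gain(parallel_404, serial_404)
gain_version = relative_gain(serial_404, serial_323)

print(f"-ndiag 1 vs parallel diag (4.0.4): {saved_ndiag:.2f} s, {gain_ndiag:.1f}%")
print(f"3.2.3 vs 4.0.4 (-ndiag 1):         {gain_version:.1f}%")
```

Each percentage is taken relative to the slower of the two runs being compared, so the two gains refer to different baselines.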

I understand that the gain would increase very fast with the number of 
electrons, but if you imagine just this test running for one
day, the difference may become relevant.

Can it be due to wrong settings of my cluster?

I compiled both 3.2.3 and 4.0.4 using the same compiler (I never changed 
it since the 1st installation of the machine), libraries, etc.
The make.sys is generated in both cases using the configure script of 
the corresponding version. The only difference is that in one case 
(3.2.3) I turned on the wannier library (-D__WANLIB) and that a "wrong" line
LDFLAGS        = static -openmp
overwrites the above one
LDFLAGS        = -i-static -openmp
which instead is correctly reported for 4.0.4.
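
One way to make sure no other setting differs between the two builds is to compare the variable assignments in the two make.sys files. The sketch below is an assumption-laden helper, not a QE tool: it only understands plain "NAME = value" lines (no backslash continuations, no ":=" or "+="), and, as in make, a later assignment replaces an earlier one. The two sample strings contain only the LDFLAGS lines quoted above.

```python
def parse_make_vars(text):
    """Collect NAME = value assignments; the last assignment to a name wins."""
    variables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        variables[name.strip()] = " ".join(value.split())
    return variables

def differing_vars(text_a, text_b):
    """Return {name: (value_in_a, value_in_b)} for variables that differ."""
    a, b = parse_make_vars(text_a), parse_make_vars(text_b)
    return {name: (a.get(name), b.get(name))
            for name in sorted(set(a) | set(b))
            if a.get(name) != b.get(name)}

# Only the lines quoted in this message; the real files contain much more.
sample_323 = (
    "LDFLAGS        = -i-static -openmp\n"
    "LDFLAGS        = static -openmp\n"
)
sample_404 = "LDFLAGS        = -i-static -openmp\n"

for name, (old, new) in differing_vars(sample_323, sample_404).items():
    print(f"{name}: 3.2.3 -> {old!r}   4.0.4 -> {new!r}")
```

To run it on the real files, read each make.sys into a string (for example with open(path).read(), where path is wherever each version was unpacked) and pass the two strings to differing_vars.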



Dr. Giovanni Cantele
Coherentia CNR-INFM and Dipartimento di Scienze Fisiche
Universita' di Napoli "Federico II"
Complesso Universitario di Monte S. Angelo - Ed. 6
Via Cintia, I-80126, Napoli, Italy
Phone: +39 081 676910
Fax:   +39 081 676346
E-mail: giovanni.cantele at cnr.it
         giovanni.cantele at na.infn.it
Web: http://people.na.infn.it/~cantele
Research Group: http://www.nanomat.unina.it
