[Pw_forum] timing vs QE version vs pools

Giovanni Cantele Giovanni.Cantele at na.infn.it
Fri Mar 6 18:15:26 CET 2009


Dear all,

I just finished some timing tests on QE, stimulated by recent
discussions in this forum. I would like to share these tests and ask a
couple of simple (?) questions.

Up to now, even if not backed by systematic tests, my feeling has been
that, for the kind of runs I do on the kind of cluster(s) I use, the
following rules hold:

i) with a relatively "small" number of CPUs, pools are rather
ineffective; I often found that using pools actually increased the job
execution time

ii) diagonalization parallelism (-ndiag XX) is ineffective as well
(both switches go on the launch line, as in the sketch after this list)

iii) version 3.2.3 is "better" than 4.0.4
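
Just to fix the notation, the launch line I mean is of this kind (the
mpirun launcher and the input file name are generic placeholders):

   mpirun -np 16 pw.x -npool 4 -ndiag 1 < graphene.in > graphene.out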


Bearing in mind that I understand very well that these kinds of issues
are often related to the well-known problem arising between the
keyboard and the chair, these are my results:


- system: graphene, 4x4 unit cell (only C atoms)

                ibrav        = 4
                celldm(1)    = 18.6
                celldm(3)    = 1.36250
                nat          = 32
                ntyp         = 1
                ecutwfc      = 30.D0
                ecutrho      = 180.D0
                nosym        = .true.

                   FFT grid: ( 80, 80, 120)
                smooth grid: ( 72, 72,  90)
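
For completeness, a skeleton of the input consistent with the above
(the pseudopotential file, the 32 atomic positions, and the k-point
mesh are placeholders, not the ones actually used):

   &control
      calculation = 'scf'
      prefix      = 'graphene44'   ! placeholder
   /
   &system
      ibrav     = 4
      celldm(1) = 18.6
      celldm(3) = 1.36250
      nat       = 32
      ntyp      = 1
      ecutwfc   = 30.D0
      ecutrho   = 180.D0
      nosym     = .true.
   /
   &electrons
   /
   ATOMIC_SPECIES
    C  12.011  C.pz-rrkjus.UPF
   ATOMIC_POSITIONS alat
   ! ... the 32 carbon positions go here (omitted) ...
   K_POINTS automatic
    8 8 1 0 0 0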

- computational resources: cluster with dual-core, two-processor nodes
  each processor is an Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
  Infiniband connection between nodes
  compiler: ifort (Intel) 10.1.012
  libraries: MKL 10.0.1.014

My tests covered all possible combinations of the following parameters:
- # k-points: 32 or 64
- QE version: 3.2.3 or 4.0.4
- ndiag: default or 1 (only for 4.0.4)
- ncpus: 16 or 32
- npools: 1, 2, 4, 8 (in just a few cases, 16); a sketch of the sweep
  script follows below
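
The combinations above can be swept with a simple loop of this kind (a
sketch; executable paths and file names are placeholders):

   #!/bin/bash
   # sweep the pool count at a fixed number of MPI tasks
   NCPUS=32
   for NPOOL in 1 2 4 8 16; do
      mpirun -np ${NCPUS} pw.x -npool ${NPOOL} \
         < graphene.in > out.ncpus${NCPUS}.npool${NPOOL} 2>&1
   done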

--------------------------------------------------------------------
64 k points
npools      CPU time         wall time
-------------------------------------- ncpus: 16  nodes: 4  QE 3.2.3
  1         8m49.79s          9m 5.89s
  2         9m45.15s          9m59.12s
  4        12m59.18s         13m14.34s
  8        16m 2.94s         16m44.26s
-------------------------------------- ncpus: 16  nodes: 4  QE 4.0.4
  1        10m51.52s         11m15.80s
  2        10m20.63s         10m43.21s
  4        13m21.63s         13m45.10s
  8        15m59.51s         16m54.57s

  1         9m26.57s          9m41.97s   -ndiag 1
  2        10m22.68s         10m37.74s
  4        13m 6.11s         13m25.57s
  8        16m 3.49s         17m 3.31s
-------------------------------------- ncpus: 32  nodes: 8  QE 3.2.3
  1        15m36.93s         15m42.65s
  2         5m30.32s          5m37.54s
  4         5m 7.28s          5m14.20s
  8         6m54.27s          7m 1.82s
 16         9m 6.85s          9m28.26s
-------------------------------------- ncpus: 32  nodes: 8  QE 4.0.4
  1        17m38.71s         17m57.28s
  2         9m 7.36s          9m21.41s
  4         5m31.61s          5m43.00s
  8         7m 1.58s          7m13.86s
 16         9m 3.48s          9m33.50s

  1        16m 9.52s         16m18.73s   -ndiag 1
  2         5m49.49s          5m58.75s
  4         5m21.05s          5m29.89s
  8         6m50.59s          7m 1.20s
 16         8m58.71s          9m29.50s
--------------------------------------------------------------------

--------------------------------------------------------------------
32 k points
npools      CPU time         wall time
-------------------------------------- ncpus: 16  nodes: 4  QE 3.2.3
  1         4m44.63s          4m53.08s
  2         5m11.63s          5m19.66s
  4         7m 8.90s          7m17.45s
  8         8m56.42s          9m18.75s
-------------------------------------- ncpus: 16  nodes: 4  QE 4.0.4
  1         5m54.59s          6m 8.06s
  2         5m46.40s          5m56.81s
  4         7m15.19s          7m28.52s
  8         8m58.10s          9m26.16s

  1         5m 3.20s          5m11.53s   -ndiag 1
  2         5m13.45s          5m23.68s
  4         7m 2.97s          7m14.12s
  8         8m55.15s          9m24.76s
-------------------------------------- ncpus: 32  nodes: 8  QE 3.2.3
  1         3m24.75s          3m28.51s
  2         2m34.29s          2m38.63s
  4         2m57.55s          3m 2.09s
  8         4m14.11s          4m19.09s
-------------------------------------- ncpus: 32  nodes: 8  QE 4.0.4
  1         4m36.23s          4m46.02s
  2         3m14.38s          3m20.97s
  4         3m18.11s          3m23.60s
  8         4m26.24s          4m32.57s

  1         3m31.81s          3m37.10s   -ndiag 1
  2         2m45.79s          2m50.33s
  4         3m 2.72s          3m 7.82s
  8         4m18.45s          4m24.00s
--------------------------------------------------------------------



comments:

- using pools is, in this case, effective only if ncpus=32. I attribute
this to the fact that the FFT is the true bottleneck, so FFT-grid
parallelization should be saturated first. However, FFT parallelization
is still effective going from 16 to 32 CPUs (maybe due to contention
among cores on the same node when too many cores per node are used at
ncpus=16), and pool parallelization pays off only with -npool 2; a
quick plane count follows.
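
One way to make the plane argument concrete (pw.x distributes the FFT
grid as planes along the third direction):

   dense grid, 120 planes:  16 CPUs, 1 pool  -> 7-8 planes per CPU
                            32 CPUs, 1 pool  -> 3-4 planes per CPU
   ncpus=32 with -npool 2:  16 CPUs per pool -> back to 7-8 planes per CPU

so with 32 CPUs in a single pool each CPU holds very few planes and the
communication overhead grows, while splitting into pools restores a
better computation/communication balance.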

- it seems that QE 3.2.3 always performs a little bit better than
4.0.4; any hint on what (if anything) is wrong in what I'm doing?

- it seems that -ndiag 1 (serial algorithm for the iterative solution
of the eigenvalue problem) always performs a little bit better than the
default choice made by the code. I attribute this to the fact that
parallel diagonalization should make a difference only for a VERY LARGE
number of electrons; is that right? Note that the code default would
enable this kind of parallelization for these runs as well (an example
of setting it explicitly is below).
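
As far as I understand, -ndiag N asks for N processors for the parallel
diagonalization, internally arranged as a square grid (so N is reduced
to the nearest square); e.g. (values here are only an example):

   mpirun -np 32 pw.x -npool 2 -ndiag 4 < graphene.in > graphene.out

which would use a 2x2 processor grid for the linear algebra, while all
32 tasks are still used for the rest.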


Thanks a lot for reading this message.


Giovanni


-- 



Dr. Giovanni Cantele
Coherentia CNR-INFM and Dipartimento di Scienze Fisiche
Universita' di Napoli "Federico II"
Complesso Universitario di Monte S. Angelo - Ed. 6
Via Cintia, I-80126, Napoli, Italy
Phone: +39 081 676910
Fax:   +39 081 676346
E-mail: giovanni.cantele at cnr.it
        giovanni.cantele at na.infn.it
Web: http://people.na.infn.it/~cantele
Research Group: http://www.nanomat.unina.it



