[Pw_forum] How to use gamma-point calculations with high efficiency
Janos Kiss
janos.kiss at theochem.ruhr-uni-bochum.de
Sun Jun 29 13:40:51 CEST 2008
Dear Vega Lew,
>I compiled Q-E on my cluster with version 10.1.015 of the Intel compilers
>successfully and correctly. Now my cluster calculates
>very fast when doing structure relaxations with 30-40 k-points. But
>on my cluster, which has 5 quad-core CPUs, I must
>use 20 pools to get the highest CPU usage (most of the time 90%+, but it's
>unstable; 70%+ on average was shown by the 'sar'
>command).
Does this mean that you have single-socket quad-core machines, and that you
have 5 nodes in your cluster? What is the interconnect between the machines?
I'm a novice PWSCF user myself, but as far as I understand (it is actually
made relatively clear in the manual), if you do a calculation with k-points,
then you have two levels of parallelization: one over the G-space, and an
additional one over the k-points. Let me show you with my own example how
this can be exploited:
I use dual-socket Xeon machines; each CPU has 4 cores (8 CPU
cores/machine). The nodes (machines) communicate via Gigabit Ethernet.
The FFT mesh of my supercell has 180 points in the z direction.
If you look at the beginning of your output file, you will see something
like:
 Planes per process (thick) : nr3 = 180  npp = 30  ncplane = 6400

 Proc/  planes  cols     G    planes  cols     G    columns     G
 Pool    (dense grid)          (smooth grid)        (wavefct grid)
   1      30    711   83109     30    711   83109     189   11381
   2      30    711   83111     30    711   83111     189   11379
   3      30    711   83109     30    711   83109     190   11386
   4      30    711   83111     30    711   83111     189   11377
   5      30    712   83114     30    712   83114     189   11381
   6      30    711   83107     30    711   83107     189   11377
   0     180   4267  498661    180   4267  498661    1135   68281
This means that I use the parallelization over the G-space with 6 CPU cores
in each machine (180 planes divided among 6 cores gives npp = 30). Each CPU
core is calculating 30 z-planes. The communication is much more critical here
than between k-points, so it would be good to check in your own output how
many CPU cores from a single machine are appropriate for your mesh.
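To make the arithmetic explicit (this is my reading of the output above, for
my nr3 = 180): the z-planes are distributed evenly only if the number of
G-space processes divides nr3, so on an 8-core node the sensible choices are
divisors of 180:

    180 / 4 = 45 planes/core
    180 / 5 = 36 planes/core
    180 / 6 = 30 planes/core   <- what I use

With 7 or 8 cores, 180 does not divide evenly, so some cores carry an extra
plane while the others wait.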
My supercell is relatively large, so with six k-points the binding energy
for my setup is well converged.
Therefore, I set npool=3. This means that I use the parallelization over the
k-points (over three separate machines, using 6 CPU cores from each machine
for the G-space parallelization). Gigabit Ethernet is acceptable for the
communication between machines that the k-point parallelization needs.
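As a concrete sketch, the launch line for such a setup could look something
like the following (the exact spelling of the pool flag, -npool vs. -npools,
and whether you pass the input via -inp or stdin redirection depend on your
PWSCF and MPI versions, so treat this as an illustration):

    mpirun -np 18 pw.x -npool 3 -inp myjob.in > myjob.out

That is 18 MPI processes = 3 pools (one per machine) x 6 G-space processes
within each pool.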
>Thanks to Axel's advice, I set the environment variable
>OMP_NUM_THREADS=1. The CPU usage on all 5 computers was then always the
>same, and the calculation finishes fast.
>If I use 10 or 5 pools, the CPU usage can't reach that high. Is this up to
>snuff?
Now please have a look at what the right values are for your setup, i.e.
how many CPU cores per machine for the G-space and how many machines for the
k-point pools. Hopefully your 'CPU usage' will get better.
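For your cluster (5 nodes with 4 cores each), a natural first try, assuming
your FFT mesh divides reasonably by 4, would be one pool per machine:

    mpirun -np 20 pw.x -npool 5 -inp relax.in > relax.out

so that the latency-sensitive G-space communication stays inside each node
and only the k-point traffic crosses the Gigabit network. The file name and
the pool flag spelling are again just placeholders; check the nr3/npp numbers
in your own output as described above.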
>After testing the lattice optimizations, another question arises. I need to
>calculate the surface structure with the gamma point only, because the system
>is composed of ~80 atoms (scientists always do gamma-point-only optimizations
>in my area of research).
This again depends on your system (and on the quantity you are interested
in). You have to keep in mind that calculations with proper k-point
sampling can be several times more expensive than a gamma-point-only
calculation.
If you get good enough results with the gamma point only and a larger
supercell, then that is the happy case. If your system does not have a wide
band gap (metallic systems and so on), the gamma-point-only results are just
far off, and you are forced to do k-point sampling.
>But when I calculate the surface structure with the
>gamma point only, I can't use many pools. Therefore the CPU usage for the
>gamma-point calculation comes down to about ~20% again. How could I
>calculate with a high CPU usage?
If you do gamma point only, then you can exploit the parallelization over
the G-space only. If you want to use the G-space parallelization over the
whole cluster, then, as I mentioned, you need a fast interconnect (InfiniBand,
Myrinet, SCI, Quadrics and so on). Gigabit Ethernet is just too slow for that
(except for insanely large supercells). Even if you do have a fast
interconnect, you should still keep an eye on the right number of CPUs for
the mesh.
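Two practical points for the gamma-only case (check these against the
documentation of your version; this is how I understand it): first, make sure
you really use the gamma-only code path in the input, which works with real
wavefunctions and is roughly a factor of two cheaper in time and memory than
an equivalent 1x1x1 grid:

    K_POINTS gamma

instead of

    K_POINTS automatic
    1 1 1 0 0 0

Second, with a single k-point there is nothing to put into pools, so run
without npool and, with Gigabit only, keep all G-space processes on one node,
e.g. on one of your quad-core machines:

    mpirun -np 4 pw.x -inp surface.in > surface.out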
All the best,
Janos.
==================================================================
Janos Kiss                          e-mail: janos.kiss at theochem.ruhr-uni-bochum.de
Lehrstuhl fuer Theoretische Chemie  Phone:  +49 (0)234/32-26485
NC 03/297                                   +49 (0)234 32 26754
Ruhr-Universitaet Bochum            Fax:    +49 (0)234/32-14045
D-44780 Bochum                      http://www.theochem.ruhr-uni-bochum.de
==================================================================