Paolo Giannozzi
Wed Jun 12 18:30:23 CEST 2013

Your unit cells are quite large, your cutoff is not small, and you
use spin-orbit, a feature that increases the memory footprint and
is less optimized than "plain-vanilla" calculations. In order to
run such large jobs, one needs to know quite a bit about the
inner workings of parallelization: which arrays are distributed
and which are not. The following arrays, for instance:

>         Each <psi_i|beta_j> matrix    350.63 Mb     (   5440, 2, 2112)

are not distributed. This is the kind of array that causes bottlenecks:
if you have N MPI processes per node, you have N copies of each such
array filling the same physical memory. Reducing the number of MPI
processes per node and using OpenMP instead might be a good strategy.
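
To get a feeling for the numbers, here is a rough back-of-the-envelope
sketch in Python. It is not taken from the code: it assumes that each
element is a double-precision complex number (16 bytes), an assumption
that is consistent with the 350.63 Mb reported above for dimensions
(5440, 2, 2112). The function names and the process counts per node are
hypothetical, chosen only for illustration, and the estimate covers a
single array of this kind per process.

```python
# Assumption: each element is a double-precision complex number (16 bytes).
BYTES_PER_COMPLEX = 16


def array_bytes(dims, bytes_per_element=BYTES_PER_COMPLEX):
    """Size in bytes of one dense array with the given dimensions."""
    n_elements = 1
    for d in dims:
        n_elements *= d
    return n_elements * bytes_per_element


def replicated_mb_per_node(dims, mpi_per_node):
    """Memory (in units of 1024**2 bytes) taken on one node when every
    MPI process on that node holds its own full copy of the array."""
    return mpi_per_node * array_bytes(dims) / 1024**2


# Dimensions as reported in the output quoted above.
dims = (5440, 2, 2112)

print("one copy:", array_bytes(dims) / 1024**2, "Mb")

# Hypothetical numbers of MPI processes per node, for comparison only.
for n in (16, 8, 4, 1):
    print(n, "MPI processes per node:", replicated_mb_per_node(dims, n), "Mb")
```

Since the array is replicated, its footprint per node scales linearly
with the number of MPI processes on that node. OpenMP threads belonging
to the same MPI process share a single copy, which is why trading MPI
processes for threads can relieve this kind of bottleneck.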

