We have a rather large system (a=b=c=20 Å) with 646 atoms and 2160 states. We are running it at the Gamma point with the PBE functional (no hybrid).
Here are our settings:
We use 6*12=72 processors.
Task groups = 1
We set nr1s, nr2s and nr3s to 216 so that 72 is a divisor of 216, as suggested in the manual (we are using USPP, ecut=48 Ry, ecutRho=10*ecut).
Parallel diagonalization uses a 6*6 processor grid (option ndiag=36).
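
To make the arithmetic behind these settings explicit, here is a small sanity check in Python. It is only a sketch of the numbers quoted above, not anything QE runs itself. The variable names are ours, and it assumes that the divisibility rule refers to FFT planes being shared evenly among the processors.

```python
import math

# Values quoted in the post (the variable names are hypothetical).
n_procs = 72      # 6 * 12 processors
fft_dim = 216     # nr1s = nr2s = nr3s
ecut = 48.0       # wavefunction cutoff in Ry
ndiag = 36        # processors arranged as a square grid

# 72 must divide 216 so that every processor gets the same number of
# FFT planes (assumption: this is what the divisibility rule is for).
assert fft_dim % n_procs == 0
planes_per_proc = fft_dim // n_procs

# ecutRho = 10 * ecut, as stated above.
ecutrho = 10 * ecut

# ndiag = 36 has to be a perfect square to form a 6*6 grid.
grid_side = math.isqrt(ndiag)
assert grid_side * grid_side == ndiag

print(f"FFT planes per processor: {planes_per_proc}")
print(f"ecutRho: {ecutrho:.0f} Ry")
print(f"ndiag grid: {grid_side} x {grid_side}")
```

With the values above, each processor holds 3 planes of the 216-point grid, ecutRho is 480 Ry, and ndiag=36 corresponds to a 6 x 6 grid, so the settings are at least self-consistent.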

The job runs for 1 hour (the limit I set in the queuing system for this test), but QE does not complete even the first step of the SCF cycle. The last lines of the output are:

     iteration #  1     ecut=    48.00 Ry     beta=0.20
     Davidson diagonalization with overlap

Are the above settings wrong? Or should we just use more processors? 
Could the OpenMP option be useful in this case?

