[Pw_forum] PW taskgroups and a large run on a BG/P

Wed Feb 11 20:16:30 CET 2009

I was able to make a bit more progress, at least in the direction of
seeing what breaks as I get to larger numbers of electrons. It
seems to be mainly due to a number of large allocations that all of
the processes carry out. I don't yet know enough about the code to
be sure, but I suspect these aren't all necessary.

I have found two that have caused problems in different runs. The first
was the problem that Axel pointed out earlier in this thread; the second
is this one:

In my periodic bulk case (1156 atoms, 2560 electrons), running on 1024  
procs in vn mode (256 nodes, 512 MB RAM/process):
"add_vuspsi.f90", line 78: 1525-108 Error encountered while attempting  
to allocate a data object.  The program will stop.

At this line there is an allocation of size (number of projectors for
the atom types) x (number of states), so something like 4736 x 1536,
or ~55 MB, which seems to be pushing it over the limit. This prevented
the run from getting to the point where it would output info about the
SCF steps.
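
As a quick sanity check on that ~55 MB figure, here is a minimal Python
sketch of the arithmetic. The helper name array_size_mib and the
constant MIB are made up for illustration. Two assumptions are baked
in: that "MB" means 2**20 bytes, and that the array holds either 8-byte
(double precision real) or 16-byte (double precision complex) elements.
Under those assumptions, the 8-byte case reproduces the 55.50 Mb that
the memory report lists for each <psi_i|beta_j> matrix.

```python
# Assumption: "MB" is taken as 2**20 bytes.
MIB = 1024 ** 2


def array_size_mib(n_rows, n_cols, bytes_per_element):
    """Size of a dense n_rows x n_cols array in units of 2**20 bytes."""
    return n_rows * n_cols * bytes_per_element / MIB


n_projectors = 4736  # from the post: projectors for the atom types
n_states = 1536      # from the post: number of states

# Assumption: 8 bytes = double precision real, 16 = double precision complex.
real_mib = array_size_mib(n_projectors, n_states, 8)
complex_mib = array_size_mib(n_projectors, n_states, 16)

print(f"8-byte elements:  {real_mib:.2f} MB")     # 55.50 MB
print(f"16-byte elements: {complex_mib:.2f} MB")  # 111.00 MB
```

If the elements were complex, the same allocation would be twice as
large, so which case applies matters a lot at 512 MB per process.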

My memory report output for this run looks like:

      Largest allocated arrays     est. size (Mb)     dimensions
         Kohn-Sham Wavefunctions         3.35 Mb     (    143,1536)
         NL pseudopotentials            10.33 Mb     (    143,4736)
         Each V/rho on FFT grid          0.78 Mb     (  51200)
         Each G-vector array             0.01 Mb     (   1006)
         G-vector shells                 0.00 Mb     (    488)
      Largest temporary arrays     est. size (Mb)     dimensions
         Auxiliary wavefunctions         6.70 Mb     (    143,6144)
         Each subspace H/S matrix      288.00 Mb     (   6144,6144)
         Each <psi_i|beta_j> matrix     55.50 Mb     (   4736,1536)
         Arrays for rho mixing           6.25 Mb     (  51200,   8)

So I am guessing that something isn't being distributed across the
processes properly, or at least not very efficiently; the subspace
Hamiltonian and overlap matrices (288 MB each in the report above) are
probably the place to start.
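
To put rough numbers on that guess, here is a small Python sketch that
adds up the entries in the memory report whose dimensions do not
include the per-process plane-wave count (143) and compares the total
with the per-process memory. The sizes are copied from the report. The
number of copies of each array is an assumption (one H and one S
matrix, a single <psi_i|beta_j> matrix), and so is the idea that these
arrays are held in full by every process. The names replicated,
total_mb and fits are hypothetical.

```python
# (label, size in MB as printed in the memory report, assumed copies)
replicated = [
    ("subspace H/S matrix", 288.00, 2),   # assumption: one each for H and S
    ("<psi_i|beta_j> matrix", 55.50, 1),  # assumption: a single copy
]


def total_mb(arrays):
    """Sum of size * copies over all listed arrays."""
    return sum(size * copies for _, size, copies in arrays)


def fits(arrays, limit_mb):
    """True if the listed arrays alone stay within limit_mb."""
    return total_mb(arrays) <= limit_mb


# Per-process memory from the post: 512 MB in vn mode, 1 GB in dual mode.
limits_mb = {"vn mode": 512, "dual mode": 1024}

total = total_mb(replicated)
for mode, limit in limits_mb.items():
    verdict = "fits" if fits(replicated, limit) else "does not fit"
    print(f"{mode}: {total:.1f} MB of {limit} MB, {verdict}")
```

This ignores everything else each process has to hold (the executable,
MPI buffers, the distributed arrays), so it is a lower bound, not a
prediction of where the run will actually fail.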

The above case in dual mode (512 nodes, 1 GB RAM/process) was able to  
get into the SCF stages, and do some output:

Self-consistent Calculation

      iteration #  1     ecut=    38.22 Ry     beta=0.70
      Davidson diagonalization with overlap

but then died with the following memory-related error:

"regterg.f90", line 108: 1525-108 Error encountered while attempting  
to allocate a data object.  The program will stop.

This appears to be caused by an allocation whose size looks like
(number of plane waves) x (number of states), or some subset of them.

I guess this isn't really a surprise to you guys, and I am not really  
sure what to do about it, but at least I am now getting some idea of  
what is causing the breakdown.

Dave

On Feb 11, 2009, at 7:18 AM, Paolo Giannozzi wrote:

> Hi, any news on your BG problem? Paolo
>
> -- 
> Paolo Giannozzi, Democritos and University of Udine, Italy

David E. Farrell
Post-Doctoral Fellow
Department of Materials Science and Engineering
Northwestern University
email: d-farrell2 at northwestern.edu
