[Pw_forum] PW taskgroups and a large run on a BG/P

David Farrell davidfarrell2008 at u.northwestern.edu
Wed Jan 28 22:18:17 CET 2009


Oddly enough, the same input file, run in dual mode with 1 taskgroup (so
each process should have access to 1 GB of RAM), doesn't produce the
previous error, but rather this one:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      from  pdpotf  : error #         1
      problems computing cholesky decomposition
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

I would have expected this one to fail the same way. The fix suggested on
the mailing list seems to be to disable the parallel Cholesky
decomposition, but that doesn't seem like a very good option in my case.

I am trying to re-run this case to see if this error is reproducible, and
I am also trying the smp-mode version with 1 taskgroup to see if I can get
a better read on where the MPI_Scatterv is being called from (for some
reason there was no core file for the master process). I am also not
really sure how to go about finding out the send buffer size; I guess a
debugger may be the only option?
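In the meantime, some back-of-the-envelope arithmetic is possible without a debugger: the error message gives rcount = 230400 doubles per rank, and for a scatter the root has to hold roughly one such slice per receiving rank in its send buffer. A minimal sketch (the 512-rank count here is a hypothetical example, not the actual partition size of the run):

```python
# Rough sizing of the MPI_Scatterv buffers from the error message.
# RCOUNT comes from the error output; NRANKS is a hypothetical example,
# NOT the actual number of ranks in the failing run.
RCOUNT = 230400           # doubles received per rank (from the error)
NRANKS = 512              # hypothetical number of MPI ranks
BYTES_PER_DOUBLE = 8

slice_mb = RCOUNT * BYTES_PER_DOUBLE / 2**20   # one rank's share
root_sendbuf_mb = slice_mb * NRANKS            # root holds all shares

print(f"per-rank slice: {slice_mb:.2f} MB")            # ~1.76 MB
print(f"root send buffer: {root_sendbuf_mb:.2f} MB")   # ~900 MB at 512 ranks
```

At ~1.76 MB per rank, a few hundred ranks already push the root's send buffer toward the 1 GB per-process limit of dual mode, on top of the wavefunction and pseudopotential arrays, which would be consistent with an out-of-memory failure on the root.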

Dave




On Jan 28, 2009, at 2:04 PM, Axel Kohlmeyer wrote:

> On Wed, 28 Jan 2009, David Farrell wrote:
>
>
> [...]
>
> DF>     Largest allocated arrays     est. size (Mb)     dimensions
> DF>        Kohn-Sham Wavefunctions        73.76 Mb     (   3147,1536)
> DF>        NL pseudopotentials           227.42 Mb     (   3147,4736)
> DF>        Each V/rho on FFT grid          3.52 Mb     ( 230400)
> DF>        Each G-vector array             0.19 Mb     (  25061)
> DF>        G-vector shells                 0.08 Mb     (  10422)
> DF>     Largest temporary arrays     est. size (Mb)     dimensions
> DF>        Auxiliary wavefunctions        73.76 Mb     (   3147,3072)
> DF>        Each subspace H/S matrix       72.00 Mb     (   3072,3072)
> DF>        Each <psi_i|beta_j> matrix     55.50 Mb     (   4736,1536)
> DF>        Arrays for rho mixing          28.12 Mb     ( 230400,   8)
> DF>
> [...]
> DF> with an error like this in the stderr file:
> DF>
> DF> Abort(1) on node 210 (rank 210 in comm 1140850688): Fatal error in
> DF> MPI_Scatterv: Other MPI error, error stack:
> DF> MPI_Scatterv(360): MPI_Scatterv(sbuf=0x36c02010, scnts=0x7fffa940,
> DF> displs=0x7fffb940, MPI_DOUBLE_PRECISION, rbuf=0x4b83010,
> DF> rcount=230400, MPI_DOUBLE_PRECISION, root=0, comm=0x84000002) failed
> DF> MPI_Scatterv(100): Out of memory
> DF>
> DF> So I figure I am running out of memory on a node at some point...
> DF> but not entirely sure where (seems to be in the first electronic
> DF> step) or how to get around it.
>
> it dies on the processor calling MPI_Scatterv, probably the
> (group)master(s). what is interesting is that the rcount size matches
> the "arrays for rho mixing", so i would suggest to first have a look
> there and try to determine how large the combined send buffers are.
>
> cheers,
>   axel.
>
>
> DF>
> DF> Any help would be appreciated.
> DF>
> DF> Dave
> DF>
> DF>
> DF>
> DF>
> DF> David E. Farrell
> DF> Post-Doctoral Fellow
> DF> Department of Materials Science and Engineering
> DF> Northwestern University
> DF> email: d-farrell2 at northwestern.edu
> DF>
> DF>
>
> -- 
> =======================================================================
> Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
>   Center for Molecular Modeling   --   University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.

David E. Farrell
Post-Doctoral Fellow
Department of Materials Science and Engineering
Northwestern University
email: d-farrell2 at northwestern.edu
