[Pw_forum] PW taskgroups and a large run on a BG/P

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Wed Jan 28 22:58:54 CET 2009


On Wed, 28 Jan 2009, David Farrell wrote:

DF> I am trying to re-run this case to see if this error is reproducible, and
DF> trying the smp-mode version with 1 taskgroup to see if I can get a better
DF> read on where the MPI_Scatterv is being called from (there was no core file
DF> for the master process for some reason.)... and I am not really sure how to
DF> go about finding out the send buffer size (I guess a debugger may be the
DF> only option?)

grep does the job. MPI_Scatterv is called only in fft_base.f90, and there
only in three subroutines. pw.x uses just two of them. one is called from
exx.f90, which you most likely don't use; the other is called from
psymrho.f90/psymrho_mag.f90, which is consistent with the problem happening
at the rho-mixing stage. there is a "grid size" allocate/deallocate
around the call to grid_scatter in psymrho.f90, so that may lead to
exhausting the memory when MPI_Scatterv needs some internally...
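
a rough way to put numbers on this, as a sketch only: the sizes printed by
pw.x are consistent with 16-byte complex entries and "Mb" meaning 2**20
bytes, and the send buffer on the root of an MPI_Scatterv holds the sum of
all send counts, at 8 bytes each for MPI_DOUBLE_PRECISION. the helper names,
the assumption that every rank receives the same rcount=230400 reported in
the error message, and the process counts in the loop are illustrative
assumptions, not values taken from the actual run.

```python
MIB = 1024 ** 2          # bytes per "Mb" as used in the pw.x memory report

NRXX = 230400            # FFT grid size per process, from the pw.x output
DOUBLE = 8               # bytes per MPI_DOUBLE_PRECISION element
DOUBLE_COMPLEX = 16      # bytes per double-precision complex element


def mib(n_elements, bytes_per_element):
    """Return the size in MiB of an array with n_elements entries."""
    return n_elements * bytes_per_element / MIB


def scatterv_send_mib(send_counts, bytes_per_element=DOUBLE):
    """Size in MiB of the root's send buffer: the sum of all send counts."""
    return mib(sum(send_counts), bytes_per_element)


# Cross-check against two rows of the memory report.
v_rho = mib(NRXX, DOUBLE_COMPLEX)          # "Each V/rho on FFT grid"
rho_mix = mib(NRXX * 8, DOUBLE_COMPLEX)    # "Arrays for rho mixing"

# Receive buffer of one rank, from rcount=230400 in the error message.
recv = mib(NRXX, DOUBLE)

print(f"V/rho on FFT grid : {v_rho:8.2f} MiB")
print(f"rho mixing arrays : {rho_mix:8.2f} MiB")
print(f"one receive buffer: {recv:8.2f} MiB")

# Hypothetical process counts; assumes equal send counts for every rank.
for nproc in (256, 1024, 4096):
    total = scatterv_send_mib([NRXX] * nproc)
    print(f"{nproc:5d} ranks -> root send buffer ~ {total:8.1f} MiB")
```

if the real send counts differ per rank, pass the actual list to
scatterv_send_mib instead of the equal-count placeholder. the point is only
that the root's buffer grows with the number of ranks while the memory per
node does not.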

HTH,
   axel.


DF> 
DF> Dave
DF> 
DF> 
DF> 
DF> 
DF> On Jan 28, 2009, at 2:04 PM, Axel Kohlmeyer wrote:
DF> 
DF> >On Wed, 28 Jan 2009, David Farrell wrote:
DF> >
DF> >
DF> >[...]
DF> >
DF> > DF>     Largest allocated arrays     est. size (Mb)     dimensions
DF> > DF>        Kohn-Sham Wavefunctions        73.76 Mb     (   3147,1536)
DF> > DF>        NL pseudopotentials           227.42 Mb     (   3147,4736)
DF> > DF>        Each V/rho on FFT grid          3.52 Mb     ( 230400)
DF> > DF>        Each G-vector array             0.19 Mb     (  25061)
DF> > DF>        G-vector shells                 0.08 Mb     (  10422)
DF> > DF>     Largest temporary arrays     est. size (Mb)     dimensions
DF> > DF>        Auxiliary wavefunctions        73.76 Mb     (   3147,3072)
DF> > DF>        Each subspace H/S matrix       72.00 Mb     (   3072,3072)
DF> > DF>        Each <psi_i|beta_j> matrix     55.50 Mb     (   4736,1536)
DF> > DF>        Arrays for rho mixing          28.12 Mb     ( 230400,   8)
DF> > DF>
DF> >[...]
DF> > DF> with an error like this in the stderr file:
DF> > DF>
DF> > DF> Abort(1) on node 210 (rank 210 in comm 1140850688): Fatal error in
DF> > DF> MPI_Scatterv: Other MPI error, error stack:
DF> > DF> MPI_Scatterv(360): MPI_Scatterv(sbuf=0x36c02010, scnts=0x7fffa940,
DF> > DF> displs=0x7fffb940, MPI_DOUBLE_PRECISION,
DF> > DF> rbuf=0x4b83010, rcount=230400, MPI_DOUBLE_PRECISION, root=0,
DF> > DF> comm=0x84000002) failed
DF> > DF> MPI_Scatterv(100): Out of memory
DF> > DF>
DF> > DF> So I figure I am running out of memory on a node at some point... but
DF> > DF> not entirely sure where (seems to be in the first electronic step) or
DF> > DF> how to get around it.
DF> >
DF> >it dies on the processor calling MPI_Scatterv, probably the
DF> >(group)master(s).
DF> >what is interesting is that the rcount size matches the "arrays for rho
DF> >mixing", so i would suggest to first have a look there and try to
DF> >determine how large the combined send buffers are.
DF> >
DF> >cheers,
DF> >  axel.
DF> >
DF> >
DF> > DF>
DF> > DF> Any help would be appreciated.
DF> > DF>
DF> > DF> Dave
DF> > DF>
DF> > DF>
DF> > DF>
DF> > DF>
DF> > DF> David E. Farrell
DF> > DF> Post-Doctoral Fellow
DF> > DF> Department of Materials Science and Engineering
DF> > DF> Northwestern University
DF> > DF> email: d-farrell2 at northwestern.edu
DF> > DF>
DF> > DF>
DF> >
DF> >-- 
DF> >=======================================================================
DF> >Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
DF> >  Center for Molecular Modeling   --   University of Pennsylvania
DF> >Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
DF> >tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
DF> >=======================================================================
DF> >If you make something idiot-proof, the universe creates a better idiot.
DF> 
DF> 
DF> 



