[Pw_forum] PW taskgroups and a large run on a BG/P

Nichols A. Romero naromero at gmail.com
Wed Jan 28 22:37:06 CET 2009


David,

It seems that PWSCF calls ScaLAPACK for its Cholesky decomposition. If you
don't care too much about performance, doing the Cholesky decomposition
serially at your system size would not slow you down terribly. However, you
might pay a memory penalty; it depends on how the Cholesky decomposition is
done. I imagine they collect the entire overlap matrix in one place before
sending it to ScaLAPACK.
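As a back-of-envelope check (a minimal sketch; the 3072 dimension is taken from the run output quoted below, everything else is my assumption), collecting a full double-precision matrix on a single rank costs:

```python
BYTES_PER_DOUBLE = 8  # one double-precision real

def gathered_matrix_mb(n):
    """MiB needed to hold a full n x n double-precision matrix on one rank."""
    return n * n * BYTES_PER_DOUBLE / 2**20

# 3072 is the subspace matrix dimension from the run output quoted below
print(gathered_matrix_mb(3072))  # -> 72.0, matching the "Each subspace H/S matrix" line
```

So on a machine with ~512 MB-1 GB per process, gathering the whole matrix on one rank is a noticeable but probably survivable extra allocation at this system size.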

On Wed, Jan 28, 2009 at 3:18 PM, David Farrell <
davidfarrell2008 at u.northwestern.edu> wrote:

> Oddly enough, the same input file, run in dual mode with 1 taskgroup (so
> each process should have access to 1 GB of RAM), doesn't spit out the
> previous error, but rather this one:
>
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>      from  pdpotf  : error #         1
>       problems computing cholesky decomposition
>
>  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> I would have expected this one to fail in the same way. The solution to this
> from the mailing list seems to be to disable the parallel Cholesky
> decomposition, but that doesn't seem a very good option in my case.
>
> I am trying to re-run this case to see if this error is reproducible, and
> trying the SMP-mode version with 1 taskgroup to see if I can get a better
> read on where the MPI_Scatterv is being called from (there was no core file
> for the master process, for some reason). I am also not really sure how to
> go about finding out the send buffer size (I guess a debugger may be the
> only option?)
>
> Dave
>
>
>
>
> On Jan 28, 2009, at 2:04 PM, Axel Kohlmeyer wrote:
>
> On Wed, 28 Jan 2009, David Farrell wrote:
>
>
> [...]
>
> DF>     Largest allocated arrays     est. size (Mb)     dimensions
> DF>        Kohn-Sham Wavefunctions        73.76 Mb     (   3147,1536)
> DF>        NL pseudopotentials           227.42 Mb     (   3147,4736)
> DF>        Each V/rho on FFT grid          3.52 Mb     ( 230400)
> DF>        Each G-vector array             0.19 Mb     (  25061)
> DF>        G-vector shells                 0.08 Mb     (  10422)
> DF>     Largest temporary arrays     est. size (Mb)     dimensions
> DF>        Auxiliary wavefunctions        73.76 Mb     (   3147,3072)
> DF>        Each subspace H/S matrix       72.00 Mb     (   3072,3072)
> DF>        Each <psi_i|beta_j> matrix     55.50 Mb     (   4736,1536)
> DF>        Arrays for rho mixing          28.12 Mb     ( 230400,   8)
> DF>
> [...]
> DF> with an error like this in the stderr file:
> DF>
> DF> Abort(1) on node 210 (rank 210 in comm 1140850688): Fatal error in
> DF> MPI_Scatterv: Other MPI error, error stack:
> DF> MPI_Scatterv(360): MPI_Scatterv(sbuf=0x36c02010, scnts=0x7fffa940,
> DF> displs=0x7fffb940, MPI_DOUBLE_PRECISION, rbuf=0x4b83010,
> DF> rcount=230400, MPI_DOUBLE_PRECISION, root=0, comm=0x84000002) failed
> DF> MPI_Scatterv(100): Out of memory
> DF>
> DF> So I figure I am running out of memory on a node at some point... but
> DF> not entirely sure where (it seems to be in the first electronic step)
> DF> or how to get around it.
>
> it dies on the processor calling MPI_Scatterv, probably the
> (group)master(s).
> what is interesting is that the rcount size matches the "arrays for rho
> mixing", so i would suggest to first have a look there and try to
> determine how large the combined send buffers are.
>
> cheers,
>   axel.
>
>
> DF>
> DF> Any help would be appreciated.
> DF>
> DF> Dave
> DF>
> DF>
> DF>
> DF>
> DF> David E. Farrell
> DF> Post-Doctoral Fellow
> DF> Department of Materials Science and Engineering
> DF> Northwestern University
> DF> email: d-farrell2 at northwestern.edu
> DF>
> DF>
>
> --
> =======================================================================
> Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
>   Center for Molecular Modeling   --   University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
>
>
> David E. Farrell
> Post-Doctoral Fellow
> Department of Materials Science and Engineering
> Northwestern University
> email: d-farrell2 at northwestern.edu
>
>
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum
>
>
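One way to follow up on Axel's suggestion without a debugger: the root's send buffer for that MPI_Scatterv must hold rcount doubles for every rank in the communicator. A quick estimate (rcount=230400 comes from the error stack quoted above; the 512-rank count below is purely hypothetical):

```python
BYTES_PER_DOUBLE = 8
RCOUNT = 230400  # doubles per receiving rank, from the MPI_Scatterv error stack

def root_send_buffer_mb(nprocs):
    """MiB the root must hold as the combined Scatterv send buffer."""
    return nprocs * RCOUNT * BYTES_PER_DOUBLE / 2**20

# 512 ranks is a hypothetical communicator size, not taken from the run
print(root_send_buffer_mb(512))  # -> 900.0 MiB
```

At a few hundred ranks in the communicator, the combined send buffer alone approaches the 1 GB per process available in dual mode, which would be consistent with the out-of-memory abort on the root.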


-- 
Nichols A. Romero, Ph.D.
Argonne Leadership Computing Facility
Argonne, IL 60490
(630) 252-3441 (O)
(630) 470-0462 (C)

