[Pw_forum] PW taskgroups and a large run on a BG/P
David Farrell
davidfarrell2008 at u.northwestern.edu
Wed Jan 28 22:18:17 CET 2009
Oddly enough, the same input file, run in dual mode with 1 taskgroup
(so each process should have access to 1 GB of RAM), doesn't spit out
the previous error, but rather this one:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     from pdpotf : error # 1
     problems computing cholesky decomposition
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
I would have expected this one to fail the same way. The fix suggested on
the mailing list seems to be to disable the parallel Cholesky
decomposition, but that doesn't seem like a very good option in my case.
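For reference, my understanding of what that error means: pdpotf is the
parallel Cholesky routine, and a failed Cholesky normally indicates that
the matrix being factorized is not numerically positive definite; serial
LAPACK dpotrf reports the same condition through a positive info return
code. A minimal serial sketch of that failure mode (using LAPACKE, nothing
taken from the QE sources):

/* Sketch of the condition behind "problems computing cholesky
 * decomposition": dpotrf returns info > 0 when the leading minor of
 * that order is not positive definite, which is the serial analogue
 * of what the parallel pdpotf reports.  Requires a LAPACKE build. */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    /* A deliberately non-positive-definite 2x2 symmetric matrix. */
    double a[4] = { 1.0, 2.0,
                    2.0, 1.0 };

    lapack_int info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'U', 2, a, 2);
    if (info > 0)
        printf("Cholesky failed: leading minor %d not positive definite\n",
               (int) info);
    return 0;
}

Whether the matrix here is genuinely ill-conditioned or has just picked up
garbage from a memory problem, I cannot tell.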
I am re-running this case to see whether the error is reproducible, and
also trying the smp-mode version with 1 taskgroup to get a better read on
where the MPI_Scatterv is being called from (for some reason there was no
core file for the master process). I am also not really sure how to go
about finding out the send buffer size; I guess a debugger may be the only
option?
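One way to at least estimate that number without a debugger: the root of
an MPI_Scatterv has to hold the sum of all the per-rank send counts in its
send buffer, so if every rank receives roughly the rcount of 230400
double-precision words shown in the error message, the root needs about
nproc x 230400 x 8 bytes. A small stand-alone sketch of that arithmetic
(the equal per-rank counts are my assumption; the real counts live inside
PWscf and would have to be printed from there):

/* Rough estimate of the memory the MPI_Scatterv root needs for its
 * send buffer.  The per-rank counts here are hypothetical (each rank
 * assumed to receive the 230400 double-precision words seen in the
 * error message); the real counts live inside PWscf.
 * Build with mpicc and run under the usual MPI launcher. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int rcount = 230400;          /* from the error message */
    int *scnts  = malloc(nproc * sizeof *scnts);
    int *displs = malloc(nproc * sizeof *displs);
    long long total = 0;
    for (int i = 0; i < nproc; i++) {
        scnts[i]  = rcount;             /* assumed equal per rank  */
        displs[i] = i * rcount;
        total    += scnts[i];
    }

    if (rank == 0)
        printf("root send buffer: %lld doubles = %.1f MB\n",
               total, total * 8.0 / (1024.0 * 1024.0));

    /* The scatter itself is omitted: allocating the full send buffer
     * is exactly what runs the root out of memory in the real run. */

    free(scnts);
    free(displs);
    MPI_Finalize();
    return 0;
}

Printing the same sum just before the scatter inside the code (or from a
breakpoint on MPI_Scatterv) would give the real number.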
Dave
On Jan 28, 2009, at 2:04 PM, Axel Kohlmeyer wrote:
> On Wed, 28 Jan 2009, David Farrell wrote:
>
>
> [...]
>
> DF>      Largest allocated arrays     est. size (Mb)     dimensions
> DF>         Kohn-Sham Wavefunctions        73.76 Mb     (   3147, 1536)
> DF>         NL pseudopotentials           227.42 Mb     (   3147, 4736)
> DF>         Each V/rho on FFT grid          3.52 Mb     ( 230400)
> DF>         Each G-vector array             0.19 Mb     (  25061)
> DF>         G-vector shells                 0.08 Mb     (  10422)
> DF>      Largest temporary arrays     est. size (Mb)     dimensions
> DF>         Auxiliary wavefunctions        73.76 Mb     (   3147, 3072)
> DF>         Each subspace H/S matrix       72.00 Mb     (   3072, 3072)
> DF>         Each <psi_i|beta_j> matrix     55.50 Mb     (   4736, 1536)
> DF>         Arrays for rho mixing          28.12 Mb     ( 230400,    8)
> DF>
> [...]
> DF> with an error like this in the stderr file:
> DF>
> DF> Abort(1) on node 210 (rank 210 in comm 1140850688): Fatal error
> DF> in MPI_Scatterv: Other MPI error, error stack:
> DF> MPI_Scatterv(360): MPI_Scatterv(sbuf=0x36c02010, scnts=0x7fffa940,
> DF> displs=0x7fffb940, MPI_DOUBLE_PRECISION, rbuf=0x4b83010,
> DF> rcount=230400, MPI_DOUBLE_PRECISION, root=0, comm=0x84000002) failed
> DF> MPI_Scatterv(100): Out of memory
> DF>
> DF> So I figure I am running out of memory on a node at some point...
> DF> but not entirely sure where (seems to be in the first electronic
> DF> step) or how to get around it.
>
> it dies on the processor calling MPI_Scatterv, probably the
> (group)master(s). what is interesting is that the rcount size matches
> the "arrays for rho mixing", so i would suggest to first have a look
> there and try to determine how large the combined send buffers are.
>
> cheers,
> axel.
>
>
> DF>
> DF> Any help would be appreciated.
> DF>
> DF> Dave
> DF>
> DF>
> DF>
> DF>
> DF> David E. Farrell
> DF> Post-Doctoral Fellow
> DF> Department of Materials Science and Engineering
> DF> Northwestern University
> DF> email: d-farrell2 at northwestern.edu
> DF>
> DF>
>
> --
> =======================================================================
> Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
> Center for Molecular Modeling -- University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
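Going back over the numbers in the quoted output, the per-process
estimates look consistent with 16 bytes per element (complex double
precision; that element size is my assumption, pw.x only prints the
totals). For example, 230400 x 8 elements x 16 bytes gives the quoted
28.12 Mb for the rho-mixing arrays. A small sketch of that arithmetic:

/* Recompute a few of the "est. size (Mb)" numbers quoted above,
 * assuming 16 bytes per element (complex double precision).  Purely
 * illustrative; the dimensions are copied from the pw.x output. */
#include <stdio.h>

static double mb(long long elems, int bytes_per_elem)
{
    return elems * (double) bytes_per_elem / (1024.0 * 1024.0);
}

int main(void)
{
    printf("Kohn-Sham Wavefunctions: %.2f Mb\n", mb(3147LL * 1536, 16));
    printf("Each V/rho on FFT grid:  %.2f Mb\n", mb(230400LL, 16));
    printf("Arrays for rho mixing:   %.2f Mb\n", mb(230400LL * 8, 16));
    return 0;
}

Each of these arrays individually fits within the roughly 1 GB per process
available in dual mode, so the squeeze presumably comes from everything
taken together plus the combined send buffers on the rank calling
MPI_Scatterv, which is where Axel suggests looking first.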
David E. Farrell
Post-Doctoral Fellow
Department of Materials Science and Engineering
Northwestern University
email: d-farrell2 at northwestern.edu