[Pw_forum] PW taskgroups and a large run on a BG/P
David Farrell
davidfarrell2008 at u.northwestern.edu
Wed Jan 28 20:46:12 CET 2009
I am trying to run a 1152 atom, 2560 electron pw.x MD calculation on a BG/P,
and I believe I am running up against memory issues (not a
surprise...) - but I am not exactly sure how to debug and solve the
issue. I am trying to run on 1024 procs (I've tried smp, dual and
vn mode), though I guess I may have to go higher - I am not
certain yet.
I have kept npools = 1 and nimage = 1, since they didn't seem applicable
to my run. I have tried varying ntg among 1, 2 and 32, and kept ndiag at
the default.
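For reference, here is the quick Python sketch I use for the per-rank memory
budget (assuming the standard 4-core, 2 GB/node BG/P configuration - that
figure is my assumption, not something from the output):

node_mem_mb = 2048                     # assumed 2 GB of RAM per BG/P node
for mode, ranks_per_node in (("smp", 1), ("dual", 2), ("vn", 4)):
    print(mode, "->", node_mem_mb // ranks_per_node, "MB per MPI rank")
# smp -> 2048 MB, dual -> 1024 MB, vn -> 512 MB per rank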
While varying the number of taskgroups (ntg), I would get output like
the following:
Parallel version (MPI)
Number of processors in use: 1024
R & G space division: proc/pool = 1024
wavefunctions fft division: fft/group = 2
For Norm-Conserving or Ultrasoft (Vanderbilt) Pseudopotentials
or PAW
Current dimensions of program pwscf are:
Max number of different atomic species (ntypx) = 10
Max number of k-points (npk) = 40000
Max angular momentum in pseudopotentials (lmaxx) = 3
gamma-point specific algorithms are used
Iterative solution of the eigenvalue problem
a parallel distributed memory algorithm will be used,
eigenstates matrixes will be distributed block like on
ortho sub-group = 32* 32 procs
Message from routine data_structure:
some processors have no planes
Message from routine data_structure:
some processors have no smooth planes
Planes per process (thick) : nr3 =480 npp = 1 ncplane =*****
   Proc/  planes  cols     G    planes  cols     G    columns     G
   Pool       (dense grid)         (smooth grid)       (wavefct grid)
     1      1     162   50122     1     162   50122      42     6294
     2      0     162   50122     0     162   50122      42     6294
     3      1     162   50122     1     162   50122      42     6294
     4      0     162   50122     0     162   50122      42     6294
     5      1     162   50122     1     162   50122      42     6294
(continues similarly for each of the 1024 procs)
So the number of FFT planes that need to be distributed is 480. The
per-proc table below that made it seem like there were processors which
still weren't taking part in the calculation, and presumably weren't
helping out with the distribution of the data.
My understanding is that the processors of each taskgroup would take
part in the FFT calculation for the plane associated with the task
group. So my first question is - is the fact that some procs don't
have planes in my output actually an issue?
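If that picture is right, the numbers would work out roughly like this Python
sketch (my reading of what ntg does may well be wrong, so take it as a guess):

# Dense-grid planes and R & G procs, from the output above
nr3, nproc = 480, 1024

# Without task groups, at most one plane per proc:
print(min(nr3, nproc), "procs hold a plane,", nproc - min(nr3, nproc), "hold none")
# -> 480 procs hold a plane, 544 hold none (the npp = 0 rows above)

# With ntg task groups, each group of nproc/ntg procs would do the FFTs for its
# own subset of bands, so the planes only have to cover nproc/ntg procs:
for ntg in (1, 2, 32):
    per_group = nproc // ntg
    print("ntg =", ntg, "->", per_group, "procs per group,",
          max(per_group - nr3, 0), "of them without a plane")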
The output continues, and the run finally dies here:
   Largest allocated arrays       est. size (Mb)    dimensions
      Kohn-Sham Wavefunctions          73.76 Mb     (   3147, 1536)
      NL pseudopotentials             227.42 Mb     (   3147, 4736)
      Each V/rho on FFT grid            3.52 Mb     ( 230400)
      Each G-vector array               0.19 Mb     (  25061)
      G-vector shells                   0.08 Mb     (  10422)
   Largest temporary arrays       est. size (Mb)    dimensions
      Auxiliary wavefunctions          73.76 Mb     (   3147, 3072)
      Each subspace H/S matrix         72.00 Mb     (   3072, 3072)
      Each <psi_i|beta_j> matrix       55.50 Mb     (   4736, 1536)
      Arrays for rho mixing            28.12 Mb     ( 230400,    8)
Initial potential from superposition of free atoms
Check: negative starting charge= -7.401460
starting charge 2556.45492, renormalised to 2560.00000
negative rho (up, down): 0.741E+01 0.000E+00
Starting wfc are 2944 atomic wfcs
total cpu time spent up to now is 704.01 secs
per-process dynamical memory: 13.6 Mb
Self-consistent Calculation
iteration # 1 ecut= 38.22 Ry beta=0.70
Davidson diagonalization with overlap
ethr = 1.00E-02, avg # of iterations = 2.0
process group 2362 has completed
with an error like this in the stderr file:
Abort(1) on node 210 (rank 210 in comm 1140850688): Fatal error in MPI_Scatterv: Other MPI error, error stack:
MPI_Scatterv(360): MPI_Scatterv(sbuf=0x36c02010, scnts=0x7fffa940, displs=0x7fffb940, MPI_DOUBLE_PRECISION, rbuf=0x4b83010, rcount=230400, MPI_DOUBLE_PRECISION, root=0, comm=0x84000002) failed
MPI_Scatterv(100): Out of memory
So I figure I am running out of memory on a node at some point... but
I am not entirely sure where (it seems to be in the first electronic step)
or how to get around it.
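In case it helps, here is the arithmetic I did while staring at this (a rough
Python sketch; in particular, reading the Scatterv count as one plane of the
dense grid is my guess):

MB = 1024.0 ** 2

# Per-process array sizes as reported by pw.x above
reported_mb = {
    "Kohn-Sham Wavefunctions":     73.76,
    "NL pseudopotentials":        227.42,
    "Each V/rho on FFT grid":       3.52,
    "Auxiliary wavefunctions":     73.76,
    "Each subspace H/S matrix":    72.00,
    "Each <psi_i|beta_j> matrix":  55.50,
    "Arrays for rho mixing":       28.12,
}
print(round(sum(reported_mb.values()), 1), "MB for these arrays alone")
# ~534 MB, already more than the ~512 MB per rank available in vn mode

# The crash is in MPI_Scatterv with rcount = 230400 = 480*480, i.e. one plane of
# the 480^3 dense grid. If the root really is scattering the whole grid plane by
# plane (my guess), its send buffer alone would be:
print(round(480**3 * 8 / MB, 1), "MB of real*8 on the root")   # ~843.8 MB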
Any help would be appreciated.
Dave
David E. Farrell
Post-Doctoral Fellow
Department of Materials Science and Engineering
Northwestern University
email: d-farrell2 at northwestern.edu