[Pw_forum] Job crashes on multiple nodes

Gabriele Sclauzero sclauzer at sissa.it
Mon May 31 10:12:25 CEST 2010


Dear Wolfgang,

On 31 May 2010, at 09:53, Wolfgang Gehricht wrote:

> Dear group!
>
> I am experiencing the following problem with pw.x. When I run a job
> on a single core with 8 processors, the calculation does not exceed
> the available RAM, i.e. it works. When I run the same job on two
> cores, the calculation crashes (mpierun: Forwarding signal 12 to
> job) [relevant log parts below]. However, I can run this job on two
> cores with a smaller k-point sampling, hence I suspect it has
> something to do with the memory demands/distribution. I am using the
> "Davidson iterative diagonalization" as a minimizer, with only the
> thresholds set (convergence, 1st iterative diagonalization).
> Can you please point me in the right direction?
> With thanks,
> Yours, Wolfgang
> ---
> Parallel version (MPI)
>
> Number of processors in use: 16
> R & G space division: proc/pool = 16

It's not a good idea to run 16 processes on 8 cores... even worse on 2
cores! The data is distributed among the processes, so the more
processes in the R&G pool, the less memory each process needs.
However, if all the processes run on the same core (or node), the data
is not distributed *physically*: the total memory demand on that
machine stays the same. Moreover, the processes will tend to step on
each other's toes.
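
In practice, you should start at most one MPI process per physical
core. With an Open MPI-style launcher (the file names below are just
placeholders for your input/output), something along these lines
should do:

   # one MPI process per physical core on a single 8-core node
   mpirun -np 8 pw.x -in job.in > job.out

rather than oversubscribing 16 processes onto 8 (or 2) cores.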

About the k-point sampling (assuming you don't use parallelization
over k-points, i.e. npool=1): the largest data arrays are kept in main
memory for only one k-point at a time (unless you use options like
disk_io='none'), so increasing the number of k-points will increase
the computation time almost linearly, but will leave the memory
consumption almost unchanged.
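
If, on the other hand, you have many k-points and enough memory,
k-point pools are the cheapest way to cut the wall time. Just as an
example (the pool count and file names are only placeholders):

   # split 16 processes into 2 pools of 8; each pool works on half of the k-points
   mpirun -np 16 pw.x -npool 2 -in job.in > job.out

Keep in mind that within each pool the R&G data are then divided among
only 8 processes, so the memory per process increases accordingly.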

If the problem lies in the parallel (distributed) subspace
diagonalization used by Davidson, you may try to disable it by setting
ndiag=1.
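
With recent versions this can be passed on the command line (an
untested sketch, same placeholder file names as above):

   # serial subspace diagonalization instead of the 4x4 ortho group
   mpirun -np 16 pw.x -ndiag 1 -in job.in > job.out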

All that said, from the memory usage estimate reported below, it looks  
like you're not running such a big system... nothing that couldn't be  
run on a laptop with 2GB of memory.
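
Just adding up the estimates you posted: the largest allocated arrays
come to about 15 Mb per process and the largest temporaries to about
29 Mb, i.e. roughly 45 Mb of big arrays per process, or around 0.7 Gb
over 16 processes. The actual usage will be a few times larger (these
estimates don't count everything), but it should still fit easily in a
couple of Gb.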


HTH


GS

> ...
> Subspace diagonalization in iterative solution of the eigenvalue  
> problem:
> a parallel distributed memory algorithm will be used,
> eigenstates matrixes will be distributed block like on
> ortho sub-group = 4* 4 procs
> ...
> Planes per process (thick) : nr3 = 90 npp = 6 ncplane = 8100
>
> Proc/  planes  cols     G     planes  cols     G    columns    G
> Pool       (dense grid)         (smooth grid)      (wavefct grid)
>   1      6     339   18817      6     339   18817     92    2668
>   2      6     339   18817      6     339   18817     92    2668
>   3      6     339   18817      6     339   18817     92    2668
>   4      6     338   18812      6     338   18812     92    2668
>   5      6     338   18812      6     338   18812     93    2669
>   6      6     338   18812      6     338   18812     93    2669
>   7      6     338   18812      6     338   18812     93    2669
>   8      6     338   18812      6     338   18812     93    2669
>   9      6     338   18812      6     338   18812     93    2669
>  10      6     338   18812      6     338   18812     93    2669
>  11      5     339   18817      5     339   18817     93    2669
>  12      5     339   18815      5     339   18815     93    2669
>  13      5     339   18815      5     339   18815     93    2669
>  14      5     339   18815      5     339   18815     92    2666
>  15      5     339   18815      5     339   18815     92    2666
>  16      5     339   18815      5     339   18815     92    2666
> tot     90    5417  301027     90    5417  301027   1481   42691
> ...
> G cutoff = 1729.8995 ( 301027 G-vectors) FFT grid: ( 90, 90, 90)
>
> Largest allocated arrays        est. size (Mb)    dimensions
>    Kohn-Sham Wavefunctions           4.63 Mb      (  2373,  128)
>    NL pseudopotentials               9.12 Mb      (  2373,  252)
>    Each V/rho on FFT grid            1.48 Mb      ( 48600,    2)
>    Each G-vector array               0.14 Mb      ( 18817)
>    G-vector shells                   0.01 Mb      (  1370)
> Largest temporary arrays        est. size (Mb)    dimensions
>    Auxiliary wavefunctions          18.54 Mb      (  2373,  512)
>    Each subspace H/S matrix          4.00 Mb      (   512,  512)
>    Each <psi_i|beta_j> matrix        0.49 Mb      (   252,  128)
>    Arrays for rho mixing             5.93 Mb      ( 48600,    8)
>
> Initial potential from superposition of free atoms
> 16 total processes killed (some possibly by mpirun during cleanup)


§ Gabriele Sclauzero, EPFL SB ITP CSEA
    PH H2 462, Station 3, CH-1015 Lausanne
