[QE-users] Error during diagonalization (memcpy, zhegvdx_gpu) in nscf with many bands (GPU)
Pietro Bonfa
pietro.bonfa at unipr.it
Mon Aug 31 19:08:44 CEST 2020
Dear Sara,
at least the error message is clear now: there's no memory left on the GPU.
You could have guessed this in advance by inspecting the first lines of
the output where the memory estimator reports:
Estimated static dynamical RAM per process > 652.61 MB
Estimated max dynamical RAM per process > 16.82 GB
Estimated total dynamical RAM > 1210.88 GB
The second entry is the important one: you have one process per GPU and
16 GB of memory on each card. Although the estimate is for RAM, it's
generally also a good guess for the GPU memory.
Try using fewer pools (or more nodes if you desperately need this to run
fast).
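
As a rough illustration of the check described above, the following Python sketch reads the memory estimator lines from a pw.x output and compares the per-process maximum with the memory of one GPU. It is not part of Quantum ESPRESSO: the function names are hypothetical, the regular expression assumes the line format quoted above, only MB and GB units are handled, and the factor 1024 between them is an assumption.

```python
import re

# Assumed conversion factors; the estimator may use other units as well.
UNITS_IN_GB = {"MB": 1.0 / 1024.0, "GB": 1.0}


def max_dynamical_ram_gb(output_text):
    """Return the 'Estimated max dynamical RAM per process' in GB, or None."""
    pattern = (
        r"Estimated max dynamical RAM per process\s*>\s*"
        r"([\d.]+)\s*(MB|GB)"
    )
    match = re.search(pattern, output_text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2)
    return value * UNITS_IN_GB[unit]


def fits_on_gpu(output_text, gpu_memory_gb):
    """True if the estimated per-process maximum is within the GPU memory.

    This assumes one MPI process per GPU, as in the run discussed here.
    """
    estimate = max_dynamical_ram_gb(output_text)
    if estimate is None:
        raise ValueError("memory estimate not found in output")
    return estimate <= gpu_memory_gb


# The three estimator lines quoted in this message.
sample = """
Estimated static dynamical RAM per process > 652.61 MB
Estimated max dynamical RAM per process > 16.82 GB
Estimated total dynamical RAM > 1210.88 GB
"""

if __name__ == "__main__":
    print("estimated max per process (GB):", max_dynamical_ram_gb(sample))
    print("fits on a 16 GB card:", fits_on_gpu(sample, 16.0))
```

With the values quoted above, the estimate (16.82 GB) exceeds the 16 GB available on each card, so the check fails before any job is submitted. Since the estimate is only a guide for GPU memory, a run that passes with little margin may still fail.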
Best,
Pietro
On 8/31/20 6:54 PM, Sara Postorino wrote:
> Thank you for your response,
>
> I ran it again with 6.5 (couldn't install 6.6a1), it uses the serial
> eigensolver.
>
> Now I get:
> Band Structure Calculation
> Davidson diagonalization with overlap
>
> Computing kpt #: 1 of 9 on this pool
> Really copied g2kin H->D
> Really copied evc H->D
> Really copied et H->D
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Error in routine cegterg (1):
> cannot allocate vc_d
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> stopping ...
>
> I attach input and output
>
> I'll put the rest on gitlab
>
> Thank you,
> Sara
>
>
> On Sun, 30 Aug 2020 at 23:18, Pietro Bonfa <pietro.bonfa at unipr.it> wrote:
>
> Dear Sara,
>
> I'd suggest checking the following:
>
> 1. verify that the serial eigensolver is used (it's written at the
> beginning of the output);
>
> 2. use the latest version (6.6a1) that will correctly report problems
> with memory allocations during the iterative diagonalization.
>
> Could you please also open an issue at
> https://gitlab.com/QEF/q-e-gpu/-/issues
> and attach the input, the pseudopotentials and the job script that you
> are using?
>
> Thank you,
> kind regards,
> Pietro
>
>
>
> On 8/29/20 6:33 PM, Sara Postorino wrote:
> > Hi QE users,
> >
> > I am running PW on Marconi100 and experiencing problems during
> > diagonalization. I am using version 6.5 (autoload of the modules on m100).
> > My system is a MoTe2 bilayer with a 39x39x1 k mesh and many bands, because
> > I will do a GW calculation on top of it. (The calculation
> > works if I do not add many bands.)
> > I tried with 4000 and 3000 bands using Davidson diagonalization, running
> > on 18 nodes:
> > Parallel version (MPI & OpenMP), running on 2304 processor cores
> > Number of MPI processes: 72
> > Threads/MPI process: 32
> > When doing the calculation of the first point I get:
> >
> > Really copied g2kin H->D
> > Really copied evc H->D
> > Really copied et H->D
> > Really copied vrs H->D
> > dp_memcpy_d2h_c2dinvalid pitch argument 12
> >
> > I also tried the conjugate gradient algorithm, but it gets stuck at
> >
> > Really copied evc H->D
> > Really copied et H->D
> > Really copied h_diag H->D
> > Really copied becp%nc H->D
> > Really copied g2kin H->D
> > Really copied vrs H->D
> >
> > And here it takes forever. I left it running for more than 1 hour and it
> > didn't finish one k point, and since I have 147 k points the computation
> > would be very expensive even if it worked.
> >
> > I also tried to go down to 1000 bands (I need way more) and got:
> > Really copied g2kin H->D
> > Really copied evc H->D
> > Really copied et H->D
> > Really copied vrs H->D
> > zhegvdx_gpu error: cusolverDnZpotrf failed!
> >
> >
> > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > Error in routine cdiaghg_gpu (1):
> > zhegvdx_gpu failed
> > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> >
> > Do you have any suggestion on how to fix this issue?
> > Thanks
> >
> > Sara Postorino
> > PhD student
> > University of Rome Tor Vergata
> >
> >
> >
> >
> > _______________________________________________
> > Quantum ESPRESSO is supported by MaX (http://www.max-centre.eu/quantum-espresso)
> > users mailing list users at lists.quantum-espresso.org
> > https://lists.quantum-espresso.org/mailman/listinfo/users
> >
>
> Sign over your 5 per mille to the University of Parma and help our
> students who want to study abroad. Enter 00308780345 in your
> tax return.