[QE-users] Error during diagonalization (memcpy, zhegvdx_gpu) in nscf with many bands (GPU)

Pietro Bonfa pietro.bonfa at unipr.it
Mon Aug 31 19:08:44 CEST 2020


Dear Sara,

at least the error message is clear now: there's no memory left on the GPU.

You could have guessed this in advance by inspecting the first lines of
the output where the memory estimator reports:

      Estimated static dynamical RAM per process >     652.61 MB
      Estimated max dynamical RAM per process >      16.82 GB
      Estimated total dynamical RAM >    1210.88 GB

The second entry is the important one: you have one process per GPU and
16 GB of memory on each card, so the 16.82 GB estimate already exceeds
what is available. Although the estimate is for RAM, it is generally a
good guess for GPU memory as well.
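
As a rough sketch of that check, the snippet below pulls the "max dynamical
RAM per process" figure out of the estimator lines quoted above and compares
it with the 16 GB per card mentioned in this thread. It is not part of pw.x:
the function name is made up for illustration, and it assumes the estimate is
printed in MB or GB as in the lines shown here.

```python
import re

# Estimator lines copied from the pw.x output quoted above.
OUTPUT = """\
     Estimated static dynamical RAM per process >     652.61 MB
     Estimated max dynamical RAM per process >      16.82 GB
     Estimated total dynamical RAM >    1210.88 GB
"""

# Assumption: only MB and GB units appear, as in the quoted output.
UNIT_TO_GB = {"MB": 1.0 / 1024.0, "GB": 1.0}


def max_ram_per_process_gb(text):
    """Return the 'max dynamical RAM per process' estimate in GB, or None."""
    match = re.search(
        r"Estimated max dynamical RAM per process\s*>\s*([\d.]+)\s*(MB|GB)",
        text,
    )
    if match is None:
        return None
    return float(match.group(1)) * UNIT_TO_GB[match.group(2)]


GPU_MEMORY_GB = 16.0  # memory per card, as stated in this thread

estimate = max_ram_per_process_gb(OUTPUT)
fits = estimate is not None and estimate < GPU_MEMORY_GB
print(f"estimate: {estimate} GB, GPU memory: {GPU_MEMORY_GB} GB, fits: {fits}")
```

With one process per GPU, an estimate above the card's memory is a strong hint
that the run will fail at some allocation on the device.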

Try using fewer pools (or more nodes if you desperately need this to run
fast).
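
To see what fewer pools means for this job, here is a small sketch of the
bookkeeping, using the 72 MPI processes and 147 k-points from the original
post. The helper name is hypothetical, and the sketch assumes that the number
of pools must divide the number of processes. The general idea is that the
processes inside a pool share the plane-wave data for a k-point, so more
processes per pool usually means less wavefunction data per process, although
not every array shrinks in the same way.

```python
import math

N_PROCS = 72      # MPI processes, from the original post
N_KPOINTS = 147   # k-points, from the original post


def pool_layout(n_procs, n_kpoints, n_pools):
    """Return (processes per pool, largest number of k-points on a pool)."""
    # Assumption: the pool count has to divide the process count evenly.
    if n_pools < 1 or n_procs % n_pools != 0:
        raise ValueError("number of pools must divide the number of processes")
    return n_procs // n_pools, math.ceil(n_kpoints / n_pools)


layouts = {n: pool_layout(N_PROCS, N_KPOINTS, n) for n in (72, 36, 18, 9)}
for n_pools, (procs, kpts) in layouts.items():
    print(f"{n_pools:3d} pools: {procs} process(es) per pool, "
          f"up to {kpts} k-points per pool")
```

For example, 18 pools would give 4 processes per pool and 9 k-points per pool,
which is consistent with the "1 of 9 on this pool" line in the quoted output,
though the pool count actually used is not stated in this thread. Going to 9
pools would double the processes per pool at the cost of more k-points handled
by each pool.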

Best,
Pietro



On 8/31/20 6:54 PM, Sara Postorino wrote:
> Thanks for your response,
>
> I ran it again with 6.5 (I couldn't install 6.6a1); it uses the serial
> eigensolver.
>
> Now I get:
>       Band Structure Calculation
>       Davidson diagonalization with overlap
>
>       Computing kpt #:     1  of     9 on this pool
>   Really copied g2kin H->D
>   Really copied evc H->D
>   Really copied et H->D
>
>   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>       Error in routine  cegterg (1):
>        cannot allocate vc_d
>   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
>       stopping ...
>
> I attach input and output
>
> I'll put the rest on gitlab
>
> Thank you,
> Sara
>
>
> On Sun, Aug 30, 2020 at 23:18, Pietro Bonfa
> <pietro.bonfa at unipr.it> wrote:
>
>     Dear Sara,
>
>     I'd suggest checking the following:
>
>     1. verify that the serial eigensolver is used (it's written at the
>     beginning of the output);
>
>     2. use the latest version (6.6a1) that will correctly report problems
>     with memory allocations during the iterative diagonalization.
>
>     Could you please also open an issue at
>     https://gitlab.com/QEF/q-e-gpu/-/issues
>     and attach the input, the pseudopotentials and the job script
>     that you are using?
>
>     Thank you,
>     kind regards,
>     Pietro
>
>
>
>     On 8/29/20 6:33 PM, Sara Postorino wrote:
>      > Hi QE users,
>      >
>      > I am running PW on Marconi100 and experiencing problems during
>      > diagonalization. I am using version 6.5 (autoload of the modules
>      > on m100).
>      > My system is a MoTe2 bilayer with a 39x39x1 k mesh and many bands,
>      > because I will do a GW calculation on top of it. (The calculation
>      > works if I do not add many bands.)
>      > I tried with 4000 and 3000 bands using Davidson diagonalization,
>      > running on 18 nodes:
>      > Parallel version (MPI & OpenMP), running on    2304 processor cores
>      >       Number of MPI processes:                72
>      >       Threads/MPI process:                    32
>      > When doing the calculation of the first k-point I get:
>      >
>      >   Really copied g2kin H->D
>      >   Really copied evc H->D
>      >   Really copied et H->D
>      >   Really copied vrs H->D
>      >   dp_memcpy_d2h_c2dinvalid pitch argument           12
>      >
>      > I also tried the conjugate gradient algorithm, but it gets stuck at
>      >
>      >   Really copied evc H->D
>      >   Really copied et H->D
>      >   Really copied h_diag H->D
>      >   Really copied becp%nc H->D
>      >   Really copied g2kin H->D
>      >   Really copied vrs H->D
>      >
>      > And here it takes forever. I left it running for more than 1 hour
>      > and it didn't finish one k-point, and since I have 147 k-points the
>      > computation would be very expensive even if it worked.
>      >
>      > I also tried going down to 1000 bands (I need way more) and got:
>      >   Really copied g2kin H->D
>      >   Really copied evc H->D
>      >   Really copied et H->D
>      >   Really copied vrs H->D
>      >   zhegvdx_gpu error: cusolverDnZpotrf failed!
>      >
>      >
>      >  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>      >       Error in routine  cdiaghg_gpu (1):
>      >        zhegvdx_gpu failed
>      >  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>      >
>      > Do you have any suggestion on how to fix this issue?
>      > Thanks
>      >
>      > Sara Postorino
>      > PhD student
>      > University of Rome Tor Vergata
>      >
>      > _______________________________________________
>      > Quantum ESPRESSO is supported by MaX
>      > (http://www.max-centre.eu/quantum-espresso)
>      > users mailing list users at lists.quantum-espresso.org
>      > https://lists.quantum-espresso.org/mailman/listinfo/users
>      >
>
>     Sign over your 5 per mille to the University of Parma and help our
>     students who want to study abroad - enter 00308780345 in your
>     tax return.
