[QE-users] Error during diagonalization (memcpy, zhegvdx_gpu) in nscf with many bands (GPU)

Pietro Bonfa pietro.bonfa at unipr.it
Sun Aug 30 23:17:46 CEST 2020


Dear Sara,

I'd suggest checking the following:

1. verify that the serial eigensolver is used (it's written at the
beginning of the output);

2. use the latest version (6.6a1), which correctly reports problems
with memory allocations during the iterative diagonalization.
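
For point 1, a minimal sketch of how the check could be scripted is given
here. It assumes that pw.x announces the solver near the top of the output
with a line starting "Subspace diagonalization", followed by either
"a serial algorithm will be used" or a line mentioning a "distributed-memory
algorithm"; the exact wording may differ between versions. The function name
eigensolver_kind and the file name pw.out are hypothetical.

```python
import sys


def eigensolver_kind(output_text):
    """Guess which subspace eigensolver a pw.x output reports.

    Returns "serial", "parallel", "unknown" (header found, wording not
    recognised) or None (header not found at all).
    """
    lines = output_text.splitlines()
    for i, line in enumerate(lines):
        # Assumed header wording; adjust if your version prints it differently.
        if "Subspace diagonalization" in line:
            # Look at the header line and the few lines that follow it.
            window = " ".join(lines[i:i + 4]).lower()
            if "serial algorithm" in window:
                return "serial"
            if "distributed-memory algorithm" in window:
                return "parallel"
            return "unknown"
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_solver.py pw.out   (pw.out is a hypothetical name)
    with open(sys.argv[1]) as handle:
        print(eigensolver_kind(handle.read()))
```

If the result is not "serial", it is worth rerunning without parallel
diagonalization before looking further.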

Could you please also open an issue at
https://gitlab.com/QEF/q-e-gpu/-/issues and attach the input, the
pseudopotentials and the job script that you are using?

Thank you,
kind regards,
Pietro



On 8/29/20 6:33 PM, Sara Postorino wrote:
> Hi QE users,
>
> I am running PW on Marconi100 and experiencing problems during
> diagonalization. I am using version 6.5 (autoload of the modules on m100).
> My system is a MoTe2 bilayer with a 39x39x1 k mesh and many bands,
> because I will do a GW calculation on top of it. (The calculation
> works if I do not add many bands.)
> I tried with 4000 and 3000 bands using Davidson diagonalization running
> on 18 nodes:
> Parallel version (MPI & OpenMP), running on    2304 processor cores
>       Number of MPI processes:                72
>       Threads/MPI process:                    32
> When doing the calculation of the first point I get:
>
>   Really copied g2kin H->D
>   Really copied evc H->D
>   Really copied et H->D
>   Really copied vrs H->D
>   dp_memcpy_d2h_c2dinvalid pitch argument           12
>
> I also tried the conjugate gradient algorithm, but it gets stuck at
>
>   Really copied evc H->D
>   Really copied et H->D
>   Really copied h_diag H->D
>   Really copied becp%nc H->D
>   Really copied g2kin H->D
>   Really copied vrs H->D
>
> And here it takes forever. I left it running for more than 1 hour and it
> didn't finish one k point, and since I have 147 k points the computation
> would be very expensive even if it worked.
>
> I also tried to go down to 1000 bands (I need way more) and got
>   Really copied g2kin H->D
>   Really copied evc H->D
>   Really copied et H->D
>   Really copied vrs H->D
>   zhegvdx_gpu error: cusolverDnZpotrf failed!
>
>   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>       Error in routine  cdiaghg_gpu (1):
>        zhegvdx_gpu failed
>   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> Do you have any suggestion on how to fix this issue?
> Thanks
>
> Sara Postorino
> PhD student
> University of Rome Tor Vergata
>
