[QE-users] hp.x Error in routine cdiaghg (270): problems computing cholesky
Simon Imanuel Rombauer
simon.rombauer at student.uni-augsburg.de
Mon Jan 29 13:20:08 CET 2024
Sending this again, since it seems that every second email of mine is not delivered...
On Monday, January 29, 2024 12:31 CET, "Simon Imanuel Rombauer" <simon.rombauer at student.uni-augsburg.de> wrote:
> Dear Iurii,
>
> many thanks for your suggestion. In the meantime I have switched to QE7.3 and tried to recalculate the Hubbard parameters for SrTiO3. In QE7.1 I split the calculation for the two atoms to be perturbed but did not split it over the different q-points; it worked and I got reasonable U values.
>
> Now, in QE7.3, I tried the same but increased the q-grid (from 5x5x5 in the QE7.1 run to 6x6x6, to check whether the values differ much). I also tried to manually parallelize the calculation across the q-points, as suggested in https://doi.org/10.48550/arXiv.2203.15684. However, this time I ran the initial scf calculation with the U values obtained from the QE7.1 run.
>
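For reference, the manual split over q-points described above could be scripted roughly as follows. This is only a sketch under assumptions: the prefix 'SrTiO3', the outdir './tmp', the atom indices, and the number of q-points are placeholders and not taken from the attached files. The keywords perturb_only_atom, start_q, and last_q are the ones I understand hp.x to accept for this purpose; please check INPUT_HP for your QE version.

```python
def hp_input(prefix, outdir, nq, atom_index, q_index):
    """Return the text of one hp.x input restricted to one atom and one q-point."""
    lines = [
        "&inputhp",
        f"  prefix = '{prefix}'",
        f"  outdir = '{outdir}'",
        f"  nq1 = {nq[0]}, nq2 = {nq[1]}, nq3 = {nq[2]}",
        f"  perturb_only_atom({atom_index}) = .true.",
        f"  start_q = {q_index}",
        f"  last_q = {q_index}",
        "/",
        "",
    ]
    return "\n".join(lines)


def all_inputs(prefix, outdir, nq, atom_indices, n_q):
    """Map a file name to the input text for every (atom, q-point) pair.

    n_q is the number of symmetry-reduced q-points. It is NOT nq1*nq2*nq3
    and has to be read from an hp.x output (placeholder value below).
    """
    inputs = {}
    for atom in atom_indices:
        for iq in range(1, n_q + 1):
            name = f"hp.atom{atom}.q{iq}.in"  # hypothetical naming scheme
            inputs[name] = hp_input(prefix, outdir, nq, atom, iq)
    return inputs


if __name__ == "__main__":
    # Placeholder values for illustration only.
    inputs = all_inputs("SrTiO3", "./tmp", (6, 6, 6), atom_indices=[1, 2], n_q=4)
    print(f"{len(inputs)} input files would be written")
    print(inputs["hp.atom1.q1.in"])
```

Each generated input would then be run as its own job. As far as I understand the HP documentation, the pieces are afterwards collected per atom with sum_pertq = .true. and finally with compute_hp = .true., but I have not verified this for QE7.3.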
> I always run into an out-of-memory (OOM) problem when submitting the calculations to the HPC cluster (both for the split-up hp.x calculations and for the all-in-one hp.x run). Of course, I requested more and more RAM, but after some hours the job always stops. After talking to the system administrator of the HPC cluster, it seems that the code allocates more and more RAM with each q-point without freeing it afterwards. The last kernel log read:
>
> > Jan 28 13:51:45 alcc141 kernel: Tasks state (memory values in pages):
> > Jan 28 13:51:45 alcc141 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> > Jan 28 13:51:45 alcc141 kernel: [1731714] 0 1731714 52913 1945 81920 0 -1000 slurmstepd
> > Jan 28 13:51:45 alcc141 kernel: [1731726] 150537 1731726 2515 835 57344 0 0 bash
> > Jan 28 13:51:45 alcc141 kernel: [1731765] 150537 1731765 87567 2054 90112 0 0 srun
> > Jan 28 13:51:45 alcc141 kernel: [1731766] 150537 1731766 3517 197 53248 0 0 srun
> > Jan 28 13:51:45 alcc141 kernel: [1731775] 0 1731775 86282 2020 90112 0 -1000 slurmstepd
> > Jan 28 13:51:45 alcc141 kernel: [1731781] 150537 1731781 19145371 18290506 147206144 0 0 hp.x
> > Jan 28 13:51:45 alcc141 kernel: [1731782] 150537 1731782 922590 65967 1085440 0 0 hp.x
> > Jan 28 13:51:45 alcc141 kernel: [1731783] 150537 1731783 19134592 18282298 147103744 0 0 hp.x
> > Jan 28 13:51:45 alcc141 kernel: [1731784] 150537 1731784 921010 69907 1077248 0 0 hp.x
> > Jan 28 13:51:45 alcc141 kernel: [1731785] 150537 1731785 19140587 18285101 147156992 0 0 hp.x
> > Jan 28 13:51:45 alcc141 kernel: [1731786] 150537 1731786 924966 69686 1105920 0 0 hp.x
> > Jan 28 13:51:45 alcc141 kernel: [1731787] 150537 1731787 19133569 18279873 147111936 0 0 hp.x
> > Jan 28 13:51:45 alcc141 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=step_0,mems_allowed=0-1,oom_memcg=/slurm/uid_150537/job_863717,task_memcg=/
> > Jan 28 13:51:45 alcc141 kernel: Memory cgroup out of memory: Killed process 1731781 (hp.x) total-vm:76581484kB, anon-rss:73097632kB, file-rss:8400kB, shmem-rss:55992kB
>
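To make the numbers in the log easier to read: the table reports memory in pages, so the rss column has to be multiplied by the page size. A small sketch, assuming 4 KiB pages (this is my assumption; it is consistent with the kernel's own summary line, where anon-rss + file-rss + shmem-rss of process 1731781 equals 4 kB times its rss page count):

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, the usual x86_64 default

# rss column (in pages) of the hp.x processes, copied from the kernel log
rss_pages = {
    1731781: 18290506,
    1731782: 65967,
    1731783: 18282298,
    1731784: 69907,
    1731785: 18285101,
    1731786: 69686,
    1731787: 18279873,
}


def pages_to_gib(pages, page_kib=PAGE_KIB):
    """Convert a page count to GiB (1 GiB = 1024**2 KiB)."""
    return pages * page_kib / 1024**2


if __name__ == "__main__":
    for pid, pages in rss_pages.items():
        print(f"hp.x pid {pid}: {pages_to_gib(pages):8.2f} GiB resident")
    print(f"all hp.x ranks:  {pages_to_gib(sum(rss_pages.values())):8.2f} GiB resident")
```

With this conversion, four of the seven hp.x ranks hold roughly 70 GiB each, while the other three stay well below 1 GiB.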
> Is this normal and expected behaviour, and should I split the calculation further so that each run calculates only a single q-point? In QE7.1 I only had convergence issues but never OOM. The input, output, and Slurm output of the last run are attached. I may add that the code never crashes; I have to kill the run manually once I notice the OOM.
>
> Best regards and have a nice week,
> Simon
>
> On Thursday, January 25, 2024 17:36 CET, Timrov Iurii <iurii.timrov at psi.ch> wrote:
>
> > Dear Simon,
> >
> > You can compute Hubbard parameters using HP on top of the metallic ground state (i.e. with U=0 for your system). Just do one scf with smearing in that case.
> >
> > > Would you suggest to take a parameter set from this (e.g LVO_5.0_2.7_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.3585 13.4614: 1.1029) and start the HP scheme from there on?
> >
> > Do you mean that you used U=5 eV for La-4f, U=2.7 for V-3d, and U=0 for O-2p? I would try a smaller starting U value for La-4f, e.g. 3.2 eV with ortho-atomic orbitals [see PRR 2, 033265 (2020)]. So maybe check first whether you still have a gap with U=3.2 eV for La-4f?
> >
> > HTH
> >
> > Iurii
> >
> > ----------------------------------------------------------
> > Dr. Iurii TIMROV
> > Tenure-track scientist
> > Laboratory for Materials Simulations (LMS)
> > Paul Scherrer Institut (PSI)
> > CH-5232 Villigen, Switzerland
> > +41 56 310 62 14
> > https://www.psi.ch/en/lms/people/iurii-timrov
> > ________________________________
> > From: Simon Imanuel Rombauer <simon.rombauer at student.uni-augsburg.de>
> > Sent: Thursday, January 25, 2024 17:20
> > To: Timrov Iurii <iurii.timrov at psi.ch>; users at lists.quantum-espresso.org <users at lists.quantum-espresso.org>
> > Subject: Re: [QE-users] hp.x Error in routine cdiaghg (270): problems computing cholesky
> >
> > Sending again since I feel it didn't work.
> > On Thursday, January 25, 2024 12:59 CET, "Simon Imanuel Rombauer" <simon.rombauer at student.uni-augsburg.de> wrote:
> >
> > > Dear Iurii,
> > >
> > > thank you for your response. Yes, I have noticed this; I thought HP could start from this 'false state' and calculate the U parameters so that they correctly reflect the Mott-insulator behavior.
> > > I also ran a few DFT+U scf calculations with the U value of V-3d ranging from 2.7 to 2.9 eV, many of which turned out to be metallic. See below (directories are named LVO_U(La-4f)_U(V-3d)_V(O-2p V-3d)):
> > >
> > > LVO_5.0_2.7_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.3585 13.4614: 1.1029
> > > LVO_5.0_2.7_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 12.7936 12.7687: -0.0249
> > > LVO_5.0_2.7_0.3/scf2.out: highest occupied, lowest unoccupied level (ev): 12.8425 12.8166: -0.0259
> > > LVO_5.0_2.8_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.4686 13.3603: 0.8917
> > > LVO_5.0_2.9_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.7631 12.7638: 0.0007
> > > LVO_5.0_2.9_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 12.3092 13.5493: 1.2401
> > > LVO_5.0_2.9_0.3/scf2.out: highest occupied, lowest unoccupied level (ev): 12.4914 13.4507: 0.9593
> > > LVO_6.0_2.7_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.8108 12.8112: 0.0004
> > > LVO_6.0_2.7_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 12.8090 12.7859: -0.0231
> > > LVO_6.0_2.8_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.4816 13.3813: 0.8997
> > > LVO_6.0_2.8_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 13.0800 12.6079: -0.4721
> > > LVO_6.0_2.8_0.3/scf2.out: highest occupied, lowest unoccupied level (ev): 12.5317 13.4616: 0.9299
> > > LVO_6.0_2.9_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 13.0412 12.5850: -0.4562
> > > LVO_6.0_2.9_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 12.3227 13.5668: 1.2441
> > > LVO_6.0_2.9_0.3/scf2.out: highest occupied, lowest unoccupied level (ev): 12.4946 13.4855: 0.9909
> > > LVO_7.0_2.7_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 13.0819 12.6240: -0.4579
> > > LVO_7.0_2.7_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 13.1313 12.6651: -0.4662
> > > LVO_7.0_2.7_0.3/scf2.out: highest occupied, lowest unoccupied level (ev): 12.8713 12.8486: -0.0227
> > > LVO_7.0_2.8_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 13.0475 12.5857: -0.4618
> > > LVO_7.0_2.8_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 13.0963 12.6256: -0.4707
> > > LVO_7.0_2.8_0.3/scf2.out: highest occupied, lowest unoccupied level (ev): 13.2079 12.7589: -0.449
> > > LVO_7.0_2.9_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 13.0322 12.5719: -0.4603
> > > LVO_7.0_2.9_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 13.0812 12.6120: -0.4692
> > > LVO_7.0_2.9_0.3/scf2.out: highest occupied, lowest unoccupied level (ev): 13.1948 12.7461: -0.4487
> > > LVO_8.0_2.7_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 13.1005 12.6553: -0.4452
> > > LVO_8.0_2.7_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 13.1584 12.6600: -0.4984
> > > LVO_8.0_2.8_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 13.1190 12.6864: -0.4326
> > > LVO_8.0_2.8_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 12.8723 12.8683: -0.004
> > > LVO_8.0_2.9_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.8057 12.8032: -0.0025
> > > LVO_8.0_2.9_0.15/scf2.out: highest occupied, lowest unoccupied level (ev): 12.8566 12.8521: -0.0045
> > > LVO_8.0_2.9_0.3/scf2.out: highest occupied, lowest unoccupied level (ev): 12.9079 12.9012: -0.0067
> > >
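For anyone who wants to screen such a list automatically, here is a minimal sketch that extracts the two levels from each line and flags runs with a gap. The regular expression matches the line format shown above; the tolerance of 0.05 eV for calling a run "gapped" is an arbitrary choice of mine, not a QE convention.

```python
import re

# Matches the pw.x line "highest occupied, lowest unoccupied level (ev): HOMO LUMO"
LEVELS = re.compile(
    r"highest occupied, lowest unoccupied level \(ev\):\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)"
)


def gap_from_line(line):
    """Return LUMO - HOMO in eV, or None if the line does not match."""
    match = LEVELS.search(line)
    if match is None:
        return None
    homo, lumo = (float(value) for value in match.groups())
    return lumo - homo


def classify(gap, tol=0.05):
    """Call a run 'gapped' only if the gap exceeds tol (eV); tol is an assumption."""
    return "gapped" if gap > tol else "metallic"


if __name__ == "__main__":
    sample = [
        "LVO_5.0_2.7_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.3585 13.4614: 1.1029",
        "LVO_5.0_2.9_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.7631 12.7638: 0.0007",
        "LVO_7.0_2.7_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 13.0819 12.6240: -0.4579",
    ]
    for line in sample:
        gap = gap_from_line(line)
        print(f"{line.split(':')[0]:28s} gap = {gap:8.4f} eV  {classify(gap)}")
```

A negative or near-zero difference means the "highest occupied" level lies at or above the "lowest unoccupied" one, i.e. the fixed-occupation run is effectively metallic.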
> > > Would you suggest to take a parameter set from this (e.g LVO_5.0_2.7_0.0/scf2.out: highest occupied, lowest unoccupied level (ev): 12.3585 13.4614: 1.1029) and start the HP scheme from there on?
> > >
> > > All the best,
> > > Simon
> > >
> > > On Thursday, January 25, 2024 12:43 CET, Timrov Iurii <iurii.timrov at psi.ch> wrote:
> > >
> > > > Dear Simon,
> > > >
> > > > If you check the output file of the second SCF calculation, you will see this:
> > > > highest occupied, lowest unoccupied level (ev): 13.2680 12.9953
> > > >
> > > > This means that the system is metallic, and hence you should not use a two-step SCF procedure. Just perform the first SCF calculation with smearing and then proceed to the HP calculation. Or, if the system is experimentally known to be insulating, you can add some finite value of U to the V-3d states, which should open a gap, and then proceed with the two-step SCF procedure plus HP.
> > > >
> > > > HTH
> > > >
> > > > Iurii
> > > >
> > > > ________________________________
> > > > From: users <users-bounces at lists.quantum-espresso.org> on behalf of Simon Imanuel Rombauer <simon.rombauer at student.uni-augsburg.de>
> > > > Sent: Wednesday, January 24, 2024 20:42
> > > > To: users at lists.quantum-espresso.org <users at lists.quantum-espresso.org>
> > > > Subject: [QE-users] hp.x Error in routine cdiaghg (270): problems computing cholesky
> > > >
> > > > Dear QE users,
> > > >
> > > > For some time I have been trying to find suitable DFT+U+V parameters for the band structure of orthorhombic LaVO3. I was limited in computational resources, so I tried to tune the parameters manually to match the experimental band gap. This was very tedious, and most calculations did not converge at all. Now I have more CPU cores to work with and want to use the hp.x code to calculate the parameters using DFPT. I followed examples 02 and 06 from the documentation; that is, I first ran an scf calculation of LVO using smearing and a starting magnetization, and then did a second scf run with fixed occupations and total magnetization = 0. Then I split the HP calculation for each perturbed atom. It always ends with "Error in routine cdiaghg (270): problems computing cholesky". I have tried changing mixing_mode and mixing_beta, raising ecutwfc and ecutrho, and lowering conv_thr, but nothing worked. (Input/output files attached.)
> > > >
> > > > Any ideas are highly appreciated, also on how to speed up the calculations; the scf step still seems rather slow.
> > > > All the best and have a nice day
> > > >
> > > > Simon Rombauer
> > > > Master Student Physics
> > > > University Augsburg
> > > > Germany
> > > >
> > > > PS: I manually changed the occupation in the La PP from 5d to 4f, but even when I left the PP as it is and simply tried to calculate U for La-5d it crashed with the same error.
> >