[QE-users] hp.x Error in routine cdiaghg (270): problems computing cholesky

Mon Jan 29 12:31:24 CET 2024

Dear Iurii,

many thanks for your suggestion. In the meantime I have switched to QE7.3 and tried to recalculate the Hubbard parameters for SrTiO3. In QE7.1 I split the calculation over the two atoms to be perturbed but did not split it over the q-points; that worked and I got reasonable U values.

Now, in QE7.3, I tried the same but increased the q-grid (from 5x5x5 in the QE7.1 run to 6x6x6, to check whether the values differ much). I also tried to manually parallelize the calculation across the q-points, as suggested in https://doi.org/10.48550/arXiv.2203.15684. This time, however, I ran the initial scf calculation with the U values obtained from the QE7.1 run.
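For reference, the manual splitting over perturbed atoms and q-points described above can be sketched as a small Python generator for the hp.x input files. This is only an illustration under stated assumptions: the prefix, outdir, atom indices, number of q-points and file names are hypothetical placeholders (the real number of q-points has to be taken from an hp.x output for the chosen mesh), and the three-stage layout (one run per atom and q-point, one `sum_pertq` run per atom, one final `compute_hp` run) reflects the hp.x input variables `perturb_only_atom`, `start_q`, `last_q`, `sum_pertq` and `compute_hp` as described in the hp.x input documentation, not the inputs actually used in this thread.

```python
def hp_input(prefix, outdir, nq, atom=None, q=None,
             sum_pertq=False, compute_hp=False):
    """Return the text of one hp.x input file (&inputhp namelist)."""
    lines = [
        "&inputhp",
        f"   prefix = '{prefix}',",
        f"   outdir = '{outdir}',",
        f"   nq1 = {nq[0]}, nq2 = {nq[1]}, nq3 = {nq[2]},",
    ]
    if atom is not None:
        # Restrict the run to a single perturbed Hubbard atom.
        lines.append(f"   perturb_only_atom({atom}) = .true.,")
    if q is not None:
        # Restrict the run to a single q-point.
        lines.append(f"   start_q = {q},")
        lines.append(f"   last_q = {q},")
    if sum_pertq:
        # Collect the per-q pieces for this atom.
        lines.append("   sum_pertq = .true.,")
    if compute_hp:
        # Collect all pieces and compute the Hubbard parameters.
        lines.append("   compute_hp = .true.,")
    lines.append("/")
    return "\n".join(lines) + "\n"


def plan_runs(prefix, outdir, nq, atoms, n_q):
    """Map hypothetical file names to input texts for the whole workflow."""
    runs = {}
    for atom in atoms:
        for q in range(1, n_q + 1):
            runs[f"hp.atom{atom}.q{q}.in"] = hp_input(
                prefix, outdir, nq, atom=atom, q=q)
        runs[f"hp.atom{atom}.sum.in"] = hp_input(
            prefix, outdir, nq, atom=atom, sum_pertq=True)
    runs["hp.final.in"] = hp_input(prefix, outdir, nq, compute_hp=True)
    return runs


if __name__ == "__main__":
    # All values below are placeholders, not taken from the attached inputs.
    runs = plan_runs("sto", "./tmp", (6, 6, 6), atoms=[2, 3], n_q=4)
    for name in sorted(runs):
        print(name)
    print(runs["hp.atom2.q1.in"])
```

With one q-point per job, each hp.x process handles a single q-point and exits, so the memory it holds cannot accumulate from one q-point to the next.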

I always run into an out-of-memory (OOM) problem when submitting the calculations to the HPC cluster (both for the split-up hp.x calculations and for the all-in-one hp.x run). Of course, I requested more and more RAM, but after some hours it always stops. After talking to the system administrator of the HPC cluster, it seems that the code allocates more and more RAM with each q-point without freeing it afterwards. The last kernel log read:

> Jan 28 13:51:45 alcc141 kernel: Tasks state (memory values in pages):
> Jan 28 13:51:45 alcc141 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> Jan 28 13:51:45 alcc141 kernel: [1731714]     0 1731714    52913     1945     81920        0         -1000 slurmstepd
> Jan 28 13:51:45 alcc141 kernel: [1731726] 150537 1731726     2515      835     57344        0             0 bash
> Jan 28 13:51:45 alcc141 kernel: [1731765] 150537 1731765    87567     2054     90112        0             0 srun
> Jan 28 13:51:45 alcc141 kernel: [1731766] 150537 1731766     3517      197     53248        0             0 srun
> Jan 28 13:51:45 alcc141 kernel: [1731775]     0 1731775    86282     2020     90112        0         -1000 slurmstepd
> Jan 28 13:51:45 alcc141 kernel: [1731781] 150537 1731781 19145371 18290506 147206144        0             0 hp.x
> Jan 28 13:51:45 alcc141 kernel: [1731782] 150537 1731782   922590    65967   1085440        0             0 hp.x
> Jan 28 13:51:45 alcc141 kernel: [1731783] 150537 1731783 19134592 18282298 147103744        0             0 hp.x
> Jan 28 13:51:45 alcc141 kernel: [1731784] 150537 1731784   921010    69907   1077248        0             0 hp.x
> Jan 28 13:51:45 alcc141 kernel: [1731785] 150537 1731785 19140587 18285101 147156992        0             0 hp.x
> Jan 28 13:51:45 alcc141 kernel: [1731786] 150537 1731786   924966    69686   1105920        0             0 hp.x
> Jan 28 13:51:45 alcc141 kernel: [1731787] 150537 1731787 19133569 18279873 147111936        0             0 hp.x
> Jan 28 13:51:45 alcc141 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=step_0,mems_allowed=0-1,oom_memcg=/slurm/uid_150537/job_863717,task_memcg=/
> Jan 28 13:51:45 alcc141 kernel: Memory cgroup out of memory: Killed process 1731781 (hp.x) total-vm:76581484kB, anon-rss:73097632kB, file-rss:8400kB, shmem-rss:55992kB
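
To put the page counts in the log above into more familiar units, here is a minimal Python sketch that converts them to GiB. It assumes a page size of 4 KiB. That value is not stated in the log, but it is consistent with it: 18290506 pages x 4 KiB = 73162024 kB, which equals the sum anon-rss + file-rss + shmem-rss reported for the killed process, and 19145371 pages x 4 KiB = 76581484 kB matches its total-vm. The function names are chosen for illustration only.

```python
# Assumption: 4 KiB pages (consistent with the kB values in the oom-kill line).
PAGE_KIB = 4

# (pid, total_vm in pages, rss in pages) for the hp.x tasks in the kernel log.
HP_TASKS = [
    (1731781, 19145371, 18290506),
    (1731782, 922590, 65967),
    (1731783, 19134592, 18282298),
    (1731784, 921010, 69907),
    (1731785, 19140587, 18285101),
    (1731786, 924966, 69686),
    (1731787, 19133569, 18279873),
]


def pages_to_kib(pages, page_kib=PAGE_KIB):
    """Convert a page count to KiB."""
    return pages * page_kib


def kib_to_gib(kib):
    """Convert KiB to GiB."""
    return kib / 1024**2


def summarize(tasks):
    """Return per-process resident memory in GiB and the total over all tasks."""
    rows = [(pid, kib_to_gib(pages_to_kib(rss))) for pid, _vm, rss in tasks]
    total = sum(gib for _pid, gib in rows)
    return rows, total


if __name__ == "__main__":
    rows, total = summarize(HP_TASKS)
    for pid, gib in rows:
        print(f"pid {pid}: rss = {gib:7.2f} GiB")
    print(f"all hp.x tasks: rss = {total:7.2f} GiB")
```

Under this assumption, four of the seven hp.x processes each hold roughly 70 GiB of resident memory, while the other three hold about 0.25 GiB each.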

Is this normal and expected behaviour, and should I split the calculation further so that each run calculates only a single q-point? In QE7.1 I only had convergence issues but never OOM. The input, output, and slurm output of the last run are attached. I may add that the code never crashes; I have to kill the run manually once I notice the OOM.

Best regards and have a nice week,
Simon

On Thursday, January 25, 2024 17:36 CET, Timrov Iurii <iurii.timrov at psi.ch> wrote:

> Dear Simon,
> 
> You can compute Hubbard parameters using HP on top of the metallic ground state (i.e. with U=0 for your system). Just do one scf with smearing in that case.
> 
> > Would you suggest to take a parameter set from this (e.g LVO_5.0_2.7_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.3585   13.4614: 1.1029) and start the HP scheme from there on?
> 
> Do you mean that you used U=5 eV for La-4f, U=2.7 eV for V-3d, and U=0 for O-2p? I would try a smaller starting U value for La-4f, e.g. 3.2 eV with ortho-atomic orbitals [see PRR 2, 033265 (2020)]. So maybe check first whether you still have a gap with U=3.2 eV for La-4f?
> 
> HTH
> 
> Iurii
> 
> ----------------------------------------------------------
> Dr. Iurii TIMROV
> Tenure-track scientist
> Laboratory for Materials Simulations (LMS)
> Paul Scherrer Institut (PSI)
> CH-5232 Villigen, Switzerland
> +41 56 310 62 14
> https://www.psi.ch/en/lms/people/iurii-timrov
> ________________________________
> From: Simon Imanuel Rombauer <simon.rombauer at student.uni-augsburg.de>
> Sent: Thursday, January 25, 2024 17:20
> To: Timrov Iurii <iurii.timrov at psi.ch>; users at lists.quantum-espresso.org <users at lists.quantum-espresso.org>
> Subject: Re: [QE-users] hp.x Error in routine cdiaghg (270): problems computing cholesky
> 
> Sending again since I feel it didn't work.
> On Thursday, January 25, 2024 12:59 CET, "Simon Imanuel Rombauer" <simon.rombauer at student.uni-augsburg.de> wrote:
> 
> > Dear Iurii,
> >
> > thank you for your response. Yes, I had noticed this; I thought HP could start from this 'false state' and calculate the U parameters so that they correctly reflect the Mott-insulator behavior.
> > I also computed a few scf DFT+U runs with the U value of V-3d ranging from 2.7 to 2.9 eV, many of which turned out to be metallic. See the list below (directories are named LVO_U(La-4f)_U(V-3d)_V(O-2p V-3d)):
> >
> > LVO_5.0_2.7_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.3585   13.4614: 1.1029
> > LVO_5.0_2.7_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.7936   12.7687: -0.0249
> > LVO_5.0_2.7_0.3/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.8425   12.8166: -0.0259
> > LVO_5.0_2.8_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.4686   13.3603: 0.8917
> > LVO_5.0_2.9_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.7631   12.7638: 0.0007
> > LVO_5.0_2.9_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.3092   13.5493: 1.2401
> > LVO_5.0_2.9_0.3/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.4914   13.4507: 0.9593
> > LVO_6.0_2.7_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.8108   12.8112: 0.0004
> > LVO_6.0_2.7_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.8090   12.7859: -0.0231
> > LVO_6.0_2.8_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.4816   13.3813: 0.8997
> > LVO_6.0_2.8_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.0800   12.6079: -0.4721
> > LVO_6.0_2.8_0.3/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.5317   13.4616: 0.9299
> > LVO_6.0_2.9_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.0412   12.5850: -0.4562
> > LVO_6.0_2.9_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.3227   13.5668: 1.2441
> > LVO_6.0_2.9_0.3/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.4946   13.4855: 0.9909
> > LVO_7.0_2.7_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.0819   12.6240: -0.4579
> > LVO_7.0_2.7_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.1313   12.6651: -0.4662
> > LVO_7.0_2.7_0.3/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.8713   12.8486: -0.0227
> > LVO_7.0_2.8_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.0475   12.5857: -0.4618
> > LVO_7.0_2.8_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.0963   12.6256: -0.4707
> > LVO_7.0_2.8_0.3/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.2079   12.7589: -0.449
> > LVO_7.0_2.9_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.0322   12.5719: -0.4603
> > LVO_7.0_2.9_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.0812   12.6120: -0.4692
> > LVO_7.0_2.9_0.3/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.1948   12.7461: -0.4487
> > LVO_8.0_2.7_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.1005   12.6553: -0.4452
> > LVO_8.0_2.7_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.1584   12.6600: -0.4984
> > LVO_8.0_2.8_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    13.1190   12.6864: -0.4326
> > LVO_8.0_2.8_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.8723   12.8683: -0.004
> > LVO_8.0_2.9_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.8057   12.8032: -0.0025
> > LVO_8.0_2.9_0.15/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.8566   12.8521: -0.0045
> > LVO_8.0_2.9_0.3/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.9079   12.9012: -0.0067
> >
> > Would you suggest to take a parameter set from this (e.g LVO_5.0_2.7_0.0/scf2.out:      highest occupied, lowest unoccupied level (ev):    12.3585   13.4614: 1.1029) and start the HP scheme from there on?
> >
> > All the best,
> > Simon
> >
> > On Thursday, January 25, 2024 12:43 CET, Timrov Iurii <iurii.timrov at psi.ch> wrote:
> >
> > > Dear Simon,
> > >
> > > If you check the output file of the second SCF calculation, you will see this:
> > >  highest occupied, lowest unoccupied level (ev):    13.2680   12.9953
> > >
> > > This means that the system is metallic, and hence you should not use a two-step SCF procedure. Just perform the first SCF calculation with smearing and then proceed to the HP calculation. Or, if the system is experimentally known to be insulating, you can add some finite value of U to the V-3d states, which should open a gap, and then proceed with the two-step SCF procedure plus HP.
> > >
> > > HTH
> > >
> > > Iurii
> > >
> > > ________________________________
> > > From: users <users-bounces at lists.quantum-espresso.org> on behalf of Simon Imanuel Rombauer <simon.rombauer at student.uni-augsburg.de>
> > > Sent: Wednesday, January 24, 2024 20:42
> > > To: users at lists.quantum-espresso.org <users at lists.quantum-espresso.org>
> > > Subject: [QE-users] hp.x Error in routine cdiaghg (270): problems computing cholesky
> > >
> > > Dear QE users,
> > >
> > > for some time I have been trying to find suitable DFT+U+V parameters for the band structure of orthorhombic LaVO3. I was limited in computational resources, so I tried to tune the parameters manually to match the experimental band gap. This was very tedious and most calculations did not converge at all. Now I have more CPU cores to work with and want to use the hp.x code to calculate the parameters using DFPT. I followed examples 02 and 06 from the documentation: I first ran an scf calculation of LVO with smearing and a starting magnetization, and then did a second scf run with fixed occupations and total magnetization = 0. Then I split the HP calculation by perturbed atom. It always ends with "Error in routine cdiaghg (270): problems computing cholesky". I have tried changing mixing_mode and mixing_beta, raising ecutwfc and ecutrho, and lowering conv_thr, but nothing worked. (Input/output files are attached.)
> > >
> > > Any idea is highly appreciated, also on how to speed up calculations, it still seems rather slow when calculating scf.
> > > All the best and have a nice day
> > >
> > > Simon Rombauer
> > > Master Student Physics
> > > University Augsburg
> > > Germany
> > >
> > > PS: I manually changed the occupation in the La PP from 5d to 4f, but even when I left the PP as it is and simply tried to calculate U for La-5d it crashed with the same error.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sto_hp.7z
Type: application/octet-stream
Size: 84692 bytes
Desc: not available
URL: <http://lists.quantum-espresso.org/pipermail/users/attachments/20240129/52b045a2/attachment.obj>