[QE-users] Optimal pw command line for large systems and only Gamma point

Antonio Cammarata cammaant at fel.cvut.cz
Mon May 13 15:15:02 CEST 2024


I did some tests. For 1000 Si atoms I use 2010 bands because I need
the band gap value (1000 Si atoms give 2000 occupied bands, so 2010
leaves 10 empty states); moreover, since the system is a cluster, the
surface states of the truncated bonds might close the gap, especially
at the first steps of the geometry optimization, so it is better to
use a few empty bands. I managed to run the calculation with 10 nodes
and at most 40 cores per node. My question now is: can you suggest
optimal command-line options and/or input settings to speed up the
calculation and, if possible, also reduce the number of nodes? The
relevant parameters in the input file are the following:

     input_dft= 'pz'
     ecutwfc= 25
     occupations= 'smearing'
     smearing= 'cold'
     degauss= 0.05 ! I know it's quite large, but necessary to
                   ! stabilize the SCF at this preliminary stage
                   ! (no geometry step done yet)
     nbnd= 2010

     diagonalization= 'ppcg'
     mixing_mode= 'plain'
     mixing_beta= 0.4
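
Following Paolo's earlier advice to use at most a few tens of
processors for the linear algebra, one possible launch line would be
something like the sketch below (the value -nd 64 is only an
illustrative choice, not a tested optimum; a square number is natural
since the parallel linear algebra works on a square processor grid):

     mpirun -np 400 pw.x -nk 1 -nt 1 -nd 64 -inp qe.in > qe.out

With only the Gamma point, -nk 1 is the only sensible choice, so the
remaining parallelism goes into the plane-wave/FFT distribution.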

The time spent per SCF cycle is currently about 33 minutes. I use QE
v. 7.3 compiled with OpenMPI and ScaLAPACK. I also have access to the
Intel compilers, but in my tests the difference was only tens of
seconds. I have only the Gamma point; here is some information about
the grid and the estimated RAM usage:

      Dense  grid: 24616397 G-vectors     FFT dimensions: ( 375, 375, 375)
      Dynamical RAM for                 wfc:     235.91 MB
      Dynamical RAM for     wfc (w. buffer):     235.91 MB
      Dynamical RAM for           str. fact:       0.94 MB
      Dynamical RAM for           local pot:       0.00 MB
      Dynamical RAM for          nlocal pot:    2112.67 MB
      Dynamical RAM for                qrad:       0.80 MB
      Dynamical RAM for          rho,v,vnew:       6.04 MB
      Dynamical RAM for               rhoin:       2.01 MB
      Dynamical RAM for            rho*nmix:      15.03 MB
      Dynamical RAM for           G-vectors:       3.99 MB
      Dynamical RAM for          h,s,v(r/c):       0.46 MB
      Dynamical RAM for          <psi|beta>:     552.06 MB
      Dynamical RAM for      wfcinit/wfcrot:    1305.21 MB
      Estimated static dynamical RAM per process >       2.31 GB
      Estimated max dynamical RAM per process >       3.60 GB
      Estimated total dynamical RAM >    1441.34 GB
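
One more detail on the Gamma-only setup: the Gamma-specific
algorithms in pw.x (real wavefunctions, only half of the G-vectors
stored, which roughly halves the wavefunction memory and the FFT
work) are enabled by the K_POINTS card with the 'gamma' option, not
by an explicit 1x1x1 grid. A minimal sketch, in case it is not
already in use here:

     K_POINTS gamma

instead of

     K_POINTS automatic
      1 1 1 0 0 0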

Thanks a lot in advance for your kind help.

All the best

Antonio


On 10. 05. 24 12:01, Paolo Giannozzi wrote:
> On 5/10/24 08:58, Antonio Cammarata via users wrote:
>
>> pw.x -nk 1 -nt 1 -nb 1 -nd 768 -inp qe.in > qe.out
>
> too many processors for linear-algebra parallelization. 1000 Si atoms 
> = 2000 bands (assuming an insulator with no spin polarization). Use a 
> few tens of processors at most
>
>> "some processors have no G-vectors for symmetrization". 
>
> which sounds strange to me: with the Gamma point, symmetrization is
> not even needed
>
>
>>       Dense  grid: 30754065 G-vectors FFT dimensions: ( 400, 400, 400)
>
> This is what a 256-atom Si supercell with 30 Ry cutoff yields:
>
>      Dense  grid:   825897 G-vectors     FFT dimensions: ( 162, 162, 162)
>
> I guess you may reduce the size of your supercell
>
> Paolo
>
>>       Dynamical RAM for                 wfc:     153.50 MB
>>       Dynamical RAM for     wfc (w. buffer):     153.50 MB
>>       Dynamical RAM for           str. fact:       0.61 MB
>>       Dynamical RAM for           local pot:       0.00 MB
>>       Dynamical RAM for          nlocal pot:    1374.66 MB
>>       Dynamical RAM for                qrad:       0.87 MB
>>       Dynamical RAM for          rho,v,vnew:       5.50 MB
>>       Dynamical RAM for               rhoin:       1.83 MB
>>       Dynamical RAM for            rho*nmix:       9.78 MB
>>       Dynamical RAM for           G-vectors:       2.60 MB
>>       Dynamical RAM for          h,s,v(r/c):       0.25 MB
>>       Dynamical RAM for          <psi|beta>:     552.06 MB
>>       Dynamical RAM for      wfcinit/wfcrot:     977.20 MB
>>       Estimated static dynamical RAM per process >       1.51 GB
>>       Estimated max dynamical RAM per process >       2.47 GB
>>       Estimated total dynamical RAM >    1900.41 GB
>>
>> I managed to run the simulation with 512 atoms, cg diagonalization,
>> and 3 nodes on the same machine with the command line
>>
>> pw.x -nk 1 -nt 1 -nd 484 -inp qe.in > qe.out
>>
>> Please, do you have any suggestions on how to set optimal
>> parallelization parameters to avoid the memory issue and run the
>> calculation? I am also planning to run simulations on nanoclusters
>> with more than 1000 atoms.
>>
>> Thanks a lot in advance for your kind help.
>>
>> Antonio
>>
>>
>
-- 
_______________________________________________
Antonio Cammarata, PhD in Physics
Associate Professor in Applied Physics
Advanced Materials Group
Department of Control Engineering - KN:G-204
Faculty of Electrical Engineering
Czech Technical University in Prague
Karlovo Náměstí, 13
121 35, Prague 2, Czech Republic
Phone: +420 224 35 5711
Fax:   +420 224 91 8646
ORCID: orcid.org/0000-0002-5691-0682
WoS ResearcherID: A-4883-2014


