[QE-users] [QE-GPU] How to "Fill the CPU with OpenMP threads" to run QE-GPU

Anson Thomas thomasanson53 at gmail.com
Fri Oct 8 15:17:55 CEST 2021

I am trying to run GPU-enabled QE (QE 6.8) on Ubuntu 18.04.5 LTS
(GNU/Linux 4.15.0-135-generic x86_64). System configuration:

  Processor:     Intel Xeon Gold 5120 CPU, 2.20 GHz (2 processors)
  RAM:           96 GB
  HDD:           6 TB
  Graphics card: NVIDIA Quadro P5000 (16 GB)

I can successfully run small jobs (with dynamical RAM of about 1 GB).
However, when I move to larger systems (still less than 16 GB), the output
stops abruptly during the first iteration (attached below):

     Program PWSCF v.6.8 starts on  8Oct2021 at 10:33:9

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
         "P. Giannozzi et al., J. Chem. Phys. 152 154105 (2020);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details

     Parallel version (MPI & OpenMP), running on     784 processor cores
     Number of MPI processes:                28
     Threads/MPI process:                    28

     MPI processes distributed on     1 nodes
     R & G space division:  proc/nbgrp/npool/nimage =      28
     43440 MiB available memory on the printing compute node when the environment starts

     Reading input from 001.in
Warning: card &CELL ignored
Warning: card / ignored

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) =  40000
     Max angular momentum in pseudopotentials (lmaxx) =  4
     file Ti.pbe-spn-rrkjus_psl.1.0.0.upf: wavefunction(s)  3S 3D

     gamma-point specific algorithms are used
     Found symmetry operation: I + ( -0.0000 -0.5000  0.0000)
     This is a supercell, fractional translations are disabled

     Subspace diagonalization in iterative solution of the eigenvalue
     a serial algorithm will be used

     Parallelization info
     sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
     Min         637     232     57                81572    18102    2258
     Max         640     234     60                81588    18118    2266
     Sum       17865    6549   1633              2284245   507201   63345

     Using Slab Decomposition

     bravais-lattice index     =           14
     lattice parameter (alat)  =      21.0379  a.u.
     unit-cell volume          =    9204.2807 (a.u.)^3
     number of atoms/cell      =           36
     number of atomic types    =            2
     number of electrons       =       288.00
     number of Kohn-Sham states=          173
     kinetic-energy cutoff     =      55.0000  Ry
     charge density cutoff     =     600.0000  Ry
     scf convergence threshold =      1.0E-06
     mixing beta               =       0.4000
     number of iterations used =            8  local-TF  mixing
     energy convergence thresh.=      1.0E-04
     force convergence thresh. =      1.0E-03
     Exchange-correlation= PBE
                           (   1   4   3   4   0   0   0)
     nstep                     =          500

     GPU acceleration is ACTIVE.

     Message from routine print_cuda_info:
     High GPU oversubscription detected. Are you sure this is what you want?

     GPU used by master process:

        Device Number: 0
        Device name: Quadro P5000
        Compute capability : 61
        Ratio of single to double precision performance  : 32
        Memory Clock Rate (KHz): 4513000
        Memory Bus Width (bits): 256
        Peak Memory Bandwidth (GB/s): 288.83

     celldm(1)=  21.037943  celldm(2)=   1.000000  celldm(3)=   2.419041
     celldm(4)=  -0.766650  celldm(5)=  -0.766650  celldm(6)=   0.533303

     crystal axes: (cart. coord. in units of alat)
               a(1) = (   1.000000   0.000000   0.000000 )
               a(2) = (   0.533303   0.845924   0.000000 )
               a(3) = (  -1.854558  -1.023161   1.168553 )

     reciprocal axes: (cart. coord. in units 2 pi/alat)
               b(1) = (  1.000000 -0.630438  1.035056 )
               b(2) = ( -0.000000  1.182139  1.035056 )
               b(3) = (  0.000000  0.000000  0.855759 )

     PseudoPot. # 1 for Ti read from file:
     MD5 check sum: e281089c08e14b8efcf92e44a67ada65
     Pseudo is Ultrasoft + core correction, Zval = 12.0
     Generated using "atomic" code by A. Dal Corso  v.6.2.2
     Using radial grid of 1177 points,  6 beta functions with:
                l(1) =   0
                l(2) =   0
                l(3) =   1
                l(4) =   1
                l(5) =   2
                l(6) =   2
     Q(r) pseudized with 0 coefficients

     PseudoPot. # 2 for O  read from file:
     MD5 check sum: 91400c9766925bcf19f520983a725ff0
     Pseudo is Ultrasoft + core correction, Zval =  6.0
     Generated using "atomic" code by A. Dal Corso  v.6.3MaX
     Using radial grid of 1095 points,  4 beta functions with:
                l(1) =   0
                l(2) =   0
                l(3) =   1
                l(4) =   1
     Q(r) pseudized with 0 coefficients

     atomic species   valence    mass     pseudopotential
        Ti            12.00    47.86700     Ti( 1.00)
        O              6.00    15.99940     O ( 1.00)

     Starting magnetic structure
     atomic species   magnetization
        Ti           0.200
        O            0.000

     No symmetry found

                                    s                        frac. trans.

      isym =  1     identity

 cryst.   s( 1) = (     1          0          0      )
                  (     0          1          0      )
                  (     0          0          1      )

 cart.    s( 1) = (  1.0000000  0.0000000  0.0000000 )
                  (  0.0000000  1.0000000  0.0000000 )
                  (  0.0000000  0.0000000  1.0000000 )

     point group C_1 (1)
     there are  1 classes
     the character table:

A      1.00

     the symmetry operations in each class and the name of the first

     E        1

   Cartesian axes

     site n.     atom                  positions (alat units)
         1           O   tau(   1) = (  -0.8353365  -0.5987815   0.7050395
         2           Ti  tau(   2) = (  -0.6772809  -0.5115821   0.7050395
         3           O   tau(   3) = (  -0.5192254  -0.4243827   0.7050395
         4           Ti  tau(   4) = (  -0.9272815  -0.5115821   0.5842738
         5           O   tau(   5) = (  -0.7692260  -0.4243827   0.5842738
         6           O   tau(   6) = (  -0.3186838  -0.1758181   0.5842738
         7           O   tau(   7) = (  -0.4520098  -0.3872999   0.4635080
         8           Ti  tau(   8) = (  -0.2939543  -0.3001004   0.4635080
         9           O   tau(   9) = (  -0.1358987  -0.2129011   0.4635080
        10           O   tau(  10) = (  -0.5686844  -0.1758181   0.7050395
        11           Ti  tau(  11) = (  -0.4106289  -0.0886188   0.7050395
        12           O   tau(  12) = (  -0.2525734  -0.0014194   0.7050395
        13           Ti  tau(  13) = (  -0.6606296  -0.0886188   0.5842738
        14           O   tau(  14) = (  -0.5025740  -0.0014194   0.5842738
        15           O   tau(  15) = (  -0.0520318   0.2471452   0.5842738
        16           O   tau(  16) = (  -0.1853578   0.0356635   0.4635080
        17           Ti  tau(  17) = (  -0.0273023   0.1228629   0.4635080
        18           O   tau(  18) = (   0.1307533   0.2100623   0.4635080
        19           O   tau(  19) = (  -0.3353351  -0.5987815   0.7050395
        20           Ti  tau(  20) = (  -0.1772797  -0.5115821   0.7050395
        21           O   tau(  21) = (  -0.0192241  -0.4243827   0.7050395
        22           Ti  tau(  22) = (  -0.4272803  -0.5115821   0.5842738
        23           O   tau(  23) = (  -0.2692247  -0.4243827   0.5842738
        24           O   tau(  24) = (   0.1813175  -0.1758181   0.5842738
        25           O   tau(  25) = (   0.0479915  -0.3872999   0.4635080
        26           Ti  tau(  26) = (   0.2060470  -0.3001004   0.4635080
        27           O   tau(  27) = (   0.3641026  -0.2129011   0.4635080
        28           O   tau(  28) = (  -0.0686832  -0.1758181   0.7050395
        29           Ti  tau(  29) = (   0.0893724  -0.0886188   0.7050395
        30           O   tau(  30) = (   0.2474280  -0.0014194   0.7050395
        31           Ti  tau(  31) = (  -0.1606282  -0.0886188   0.5842738
        32           O   tau(  32) = (  -0.0025728  -0.0014194   0.5842738
        33           O   tau(  33) = (   0.4479695   0.2471452   0.5842738
        34           O   tau(  34) = (   0.3146435   0.0356635   0.4635080
        35           Ti  tau(  35) = (   0.4726991   0.1228629   0.4635080
        36           O   tau(  36) = (   0.6307546   0.2100623   0.4635080

   Crystallographic axes

     site n.     atom                  positions (cryst. coord.)
         1           O   tau(   1) = (  0.2719137  0.0219125  0.6033439  )
         2           Ti  tau(   2) = (  0.3749954  0.1249943  0.6033439  )
         3           O   tau(   3) = (  0.4780771  0.2280761  0.6033439  )
         4           Ti  tau(   4) = ( -0.0000046 -0.0000050  0.4999975  )
         5           O   tau(   5) = (  0.1030772  0.1030768  0.4999975  )
         6           O   tau(   6) = (  0.3969147  0.3969146  0.4999975  )
         7           O   tau(   7) = (  0.2719156  0.0219145  0.3966511  )
         8           Ti  tau(   8) = (  0.3749973  0.1249964  0.3966511  )
         9           O   tau(   9) = (  0.4780790  0.2280781  0.3966511  )
        10           O   tau(  10) = (  0.2719134  0.5219140  0.6033439  )
        11           Ti  tau(  11) = (  0.3749952  0.6249957  0.6033439  )
        12           O   tau(  12) = (  0.4780769  0.7280775  0.6033439  )
        13           Ti  tau(  13) = ( -0.0000048  0.4999964  0.4999975  )
        14           O   tau(  14) = (  0.1030769  0.6030781  0.4999975  )
        15           O   tau(  15) = (  0.3969145  0.8969160  0.4999975  )
        16           O   tau(  16) = (  0.2719153  0.5219160  0.3966511  )
        17           Ti  tau(  17) = (  0.3749970  0.6249978  0.3966511  )
        18           O   tau(  18) = (  0.4780787  0.7280796  0.3966511  )
        19           O   tau(  19) = (  0.7719150  0.0219125  0.6033439  )
        20           Ti  tau(  20) = (  0.8749966  0.1249943  0.6033439  )
        21           O   tau(  21) = (  0.9780784  0.2280761  0.6033439  )
        22           Ti  tau(  22) = (  0.4999967 -0.0000050  0.4999975  )
        23           O   tau(  23) = (  0.6030784  0.1030768  0.4999975  )
        24           O   tau(  24) = (  0.8969160  0.3969146  0.4999975  )
        25           O   tau(  25) = (  0.7719169  0.0219145  0.3966511  )
        26           Ti  tau(  26) = (  0.8749985  0.1249964  0.3966511  )
        27           O   tau(  27) = (  0.9780803  0.2280781  0.3966511  )
        28           O   tau(  28) = (  0.7719147  0.5219140  0.6033439  )
        29           Ti  tau(  29) = (  0.8749965  0.6249957  0.6033439  )
        30           O   tau(  30) = (  0.9780782  0.7280775  0.6033439  )
        31           Ti  tau(  31) = (  0.4999965  0.4999964  0.4999975  )
        32           O   tau(  32) = (  0.6030782  0.6030781  0.4999975  )
        33           O   tau(  33) = (  0.8969158  0.8969160  0.4999975  )
        34           O   tau(  34) = (  0.7719166  0.5219160  0.3966511  )
        35           Ti  tau(  35) = (  0.8749983  0.6249978  0.3966511  )
        36           O   tau(  36) = (  0.9780801  0.7280796  0.3966511  )

     number of k points=     1  Gaussian smearing, width (Ry)=  0.0100
                       cart. coord. in units 2pi/alat
        k(    1) = (   0.0000000   0.0000000   0.0000000), wk =   1.0000000

                       cryst. coord.
        k(    1) = (   0.0000000   0.0000000   0.0000000), wk =   1.0000000

     Dense  grid:  1142123 G-vectors     FFT dimensions: ( 180, 180, 400)

     Smooth grid:   253601 G-vectors     FFT dimensions: ( 100, 100, 243)

     Dynamical RAM for                 wfc:       2.99 MB

     Dynamical RAM for     wfc (w. buffer):       2.99 MB

     Dynamical RAM for           str. fact:       1.24 MB

     Dynamical RAM for           local pot:       0.00 MB

     Dynamical RAM for          nlocal pot:       7.05 MB

     Dynamical RAM for                qrad:       3.93 MB

     Dynamical RAM for          rho,v,vnew:      25.98 MB

     Dynamical RAM for               rhoin:       8.66 MB

     Dynamical RAM for           G-vectors:       2.40 MB

     Dynamical RAM for          h,s,v(r/c):       2.74 MB

     Dynamical RAM for          <psi|beta>:       0.54 MB

     Dynamical RAM for                 psi:       5.98 MB

     Dynamical RAM for                hpsi:       5.98 MB

     Dynamical RAM for                spsi:       5.98 MB

     Dynamical RAM for      wfcinit/wfcrot:       8.53 MB

     Dynamical RAM for           addusdens:     131.34 MB

     Dynamical RAM for          addusforce:     160.16 MB

     Estimated static dynamical RAM per process >      76.37 MB

     Estimated max dynamical RAM per process >     236.53 MB

     Estimated total dynamical RAM >       6.47 GB

     Check: negative core charge=   -0.000001
     Generating pointlists ...
     new r_m :   0.0722 (alat units)  1.5191 (a.u.) for type    1
     new r_m :   0.0722 (alat units)  1.5191 (a.u.) for type    2

     Initial potential from superposition of free atoms

     starting charge  287.98222, renormalised to  288.00000

     negative rho (up, down):  9.119E-05 6.477E-05
     Starting wfcs are  216 randomized atomic wfcs

     total cpu time spent up to now is       14.0 secs

     Self-consistent Calculation
[tb_dev] Currently allocated     2.23E+01 Mbytes, locked:    0 /   9
[tb_pin] Currently allocated     0.00E+00 Mbytes, locked:    0 /   0

     iteration #  1     ecut=    55.00 Ry     beta= 0.40
     Davidson diagonalization with overlap

---- Real-time Memory Report at c_bands before calling an iterative solver
           980 MiB given to the printing process from OS
             0 MiB allocation reported by mallinfo(arena+hblkhd)
         32000 MiB available memory on the node where the printing process
     GPU memory used/free/total (MiB): 11117 / 5152 / 16270
     ethr =  1.00E-02,  avg # of iterations =  1.5
The generated CRASH file says:

     task #        24
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #        14
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #         5
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #         7
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #        15
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #        17
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #        10
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #         9
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #        12
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #         4
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #        13
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

     task #        19
     from  addusdens_gpu  : error #         1
      cannot allocate aux2_d

Using -ndiag 1 and -ntg 1 with pw.x also gave similar output, with the
following additional lines:

     negative rho (up, down):  9.119E-05 6.477E-05
     Starting wfcs are  216 randomized atomic wfcs

     total cpu time spent up to now is       11.9 secs

     Self-consistent Calculation
[tb_dev] Currently allocated     3.21E+01 Mbytes, locked:    0 /   9
[tb_pin] Currently allocated     0.00E+00 Mbytes, locked:    0 /   0

     iteration #  1     ecut=    55.00 Ry     beta= 0.40
     Davidson diagonalization with overlap

---- Real-time Memory Report at c_bands before calling an iterative solver
          1036 MiB given to the printing process from OS
             0 MiB allocation reported by mallinfo(arena+hblkhd)
         36041 MiB available memory on the node where the printing process
     GPU memory used/free/total (MiB): 8915 / 7354 / 16270
     ethr =  1.00E-02,  avg # of iterations =  1.5
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58344,1],12]
  Exit code:    127
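
To put the numbers from the two logs side by side, here is a small Python
calculation. It only reuses figures printed above (28 MPI processes, 28
threads each, one GPU, a failed request of 156244752 bytes, 236.53 MB
estimated per process). The variable names are my own, and it makes no claim
about what pw.x does internally.

```python
# Figures copied from the pw.x logs above; nothing here is measured anew.
MIB = 1024 * 1024

mpi_processes = 28                # "Number of MPI processes"
threads_per_process = 28          # "Threads/MPI process"
gpus = 1                          # only "Device Number: 0" is reported
failed_request_bytes = 156244752  # one of the failed ALLOCATE requests
max_ram_per_process_mb = 236.53   # "Estimated max dynamical RAM per process"

# Total software threads requested (pw.x prints this as "processor cores").
total_threads = mpi_processes * threads_per_process

# Number of MPI processes sharing the single GPU.
processes_per_gpu = mpi_processes // gpus

# Size of one failed allocation, converted from bytes to MiB.
failed_request_mib = failed_request_bytes / MIB

# Per-process maximum times the number of processes, converted to GB.
total_ram_gb = max_ram_per_process_mb * mpi_processes / 1024

print("threads requested :", total_threads)
print("processes per GPU :", processes_per_gpu)
print("failed request MiB:", round(failed_request_mib, 1))
print("estimated total GB:", round(total_ram_gb, 2))
```

The last value reproduces the "Estimated total dynamical RAM" line of the
log, so the estimate appears to be the per-process maximum multiplied by the
28 processes, all of which share the one 16 GB GPU.
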
I believe I am not "filling the CPUs with OpenMP threads" or running 1 MPI
process per GPU, as suggested in this document.
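
As I understand that advice, it would mean starting one MPI process for the
single GPU and giving that process the CPU cores as OpenMP threads. Below is
a minimal Python sketch of that arithmetic. It assumes 14 physical cores per
Xeon Gold 5120 socket (28 in total, which I did not state above) and a plain
mpirun launch. The helper build_launch is hypothetical and only assembles
the command string; it does not run anything. OMP_NUM_THREADS, mpirun -np
and pw.x -i are the standard ways to set these.

```python
import shlex


def build_launch(physical_cores, n_gpus, input_file, executable="pw.x"):
    """Return (env, argv) for one MPI process per GPU, with the physical
    cores divided evenly among the processes as OpenMP threads."""
    if n_gpus < 1 or physical_cores < n_gpus:
        raise ValueError("need at least one GPU and one core per MPI process")
    threads = physical_cores // n_gpus
    env = {"OMP_NUM_THREADS": str(threads)}
    argv = ["mpirun", "-np", str(n_gpus), executable, "-i", input_file]
    return env, argv


# Assumed hardware: 2 sockets x 14 cores and one Quadro P5000.
env, argv = build_launch(physical_cores=28, n_gpus=1, input_file="001.in")

# Render as a single shell line: environment assignment, then the command.
env_part = " ".join(f"{key}={value}" for key, value in env.items())
command = env_part + " " + " ".join(shlex.quote(arg) for arg in argv)
print(command)
```

For this machine the sketch yields 1 MPI process with 28 threads, instead of
the 28 processes with 28 threads each shown in the logs. Whether 28 threads
is the best choice is something I have not tested; a smaller thread count may
well perform better.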

Can someone please give some suggestions? Sorry for the long post; I am
totally new to this field. Any help would be appreciated. Thanks in advance.

*M.Sc. Chemistry, IIT Roorkee, India*
