[QE-users] [QE-GPU] How to "Fill the CPU with OpenMP threads" to run QE-GPU
Anson Thomas
thomasanson53 at gmail.com
Fri Oct 8 15:17:55 CEST 2021
I am trying to run GPU-enabled QE (QE 6.8) on Ubuntu 18.04.5 LTS
(GNU/Linux 4.15.0-135-generic x86_64). System configuration: Processor:
2 x Intel Xeon Gold 5120 CPU @ 2.20 GHz; RAM: 96 GB; HDD: 6 TB;
Graphics card: NVIDIA Quadro P5000 (16 GB).
I am able to run small jobs successfully (estimated dynamical RAM ~1 GB).
However, when going to larger systems (still less than 16 GB), the output
abruptly stops during the first iteration (attached below).
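For context, a rough sketch of how I launch the failing job (not my exact
script; 28 matches the MPI process count reported in the output header below):

  mpirun -np 28 pw.x -inp 001.in > 001.out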
Program PWSCF v.6.8 starts on 8Oct2021 at 10:33:9
This program is part of the open-source Quantum ESPRESSO suite
for quantum simulation of materials; please cite
"P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
"P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
"P. Giannozzi et al., J. Chem. Phys. 152 154105 (2020);
URL http://www.quantum-espresso.org",
in publications or presentations arising from this work. More details
at
http://www.quantum-espresso.org/quote
Parallel version (MPI & OpenMP), running on 784 processor cores
Number of MPI processes: 28
Threads/MPI process: 28
MPI processes distributed on 1 nodes
R & G space division: proc/nbgrp/npool/nimage = 28
43440 MiB available memory on the printing compute node when the environment starts
Reading input from 001.in
Warning: card &CELL ignored
Warning: card / ignored
Current dimensions of program PWSCF are:
Max number of different atomic species (ntypx) = 10
Max number of k-points (npk) = 40000
Max angular momentum in pseudopotentials (lmaxx) = 4
file Ti.pbe-spn-rrkjus_psl.1.0.0.upf: wavefunction(s) 3S 3D
renormalized
gamma-point specific algorithms are used
Found symmetry operation: I + ( -0.0000 -0.5000 0.0000)
This is a supercell, fractional translations are disabled
Subspace diagonalization in iterative solution of the eigenvalue problem:
a serial algorithm will be used
Parallelization info
--------------------
sticks: dense smooth PW G-vecs: dense smooth PW
Min 637 232 57 81572 18102 2258
Max 640 234 60 81588 18118 2266
Sum 17865 6549 1633 2284245 507201 63345
Using Slab Decomposition
bravais-lattice index = 14
lattice parameter (alat) = 21.0379 a.u.
unit-cell volume = 9204.2807 (a.u.)^3
number of atoms/cell = 36
number of atomic types = 2
number of electrons = 288.00
number of Kohn-Sham states= 173
kinetic-energy cutoff = 55.0000 Ry
charge density cutoff = 600.0000 Ry
scf convergence threshold = 1.0E-06
mixing beta = 0.4000
number of iterations used = 8 local-TF mixing
energy convergence thresh.= 1.0E-04
force convergence thresh. = 1.0E-03
Exchange-correlation= PBE
( 1 4 3 4 0 0 0)
nstep = 500
GPU acceleration is ACTIVE.
Message from routine print_cuda_info:
High GPU oversubscription detected. Are you sure this is what you want?
GPU used by master process:
Device Number: 0
Device name: Quadro P5000
Compute capability : 61
Ratio of single to double precision performance : 32
Memory Clock Rate (KHz): 4513000
Memory Bus Width (bits): 256
Peak Memory Bandwidth (GB/s): 288.83
celldm(1)= 21.037943 celldm(2)= 1.000000 celldm(3)= 2.419041
celldm(4)= -0.766650 celldm(5)= -0.766650 celldm(6)= 0.533303
crystal axes: (cart. coord. in units of alat)
a(1) = ( 1.000000 0.000000 0.000000 )
a(2) = ( 0.533303 0.845924 0.000000 )
a(3) = ( -1.854558 -1.023161 1.168553 )
reciprocal axes: (cart. coord. in units 2 pi/alat)
b(1) = ( 1.000000 -0.630438 1.035056 )
b(2) = ( -0.000000 1.182139 1.035056 )
b(3) = ( 0.000000 0.000000 0.855759 )
PseudoPot. # 1 for Ti read from file:
../Ti.pbe-spn-rrkjus_psl.1.0.0.upf
MD5 check sum: e281089c08e14b8efcf92e44a67ada65
Pseudo is Ultrasoft + core correction, Zval = 12.0
Generated using "atomic" code by A. Dal Corso v.6.2.2
Using radial grid of 1177 points, 6 beta functions with:
l(1) = 0
l(2) = 0
l(3) = 1
l(4) = 1
l(5) = 2
l(6) = 2
Q(r) pseudized with 0 coefficients
PseudoPot. # 2 for O read from file:
../O.pbe-n-rrkjus_psl.1.0.0.upf
MD5 check sum: 91400c9766925bcf19f520983a725ff0
Pseudo is Ultrasoft + core correction, Zval = 6.0
Generated using "atomic" code by A. Dal Corso v.6.3MaX
Using radial grid of 1095 points, 4 beta functions with:
l(1) = 0
l(2) = 0
l(3) = 1
l(4) = 1
Q(r) pseudized with 0 coefficients
atomic species valence mass pseudopotential
Ti 12.00 47.86700 Ti( 1.00)
O 6.00 15.99940 O ( 1.00)
Starting magnetic structure
atomic species magnetization
Ti 0.200
O 0.000
No symmetry found
s frac. trans.
isym = 1 identity
cryst. s( 1) = ( 1 0 0 )
( 0 1 0 )
( 0 0 1 )
cart. s( 1) = ( 1.0000000 0.0000000 0.0000000 )
( 0.0000000 1.0000000 0.0000000 )
( 0.0000000 0.0000000 1.0000000 )
point group C_1 (1)
there are 1 classes
the character table:
E
A 1.00
the symmetry operations in each class and the name of the first element:
E 1
identity
Cartesian axes
site n. atom positions (alat units)
1 O tau( 1) = ( -0.8353365 -0.5987815 0.7050395 )
2 Ti tau( 2) = ( -0.6772809 -0.5115821 0.7050395 )
3 O tau( 3) = ( -0.5192254 -0.4243827 0.7050395 )
4 Ti tau( 4) = ( -0.9272815 -0.5115821 0.5842738 )
5 O tau( 5) = ( -0.7692260 -0.4243827 0.5842738 )
6 O tau( 6) = ( -0.3186838 -0.1758181 0.5842738 )
7 O tau( 7) = ( -0.4520098 -0.3872999 0.4635080 )
8 Ti tau( 8) = ( -0.2939543 -0.3001004 0.4635080 )
9 O tau( 9) = ( -0.1358987 -0.2129011 0.4635080 )
10 O tau( 10) = ( -0.5686844 -0.1758181 0.7050395 )
11 Ti tau( 11) = ( -0.4106289 -0.0886188 0.7050395 )
12 O tau( 12) = ( -0.2525734 -0.0014194 0.7050395 )
13 Ti tau( 13) = ( -0.6606296 -0.0886188 0.5842738 )
14 O tau( 14) = ( -0.5025740 -0.0014194 0.5842738 )
15 O tau( 15) = ( -0.0520318 0.2471452 0.5842738 )
16 O tau( 16) = ( -0.1853578 0.0356635 0.4635080 )
17 Ti tau( 17) = ( -0.0273023 0.1228629 0.4635080 )
18 O tau( 18) = ( 0.1307533 0.2100623 0.4635080 )
19 O tau( 19) = ( -0.3353351 -0.5987815 0.7050395 )
20 Ti tau( 20) = ( -0.1772797 -0.5115821 0.7050395 )
21 O tau( 21) = ( -0.0192241 -0.4243827 0.7050395 )
22 Ti tau( 22) = ( -0.4272803 -0.5115821 0.5842738 )
23 O tau( 23) = ( -0.2692247 -0.4243827 0.5842738 )
24 O tau( 24) = ( 0.1813175 -0.1758181 0.5842738 )
25 O tau( 25) = ( 0.0479915 -0.3872999 0.4635080 )
26 Ti tau( 26) = ( 0.2060470 -0.3001004 0.4635080 )
27 O tau( 27) = ( 0.3641026 -0.2129011 0.4635080 )
28 O tau( 28) = ( -0.0686832 -0.1758181 0.7050395 )
29 Ti tau( 29) = ( 0.0893724 -0.0886188 0.7050395 )
30 O tau( 30) = ( 0.2474280 -0.0014194 0.7050395 )
31 Ti tau( 31) = ( -0.1606282 -0.0886188 0.5842738 )
32 O tau( 32) = ( -0.0025728 -0.0014194 0.5842738 )
33 O tau( 33) = ( 0.4479695 0.2471452 0.5842738 )
34 O tau( 34) = ( 0.3146435 0.0356635 0.4635080 )
35 Ti tau( 35) = ( 0.4726991 0.1228629 0.4635080 )
36 O tau( 36) = ( 0.6307546 0.2100623 0.4635080 )
Crystallographic axes
site n. atom positions (cryst. coord.)
1 O tau( 1) = ( 0.2719137 0.0219125 0.6033439 )
2 Ti tau( 2) = ( 0.3749954 0.1249943 0.6033439 )
3 O tau( 3) = ( 0.4780771 0.2280761 0.6033439 )
4 Ti tau( 4) = ( -0.0000046 -0.0000050 0.4999975 )
5 O tau( 5) = ( 0.1030772 0.1030768 0.4999975 )
6 O tau( 6) = ( 0.3969147 0.3969146 0.4999975 )
7 O tau( 7) = ( 0.2719156 0.0219145 0.3966511 )
8 Ti tau( 8) = ( 0.3749973 0.1249964 0.3966511 )
9 O tau( 9) = ( 0.4780790 0.2280781 0.3966511 )
10 O tau( 10) = ( 0.2719134 0.5219140 0.6033439 )
11 Ti tau( 11) = ( 0.3749952 0.6249957 0.6033439 )
12 O tau( 12) = ( 0.4780769 0.7280775 0.6033439 )
13 Ti tau( 13) = ( -0.0000048 0.4999964 0.4999975 )
14 O tau( 14) = ( 0.1030769 0.6030781 0.4999975 )
15 O tau( 15) = ( 0.3969145 0.8969160 0.4999975 )
16 O tau( 16) = ( 0.2719153 0.5219160 0.3966511 )
17 Ti tau( 17) = ( 0.3749970 0.6249978 0.3966511 )
18 O tau( 18) = ( 0.4780787 0.7280796 0.3966511 )
19 O tau( 19) = ( 0.7719150 0.0219125 0.6033439 )
20 Ti tau( 20) = ( 0.8749966 0.1249943 0.6033439 )
21 O tau( 21) = ( 0.9780784 0.2280761 0.6033439 )
22 Ti tau( 22) = ( 0.4999967 -0.0000050 0.4999975 )
23 O tau( 23) = ( 0.6030784 0.1030768 0.4999975 )
24 O tau( 24) = ( 0.8969160 0.3969146 0.4999975 )
25 O tau( 25) = ( 0.7719169 0.0219145 0.3966511 )
26 Ti tau( 26) = ( 0.8749985 0.1249964 0.3966511 )
27 O tau( 27) = ( 0.9780803 0.2280781 0.3966511 )
28 O tau( 28) = ( 0.7719147 0.5219140 0.6033439 )
29 Ti tau( 29) = ( 0.8749965 0.6249957 0.6033439 )
30 O tau( 30) = ( 0.9780782 0.7280775 0.6033439 )
31 Ti tau( 31) = ( 0.4999965 0.4999964 0.4999975 )
32 O tau( 32) = ( 0.6030782 0.6030781 0.4999975 )
33 O tau( 33) = ( 0.8969158 0.8969160 0.4999975 )
34 O tau( 34) = ( 0.7719166 0.5219160 0.3966511 )
35 Ti tau( 35) = ( 0.8749983 0.6249978 0.3966511 )
36 O tau( 36) = ( 0.9780801 0.7280796 0.3966511 )
number of k points= 1 Gaussian smearing, width (Ry)= 0.0100
cart. coord. in units 2pi/alat
k( 1) = ( 0.0000000 0.0000000 0.0000000), wk = 1.0000000
cryst. coord.
k( 1) = ( 0.0000000 0.0000000 0.0000000), wk = 1.0000000
Dense grid: 1142123 G-vectors FFT dimensions: ( 180, 180, 400)
Smooth grid: 253601 G-vectors FFT dimensions: ( 100, 100, 243)
Dynamical RAM for wfc: 2.99 MB
Dynamical RAM for wfc (w. buffer): 2.99 MB
Dynamical RAM for str. fact: 1.24 MB
Dynamical RAM for local pot: 0.00 MB
Dynamical RAM for nlocal pot: 7.05 MB
Dynamical RAM for qrad: 3.93 MB
Dynamical RAM for rho,v,vnew: 25.98 MB
Dynamical RAM for rhoin: 8.66 MB
Dynamical RAM for G-vectors: 2.40 MB
Dynamical RAM for h,s,v(r/c): 2.74 MB
Dynamical RAM for <psi|beta>: 0.54 MB
Dynamical RAM for psi: 5.98 MB
Dynamical RAM for hpsi: 5.98 MB
Dynamical RAM for spsi: 5.98 MB
Dynamical RAM for wfcinit/wfcrot: 8.53 MB
Dynamical RAM for addusdens: 131.34 MB
Dynamical RAM for addusforce: 160.16 MB
Estimated static dynamical RAM per process > 76.37 MB
Estimated max dynamical RAM per process > 236.53 MB
Estimated total dynamical RAM > 6.47 GB
Check: negative core charge= -0.000001
Generating pointlists ...
new r_m : 0.0722 (alat units) 1.5191 (a.u.) for type 1
new r_m : 0.0722 (alat units) 1.5191 (a.u.) for type 2
Initial potential from superposition of free atoms
starting charge 287.98222, renormalised to 288.00000
negative rho (up, down): 9.119E-05 6.477E-05
Starting wfcs are 216 randomized atomic wfcs
total cpu time spent up to now is 14.0 secs
Self-consistent Calculation
[tb_dev] Currently allocated 2.23E+01 Mbytes, locked: 0 / 9
[tb_pin] Currently allocated 0.00E+00 Mbytes, locked: 0 / 0
iteration # 1 ecut= 55.00 Ry beta= 0.40
Davidson diagonalization with overlap
---- Real-time Memory Report at c_bands before calling an iterative solver
980 MiB given to the printing process from OS
0 MiB allocation reported by mallinfo(arena+hblkhd)
32000 MiB available memory on the node where the printing process lives
GPU memory used/free/total (MiB): 11117 / 5152 / 16270
------------------
ethr = 1.00E-02, avg # of iterations = 1.5
The CRASH file generated by this run says:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 24
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 14
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 5
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 7
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 15
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 17
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 10
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 9
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 12
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 4
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 13
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 19
from addusdens_gpu : error # 1
cannot allocate aux2_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running pw.x with -ndiag 1 and -ntg 1 also gave similar output, with the
following additional lines (a sketch of that command line is given after this output):
negative rho (up, down): 9.119E-05 6.477E-05
Starting wfcs are 216 randomized atomic wfcs
total cpu time spent up to now is 11.9 secs
Self-consistent Calculation
[tb_dev] Currently allocated 3.21E+01 Mbytes, locked: 0 / 9
[tb_pin] Currently allocated 0.00E+00 Mbytes, locked: 0 / 0
iteration # 1 ecut= 55.00 Ry beta= 0.40
Davidson diagonalization with overlap
---- Real-time Memory Report at c_bands before calling an iterative solver
1036 MiB given to the printing process from OS
0 MiB allocation reported by mallinfo(arena+hblkhd)
36041 MiB available memory on the node where the printing process lives
GPU memory used/free/total (MiB): 8915 / 7354 / 16270
------------------
ethr = 1.00E-02, avg # of iterations = 1.5
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus
causing the job to be terminated. The first process to do so was:
Process name: [[58344,1],12]
Exit code: 127
--------------------------------------------------------------------------
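For reference, the -ndiag/-ntg run mentioned above was launched roughly like
this (again only a sketch from memory, not the exact script):

  mpirun -np 28 pw.x -ndiag 1 -ntg 1 -inp 001.in > 001.out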
I believe I am not "filling the CPUs with OpenMP threads", or running 1 MPI
process per GPU, as suggested in this document.
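Is something like the following what is meant? (This is only my guess at the
intended launch; 28 here is the total number of physical CPU cores on this
machine, and OMP_NUM_THREADS is the standard OpenMP environment variable.)

  export OMP_NUM_THREADS=28
  mpirun -np 1 pw.x -inp 001.in > 001.out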
Can someone please give some suggestions? Sorry for the long post; I am
totally new to this field. Any help would be appreciated. Thanks in advance.
--
Sent by ANSON THOMAS
M.Sc. Chemistry, IIT Roorkee, India