[Pw_forum] QE-GPU performance
Phanikumar Pentyala
phani12.chem at gmail.com
Sun Dec 10 04:31:59 CET 2017
Dear users and developers,

I am currently using two Tesla K40m cards for my computational work with
Quantum ESPRESSO (QE). My GPU-enabled QE build runs much slower than the
standard CPU version. My question is: will a particular application be fast
only with certain versions of the CUDA toolkit (as mentioned in a previous
post: http://qe-forge.org/pipermail/pw_forum/2015-May/106889.html), or is
there some other reason, such as memory, hindering GPU performance? When I
run the top command on my server, the 'VIRT' column shows unusually large
values (top output pasted in the attached file).

While submitting the job, the following warning was generated: "A
high-performance Open MPI point-to-point messaging module was unable to
find any relevant network interfaces: Module: OpenFabrics (openib) Host:
XXXX Another transport will be used instead, although this may result in
lower performance". Is this MPI warning hindering GPU performance?
(P.S.: We do not have an InfiniBand HCA in the server.)
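One way to silence the openib warning on a host with no InfiniBand hardware is to exclude that BTL explicitly, so Open MPI falls back to shared-memory/TCP transports without complaint. A minimal sketch (the process count and input/output file names are placeholders for your actual run):

```shell
# Exclude the OpenFabrics BTL; with no HCA present, Open MPI uses
# shared-memory/TCP transports anyway, so this only removes the warning.
export OMPI_MCA_btl="^openib"
mpirun -np 8 pw-gpu.x -inp scf.in > scf.out

# Equivalent one-off form on the mpirun command line:
# mpirun --mca btl ^openib -np 8 pw-gpu.x -inp scf.in > scf.out
```

Note that this warning concerns only the host-side network transport; on a single node it should not by itself explain a large GPU slowdown.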
Current details of the server (full details attached):
Server: FUJITSU PRIMERGY RX2540 M2
CUDA version: 9.0
NVIDIA driver: 384.90
Open MPI version: 2.0.4 (with Intel MKL libraries)
QE-GPU version: 5.4.0
Thanks in advance
Regards
Phanikumar
-------------- next part --------------
##################################################################################################################################################
SERVER architecture information (from "lscpu" command in terminal)
##################################################################################################################################################
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping: 1
CPU MHz: 1200.000
BogoMIPS: 4788.53
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
##################################################################################################################################################
After running deviceQuery from the CUDA samples, I got the following information about my GPU accelerators
##################################################################################################################################################
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "Tesla K40m"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 MHz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Tesla K40m"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 MHz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 129 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : No
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS
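deviceQuery reports no peer access between the two K40m cards; given their bus IDs (0x02 and 0x81), they likely sit under different CPU sockets, so GPU-to-GPU traffic has to cross the inter-socket link. The PCIe topology can be confirmed with (assuming the driver is recent enough to support the `topo` subcommand):

```shell
# Print the GPU/CPU connection matrix; an entry such as SYS means the
# GPUs hang off different sockets, which rules out peer-to-peer access.
nvidia-smi topo -m
```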
##################################################################################################################################################
GPU status from the 'nvidia-smi' command in the terminal
##################################################################################################################################################
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m Off | 00000000:02:00.0 Off | 0 |
| N/A 42C P0 75W / 235W | 11381MiB / 11439MiB | 83% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40m Off | 00000000:81:00.0 Off | 0 |
| N/A 46C P0 75W / 235W | 11380MiB / 11439MiB | 87% Default |
+-------------------------------+----------------------+----------------------+
##################################################################################################################################################
Output of the 'top' command on my server
##################################################################################################################################################
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20019 xxxxx 20 0 0.158t 426080 152952 R 100.3 0.3 36:29.44 pw-gpu.x
20023 xxxxx 20 0 0.158t 422380 153328 R 100.0 0.3 36:29.42 pw-gpu.x
20025 xxxxx 20 0 0.158t 418256 153376 R 100.0 0.3 36:27.74 pw-gpu.x
20042 xxxxx 20 0 0.158t 416912 153104 R 100.0 0.3 36:24.63 pw-gpu.x
20050 xxxxx 20 0 0.158t 412564 153084 R 100.0 0.3 36:25.68 pw-gpu.x
20064 xxxxx 20 0 0.158t 408012 153100 R 100.0 0.3 36:25.54 pw-gpu.x
20098 xxxxx 20 0 0.158t 398404 153436 R 100.0 0.3 36:27.92 pw-gpu.x
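To see how much GPU memory each of these host processes actually holds (the very large VIRT values largely reflect the virtual address space CUDA reserves for unified addressing, not resident memory), per-process GPU usage can be queried. A sketch, assuming the driver's query interface is available:

```shell
# List PID, process name, and GPU memory used by each compute process.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```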