<div dir="ltr"><div class="gmail_extra"><div style="font-family:times new roman,serif" class="gmail_default">Thank you Rolly for your comments<br><br>Previously I used both intel MKL and MPI. MPI (intel) was not running at all so that I switched to Openmpi. current version of my intel MKL library was "l_mkl_2018.1.163"<br><br>My linux-OS was Ubuntu-16.04 serever, Is OS also create some problem??<br><br>Can you explain Is there any difference between Parallel Studio XE inetel and above intel MKL (above version)??<br></div><div style="font-family:times new roman,serif" class="gmail_default"><br><br>(sorry , since it was so long time using pw-forum so I forgot that, This is my affiliation)<br></div><br><div style="font-family:times new roman,serif" class="gmail_default">Phanikumar<br></div><div style="font-family:times new roman,serif" class="gmail_default">Research scholar<br></div><div style="font-family:times new roman,serif" class="gmail_default">Department of Chemical engineering<br></div><div style="font-family:times new roman,serif" class="gmail_default">Indian Institute of Technology Kharagpur<br></div><div style="font-family:times new roman,serif" class="gmail_default">West Bengal<br></div><div style="font-family:times new roman,serif" class="gmail_default">India<br></div><br><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Message: 4
Date: Sun, 10 Dec 2017 09:01:59 +0530
From: Phanikumar Pentyala <phani12.chem@gmail.com>
Subject: [Pw_forum] QE-GPU performance
To: PWSCF Forum <pw_forum@pwscf.org>
Message-ID: <CAOgLYHHDQWV7JeYe17KBTwGwv4NVyNTJ-6XpqKfkVjXYbj8ELQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear users and developers,

Currently I am using two Tesla K40m cards for my computational work with Quantum ESPRESSO (QE). My GPU-enabled QE code runs much slower than the normal version. My question is whether a particular application is fast only with certain versions of the CUDA toolkit (as mentioned in a previous post: http://qe-forge.org/pipermail/pw_forum/2015-May/106889.html), or is there some other reason hindering the performance (memory) of the GPU? (When I run the top command on my server, the 'VIRT' column shows unusual values; the top output is pasted in the attached file.)

An error is also generated when I submit the job: "A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces: Module: OpenFabrics (openib) Host: XXXX Another transport will be used instead, although this may result in lower performance". Is this MPI issue hindering the GPU performance?

(P.S.: We don't have any InfiniBand HCA adapter in the server.)
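(Since there is no InfiniBand hardware in the box, that particular warning can be silenced by telling Open MPI not to load the openib transport. A minimal sketch; the executable name and input/output file names are placeholders, not from an actual run:)

  # exclude the OpenFabrics (openib) transport; Open MPI falls back to shared memory/TCP
  mpirun --mca btl ^openib -np 8 pw-gpu.x -inp scf.in > scf.out

(This only addresses the warning message itself.)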

Current details of the server are (full details attached):

Server: FUJITSU PRIMERGY RX2540 M2
CUDA version: 9.0
NVIDIA driver: 384.90
Open MPI version: 2.0.4, with Intel MKL libraries
QE-GPU version: 5.4.0


Thanks in advance

Regards
Phanikumar
-------------- next part --------------
####################################################################################################

SERVER architecture information (from "lscpu" command in terminal)

####################################################################################################

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping: 1
CPU MHz: 1200.000
BogoMIPS: 4788.53
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39


####################################################################################################

After running deviceQuery from the CUDA samples, I got this information about my GPU accelerators

####################################################################################################

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K40m"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 129 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : No
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS


####################################################################################################

GPU status from the 'nvidia-smi' command in the terminal

####################################################################################################

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 00000000:02:00.0 Off |                    0 |
| N/A   42C    P0    75W / 235W |  11381MiB / 11439MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 00000000:81:00.0 Off |                    0 |
| N/A   46C    P0    75W / 235W |  11380MiB / 11439MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+


####################################################################################################

Output of the 'top' command on my server

####################################################################################################
  PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
20019 xxxxx  20   0  0.158t 426080 152952 R 100.3  0.3 36:29.44 pw-gpu.x
20023 xxxxx  20   0  0.158t 422380 153328 R 100.0  0.3 36:29.42 pw-gpu.x
20025 xxxxx  20   0  0.158t 418256 153376 R 100.0  0.3 36:27.74 pw-gpu.x
20042 xxxxx  20   0  0.158t 416912 153104 R 100.0  0.3 36:24.63 pw-gpu.x
20050 xxxxx  20   0  0.158t 412564 153084 R 100.0  0.3 36:25.68 pw-gpu.x
20064 xxxxx  20   0  0.158t 408012 153100 R 100.0  0.3 36:25.54 pw-gpu.x
20098 xxxxx  20   0  0.158t 398404 153436 R 100.0  0.3 36:27.92 pw-gpu.x


------------------------------

Message: 5
Date: Sun, 10 Dec 2017 17:07:59 +0800
From: Rolly Ng <rollyng@gmail.com>
Subject: Re: [Pw_forum] QE-GPU performance
To: pw_forum@pwscf.org
Message-ID: <225411b4-1c48-6f24-954f-5d0af115e76f@gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear Phanikumar,

Please include your affiliation when posting to the forum.

In my experience with QE-GPU v5.3.0 and v5.4.0, the working combination of software is:

1) Intel PSXE 2017

2) CUDA 6.5 or 7.0

3) CentOS 7.1

Please try the above combination; a minimal environment sketch follows below.
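(For reference, a shell environment pointing at that combination might look roughly like this; the install paths and script locations are assumptions, not taken from this thread, so adjust them to your own installation:)

  # load the Intel Parallel Studio XE 2017 compilers and MKL into the environment
  # (example path; use wherever PSXE 2017 is installed on your machine)
  source /opt/intel/parallel_studio_xe_2017/psxevars.sh intel64

  # point builds and runs at CUDA 7.0 rather than the newer 9.0 toolkit
  export CUDA_HOME=/usr/local/cuda-7.0
  export PATH=$CUDA_HOME/bin:$PATH
  export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

(After setting this up, reconfigure and rebuild QE-GPU so it picks up the older toolkit.)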

Regards,
Rolly

PhD. Research Fellow,
Dept. of Physics & Materials Science,
City University of Hong Kong
Tel: +852 3442 4000
Fax: +852 3442 0538

On 12/10/2017 11:31 AM, Phanikumar Pentyala wrote:
> [quoted message trimmed; the full text appears as Message 4 above]

------------------------------

_______________________________________________
Pw_forum mailing list
Pw_forum@pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum

End of Pw_forum Digest, Vol 125, Issue 8
****************************************