<div dir="ltr"><div class="gmail_extra"><div style="font-family:times new roman,serif" class="gmail_default">Thank you, Rolly, for your comments.<br><br>Previously I used both Intel MKL and Intel MPI. Intel MPI did not run at all, so I switched to Open MPI. My current Intel MKL package is "l_mkl_2018.1.163".<br><br>My Linux OS is Ubuntu 16.04 Server. Could the OS also be causing a problem?<br><br>Can you explain whether there is any difference between Intel Parallel Studio XE and the standalone Intel MKL version mentioned above?<br></div><div style="font-family:times new roman,serif" class="gmail_default"><br><br>(Sorry, it has been a long time since I last used pw_forum, so I forgot to include it. This is my affiliation.)<br></div><br><div style="font-family:times new roman,serif" class="gmail_default">Phanikumar<br></div><div style="font-family:times new roman,serif" class="gmail_default">Research scholar<br></div><div style="font-family:times new roman,serif" class="gmail_default">Department of Chemical Engineering<br></div><div style="font-family:times new roman,serif" class="gmail_default">Indian Institute of Technology Kharagpur<br></div><div style="font-family:times new roman,serif" class="gmail_default">West Bengal<br></div><div style="font-family:times new roman,serif" class="gmail_default">India<br></div><br><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Message: 4<br>

Date: Sun, 10 Dec 2017 09:01:59 +0530<br>

From: Phanikumar Pentyala <<a href="mailto:phani12.chem@gmail.com">phani12.chem@gmail.com</a>><br>

Subject: [Pw_forum] QE-GPU performance<br>

To: PWSCF Forum <<a href="mailto:pw_forum@pwscf.org">pw_forum@pwscf.org</a>><br>

Message-ID:<br>

        <<a href="mailto:CAOgLYHHDQWV7JeYe17KBTwGwv4NVyNTJ-6XpqKfkVjXYbj8ELQ@mail.gmail.com">CAOgLYHHDQWV7JeYe17KBTwGwv4NV<wbr>yNTJ-6XpqKfkVjXYbj8ELQ@mail.<wbr>gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Dear users and developers<br>

<br>

Currently I am using two Tesla K40m cards for my computational work with<br>

Quantum ESPRESSO (QE). My GPU-enabled QE build runs much slower than the<br>

normal version. My question is whether a particular application is fast<br>

only with some versions of the CUDA toolkit (as mentioned in a previous post:<br>

<a href="http://qe-forge.org/pipermail/pw_forum/2015-May/106889.html" rel="noreferrer" target="_blank">http://qe-forge.org/pipermail/<wbr>pw_forum/2015-May/106889.html</a>), or is there<br>

some other reason, such as memory, hindering GPU performance? (When I run the<br>

top command on my server, the 'VIRT' column shows different values; the top<br>

output is pasted in the attached file.)<br>

<br>

The following message appeared when I submitted the job: "A high-performance<br>

Open MPI point-to-point messaging module was unable to find any relevant<br>

network interfaces: Module: OpenFabrics (openib)  Host: XXXX Another<br>

transport will be used instead, although this may result in lower<br>

performance".  Is this MPI warning hindering GPU performance?<br>

<br>

(P.S.: We don't have any InfiniBand adapter (HCA) in the server.)<br>
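<br>
The quoted message says that Open MPI will fall back to another transport, so on a single server with no InfiniBand adapter it should not by itself explain a slow run. One way to stop Open MPI from probing for the adapter is the MCA setting "btl ^openib". The sketch below only assembles such a command line in Python; the rank count (2, one per K40m) and the input file name "pw.in" are assumptions for illustration, not values from this thread.<br>
<pre>
```python
import shlex


def build_mpirun_command(ranks, executable, input_file):
    """Build an mpirun command that skips the openib transport.

    "--mca btl ^openib" asks Open MPI not to load the OpenFabrics BTL,
    so it does not look for an InfiniBand adapter that is not installed.
    """
    return [
        "mpirun",
        "-np", str(ranks),
        "--mca", "btl", "^openib",
        executable,
        "-input", input_file,
    ]


if __name__ == "__main__":
    # Hypothetical values: 2 ranks and an input file called "pw.in".
    cmd = build_mpirun_command(2, "pw-gpu.x", "pw.in")
    print(" ".join(shlex.quote(part) for part in cmd))
```
</pre>
The printed line can be pasted into a shell or job script; the number of ranks and any OpenMP settings still have to be chosen for the actual machine.<br>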

<br>

<br>

Current details of server are (full details attached):<br>

<br>

Server: FUJITSU PRIMERGY RX2540 M2<br>

CUDA version: 9.0<br>

NVIDIA driver: 384.90<br>

Open MPI version: 2.0.4 with Intel MKL libraries<br>

QE-gpu version : 5.4.0<br>

<br>

<br>

Thanks in advance<br>

<br>

Regards<br>

Phanikumar<br>


-------------- next part --------------<br>

##############################<wbr>##############################<wbr>##############################<wbr>##############################<wbr>##########################<br>

<br>

SERVER architecture information (from "lscpu" command in terminal)<br>

<br>

##############################<wbr>##############################<wbr>##############################<wbr>##############################<wbr>##########################<br>

<br>

Architecture:          x86_64<br>

CPU op-mode(s):        32-bit, 64-bit<br>

Byte Order:            Little Endian<br>

CPU(s):                40<br>

On-line CPU(s) list:   0-39<br>

Thread(s) per core:    2<br>

Core(s) per socket:    10<br>

Socket(s):             2<br>

NUMA node(s):          2<br>

Vendor ID:             GenuineIntel<br>

CPU family:            6<br>

Model:                 79<br>

Model name:            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz<br>

Stepping:              1<br>

CPU MHz:               1200.000<br>

BogoMIPS:              4788.53<br>

Virtualization:        VT-x<br>

L1d cache:             32K<br>

L1i cache:             32K<br>

L2 cache:              256K<br>

L3 cache:              25600K<br>

NUMA node0 CPU(s):     0-9,20-29<br>

NUMA node1 CPU(s):     10-19,30-39<br>
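<br>
Reading the lscpu fields above: 2 sockets with 10 cores each give 20 physical cores, and 2 threads per core give the 40 logical CPUs listed. A minimal Python sketch of that arithmetic, assuming the relevant lines are pasted in as text (the function names are illustrative and not part of any tool):<br>
<pre>
```python
def parse_lscpu(text):
    """Turn 'Key:   value' lines from lscpu into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def core_counts(fields):
    """Return (physical cores, logical CPUs) from parsed lscpu fields."""
    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    threads_per_core = int(fields["Thread(s) per core"])
    physical = sockets * cores_per_socket
    logical = physical * threads_per_core
    return physical, logical


# Excerpt copied from the lscpu output in this message.
LSCPU_EXCERPT = """\
CPU(s):                40
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
"""

if __name__ == "__main__":
    physical, logical = core_counts(parse_lscpu(LSCPU_EXCERPT))
    print(f"physical cores: {physical}, logical CPUs: {logical}")
```
</pre>
The physical core count is often a more useful upper bound than the logical CPU count when deciding how many MPI processes and OpenMP threads to launch; the best split for QE-GPU on this server is not established in this thread.<br>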

<br>

<br>

##############################<wbr>##############################<wbr>##############################<wbr>##############################<wbr>##########################<br>

<br>

After running deviceQuery from the CUDA samples, I got this information about my GPU accelerators<br>

<br>

##############################<wbr>##############################<wbr>##############################<wbr>##############################<wbr>##########################<br>

<br>

 CUDA Device Query (Runtime API) version (CUDART static linking)<br>

<br>

Detected 2 CUDA Capable device(s)<br>

<br>

Device 0: "Tesla K40m"<br>

  CUDA Driver Version / Runtime Version          9.0 / 9.0<br>

  CUDA Capability Major/Minor version number:    3.5<br>

  Total amount of global memory:                 11440 MBytes (11995578368 bytes)<br>

  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores<br>

  GPU Max Clock rate:                            745 MHz (0.75 GHz)<br>

  Memory Clock rate:                             3004 Mhz<br>

  Memory Bus Width:                              384-bit<br>

  L2 Cache Size:                                 1572864 bytes<br>

  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)<br>

  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers<br>

  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers<br>

  Total amount of constant memory:               65536 bytes<br>

  Total amount of shared memory per block:       49152 bytes<br>

  Total number of registers available per block: 65536<br>

  Warp size:                                     32<br>

  Maximum number of threads per multiprocessor:  2048<br>

  Maximum number of threads per block:           1024<br>

  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)<br>

  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)<br>

  Maximum memory pitch:                          2147483647 bytes<br>

  Texture alignment:                             512 bytes<br>

  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)<br>

  Run time limit on kernels:                     No<br>

  Integrated GPU sharing Host Memory:            No<br>

  Support host page-locked memory mapping:       Yes<br>

  Alignment requirement for Surfaces:            Yes<br>

  Device has ECC support:                        Enabled<br>

  Device supports Unified Addressing (UVA):      Yes<br>

  Supports Cooperative Kernel Launch:            No<br>

  Supports MultiDevice Co-op Kernel Launch:      No<br>

  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0<br>

  Compute Mode:<br>

     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) ><br>

<br>

Device 1: "Tesla K40m"<br>

  CUDA Driver Version / Runtime Version          9.0 / 9.0<br>

  CUDA Capability Major/Minor version number:    3.5<br>

  Total amount of global memory:                 11440 MBytes (11995578368 bytes)<br>

  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores<br>

  GPU Max Clock rate:                            745 MHz (0.75 GHz)<br>

  Memory Clock rate:                             3004 Mhz<br>

  Memory Bus Width:                              384-bit<br>

  L2 Cache Size:                                 1572864 bytes<br>

  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)<br>

  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers<br>

  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers<br>

  Total amount of constant memory:               65536 bytes<br>

  Total amount of shared memory per block:       49152 bytes<br>

  Total number of registers available per block: 65536<br>

  Warp size:                                     32<br>

  Maximum number of threads per multiprocessor:  2048<br>

  Maximum number of threads per block:           1024<br>

  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)<br>

  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)<br>

  Maximum memory pitch:                          2147483647 bytes<br>

  Texture alignment:                             512 bytes<br>

  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)<br>

  Run time limit on kernels:                     No<br>

  Integrated GPU sharing Host Memory:            No<br>

  Support host page-locked memory mapping:       Yes<br>

  Alignment requirement for Surfaces:            Yes<br>

  Device has ECC support:                        Enabled<br>

  Device supports Unified Addressing (UVA):      Yes<br>

  Supports Cooperative Kernel Launch:            No<br>

  Supports MultiDevice Co-op Kernel Launch:      No<br>

  Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0<br>

  Compute Mode:<br>

     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) ><br>

> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : No<br>

> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : No<br>

<br>

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2<br>

Result = PASS<br>

<br>

<br>

##############################<wbr>##############################<wbr>##############################<wbr>##############################<wbr>##########################<br>

<br>

GPU status from the 'nvidia-smi' command in the terminal<br>

<br>

##############################<wbr>##############################<wbr>##############################<wbr>##############################<wbr>##########################<br>

<br>

+-----------------------------<wbr>------------------------------<wbr>------------------+<br>

| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |<br>

|-----------------------------<wbr>--+----------------------+----<wbr>------------------+<br>

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |<br>

| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |<br>

|=============================<wbr>==+======================+====<wbr>==================|<br>

|   0  Tesla K40m          Off  | 00000000:02:00.0 Off |                    0 |<br>

| N/A   42C    P0    75W / 235W |  11381MiB / 11439MiB |     83%      Default |<br>

+-----------------------------<wbr>--+----------------------+----<wbr>------------------+<br>

|   1  Tesla K40m          Off  | 00000000:81:00.0 Off |                    0 |<br>

| N/A   46C    P0    75W / 235W |  11380MiB / 11439MiB |     87%      Default |<br>

+-----------------------------<wbr>--+----------------------+----<wbr>------------------+<br>
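<br>
The table above reports 11381 MiB and 11380 MiB in use out of 11439 MiB on the two cards, so both are almost full. A small Python sketch, assuming the two data rows are pasted in as text, that turns the Memory-Usage column into percentages (the function name is illustrative):<br>
<pre>
```python
import re

# Matches the "used MiB / total MiB" pair in an nvidia-smi table row.
MEMORY_PATTERN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def memory_usage_percent(smi_text):
    """Return one used-memory percentage per GPU row found in the text."""
    return [
        round(100.0 * int(used) / int(total), 1)
        for used, total in MEMORY_PATTERN.findall(smi_text)
    ]


# Rows copied from the nvidia-smi output in this message.
SMI_EXCERPT = """\
| N/A   42C    P0    75W / 235W |  11381MiB / 11439MiB |     83%      Default |
| N/A   46C    P0    75W / 235W |  11380MiB / 11439MiB |     87%      Default |
"""

if __name__ == "__main__":
    for index, percent in enumerate(memory_usage_percent(SMI_EXCERPT)):
        print(f"GPU {index}: {percent}% of device memory in use")
```
</pre>
Both cards come out at about 99.5% of device memory in use. Whether that is expected for this job cannot be decided from this output alone.<br>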

<br>

<br>

##############################<wbr>##############################<wbr>##############################<wbr>##############################<wbr>##########################<br>

<br>

Output of the top command on my server<br>

<br>

##############################<wbr>##############################<wbr>##############################<wbr>##############################<wbr>##########################<br>

PID   USER      PR  NI   VIRT    RES   SHR   S %CPU  %MEM     TIME+ COMMAND<br>

20019 xxxxx     20   0  0.158t 426080 152952 R 100.3  0.3  36:29.44 pw-gpu.x<br>

20023 xxxxx     20   0  0.158t 422380 153328 R 100.0  0.3  36:29.42 pw-gpu.x<br>

20025 xxxxx     20   0  0.158t 418256 153376 R 100.0  0.3  36:27.74 pw-gpu.x<br>

20042 xxxxx     20   0  0.158t 416912 153104 R 100.0  0.3  36:24.63 pw-gpu.x<br>

20050 xxxxx     20   0  0.158t 412564 153084 R 100.0  0.3  36:25.68 pw-gpu.x<br>

20064 xxxxx     20   0  0.158t 408012 153100 R 100.0  0.3  36:25.54 pw-gpu.x<br>

20098 xxxxx     20   0  0.158t 398404 153436 R 100.0  0.3  36:27.92 pw-gpu.x<br>

<br>

<br>

------------------------------<br>

<br>

Message: 5<br>

Date: Sun, 10 Dec 2017 17:07:59 +0800<br>

From: Rolly Ng <<a href="mailto:rollyng@gmail.com">rollyng@gmail.com</a>><br>

Subject: Re: [Pw_forum] QE-GPU performance<br>

To: <a href="mailto:pw_forum@pwscf.org">pw_forum@pwscf.org</a><br>

Message-ID: <<a href="mailto:225411b4-1c48-6f24-954f-5d0af115e76f@gmail.com">225411b4-1c48-6f24-954f-<wbr>5d0af115e76f@gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Dear Phanikumar,<br>

<br>

Please include your affiliation when posting to the forum.<br>

<br>

In my experience with QE-GPU v5.3.0 and v5.4.0, the working combination<br>

of software is,<br>

<br>

1) Intel PSXE 2017<br>

<br>

2) CUDA 6.5 or 7.0<br>

<br>

3) CentOS 7.1<br>

<br>

Please try the above combination.<br>

<br>

Regards,<br>

Rolly<br>

<br>

PhD. Research Fellow,<br>

Dept. of Physics & Materials Science,<br>

City University of Hong Kong<br>

Tel: +852 3442 4000<br>

Fax: +852 3442 0538<br>

<br>


<br>


<br>

------------------------------<br>

<br>

______________________________<wbr>_________________<br>

Pw_forum mailing list<br>

<a href="mailto:Pw_forum@pwscf.org">Pw_forum@pwscf.org</a><br>

<a href="http://pwscf.org/mailman/listinfo/pw_forum" rel="noreferrer" target="_blank">http://pwscf.org/mailman/<wbr>listinfo/pw_forum</a><br>

<br>

End of Pw_forum Digest, Vol 125, Issue 8<br>

******************************<wbr>**********<br>

</blockquote></div><br></div></div>