[Q-e-developers] Scalability of CP on BGQ (FERMI)

Tue Jul 31 06:54:58 CEST 2012

Dear Carlo,

On Jul 30, 2012, at 6:14 PM, Carlo Cavazzoni wrote:
> Number of  | Number of     | sec/      | OpenMP  | command line
> real cores | virtual cores | iteration | threads | parameters
>  4096      |  8192         |   231     |    4    | -nbgrp 2 -ntg 4 -ndiag 256
>  8192      | 16384         |   160     |    4    | -nbgrp 4 -ntg 4 -ndiag 1024
> 16384      | 32768         |   131     |    4    | -nbgrp 4 -ntg 4 -ndiag 1024
> 32768      | 65536         |    86     |    4    | -nbgrp 8 -ntg 4 -ndiag 2048

While benchmarking GPU PWscf on medium/large systems (>500 atoms), I found several spots in the PW code where adding OpenMP would improve the performance of those sections by a factor of at least 2. I haven't committed anything yet. However, it would be interesting to evaluate the OpenMP efficiency/scalability. I see you ran your tests with 4 OpenMP threads (I assume 8 MPI tasks per A2 chip, 2 OpenMP threads per physical core, and 2 GByte of RAM per task, correct?). What about 8 OpenMP threads? Or 16? Is it worth going beyond 4 OpenMP threads?
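
As a rough way to read the timings in the quoted table, here is a minimal Python sketch that computes speedup and parallel efficiency relative to the 4096-core run, and checks the task layout I am assuming. The function names and the node parameters (16 compute cores and 16 GB of RAM per BG/Q node) are my own assumptions for illustration; they are not stated in the table. Note also that the runs use different -nbgrp and -ndiag values, so this is not a pure strong-scaling comparison.

```python
# Timings copied from the quoted table: (real cores, seconds per iteration).
RUNS = [(4096, 231.0), (8192, 160.0), (16384, 131.0), (32768, 86.0)]

# Assumed node parameters (not given in the table): 16 compute cores
# and 16 GB of RAM per BG/Q node.
CORES_PER_NODE = 16
GB_PER_NODE = 16


def strong_scaling(runs):
    """Return (cores, speedup, efficiency) tuples relative to the first run."""
    base_cores, base_time = runs[0]
    rows = []
    for cores, seconds in runs:
        speedup = base_time / seconds      # measured gain over the baseline
        ideal = cores / base_cores         # gain expected from perfect scaling
        rows.append((cores, speedup, speedup / ideal))
    return rows


def node_layout(mpi_per_node, omp_threads):
    """Return (threads per physical core, GB of RAM per MPI task)."""
    threads_per_core = mpi_per_node * omp_threads / CORES_PER_NODE
    gb_per_task = GB_PER_NODE / mpi_per_node
    return threads_per_core, gb_per_task


if __name__ == "__main__":
    for cores, speedup, eff in strong_scaling(RUNS):
        print(f"{cores:6d} cores  speedup {speedup:4.2f}  efficiency {eff:5.1%}")
    # 8 MPI tasks per node with 4 OpenMP threads each
    print(node_layout(8, 4))
```

With these numbers the 32768-core run is about 2.7 times faster than the 4096-core run, which is roughly a third of the ideal 8x. Under the assumed node parameters, 8 MPI tasks with 4 threads each gives 2 threads per physical core and 2 GB per task, which matches the layout I guessed above.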

> (Volunteers are welcome too!)

I am more than happy to help (-:

F.

-- 
Mr. Filippo SPIGA (穗安駒), HPC and GPU Technologist <spiga.filippo_at_gmail.com>
website: http://filippospiga.me  ~  skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
