[Q-e-developers] Scalability of CP on BGQ (FERMI)

Carlo Cavazzoni c.cavazzoni at cineca.it
Tue Jul 31 09:22:33 CEST 2012


Dear Filippo,
dear all,

on BGQ, in order to obtain the maximum performance
out of the processors, one has to overload them with
2 or 4 threads or tasks per core.
In the test I sent yesterday the number of tasks per node
was 8, so the number of threads per node turns out to be 32
(8 tasks x 4 threads, running on 16 physical cores).
Now I have verified that a slightly better performance can be obtained
using 8 tasks and 8 threads per task, for a total of 64 threads per node
(again running on 16 physical cores).
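
As a concrete sketch, such a layout is selected at launch time with the
standard BG/Q runjob launcher (the executable path and input file name
here are only illustrative):

      # 8 MPI tasks per node, 8 OpenMP threads per task:
      # 64 hardware threads on the 16 physical cores of each node
      runjob --ranks-per-node 8 --envs OMP_NUM_THREADS=8 \
             : ./cp.x -input cp.in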
Moreover, it seems that on this machine the band parallelism is really
effective
(this is what I was guessing when I implemented it, but I wasn't
sure...).
I report some new figures from a new test with a higher degree of
overloading (4) and more band groups:
Number of  | Number of     | sec/      | OpenMP  | command line
real cores | virtual cores | iteration | threads | parameters
32768      | 131072        |    65     |    8    | -nbgrp 16 -ntg 4 -ndiag 512
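
The full launch for this run would look roughly as follows (a sketch:
32768 real cores = 2048 nodes, with 8 ranks per node giving the 16384
MPI processes reported below; the input file name is illustrative):

      runjob --np 16384 --ranks-per-node 8 --envs OMP_NUM_THREADS=8 \
             : ./cp.x -nbgrp 16 -ntg 4 -ndiag 512 -input cp.in

With -nbgrp 16 the 16384 MPI tasks are divided into 16 band groups of
1024 tasks each.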

Without changing the number of physical cores, we got a 25%
improvement in the code performance (from 86 down to 65 seconds per
iteration), using twice as many virtual cores and twice as many band groups.

Note that we hit the limit in the number of digits reserved for the
processor count in the output (the overflowing fields are printed as
asterisks):
      Parallel version (MPI & OpenMP), running on ***** processor cores
      Number of MPI processes:           16384
      Threads/MPI process:                  8
      band groups division:  nbgrp     =   16
      R & G space division:  proc/pool = ****
      wavefunctions fft division:  fft/group =    4

Finally, consider that the total number of virtual cores on FERMI
is 655360 (10240 nodes x 16 cores x 4 hardware threads per core).


best,
carlo






On 31/07/2012 at 06:54, Filippo Spiga wrote:
> Dear Carlo,
>
> On Jul 30, 2012, at 6:14 PM, Carlo Cavazzoni wrote:
>> Number of  | Number of     | sec/      | OpenMP  | command line
>> real cores | virtual cores | iteration | threads | parameters
>>  4096      |  8192         |   231     |    4    | -nbgrp 2 -ntg 4 -ndiag 256
>>  8192      | 16384         |   160     |    4    | -nbgrp 4 -ntg 4 -ndiag 1024
>> 16384      | 32768         |   131     |    4    | -nbgrp 4 -ntg 4 -ndiag 1024
>> 32768      | 65536         |    86     |    4    | -nbgrp 8 -ntg 4 -ndiag 2048
>
> benchmarking GPU PWscf on medium/big systems (>500 atoms), I found
> several spots in the PW code where adding OpenMP will improve the
> performance (of those sections) by a factor of (at least) 2. I haven't
> committed anything yet. However, it is interesting to evaluate the
> OpenMP efficiency/scalability. I see you did tests using 4 OpenMP
> threads (I assume 8 MPI tasks per A2 chip, 2 OpenMP threads per
> physical core, and 2 GByte of RAM per task, correct?). What about 8
> OpenMP threads? Or 16? Is it worth going beyond 4 OpenMP threads?
>
>
>> (Volunteers are welcome too!)
>
> I am more than happy to help (-:
>
> F.
>
> -- 
> Mr. Filippo SPIGA (穗安駒), HPC and GPU 
> Technologist <spiga.filippo_at_gmail.com>
> website: http://filippospiga.me  ~  skype: filippo.spiga
>
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>


-- 
Ph.D. Carlo Cavazzoni
SuperComputing Applications and Innovation Department
CINECA - Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)
Tel: +39 051 6171411  Fax: +39 051 6132198
www.cineca.it
