[Pw_forum] I/O performance on BG/P systems

Mon Apr 11 22:50:48 CEST 2011

Dear Paolo and Nichols,

    as a follow-up, I had a brief meeting with the sysadmin of our local BG/P. It looks like the timings I reported actually correspond to the maximum I/O throughput of that specific rack, which depends on the number of I/O nodes on the rack itself (in that case, 4 I/O nodes per midplane, each capable of 350 MB/s, corresponding to 1.4 GB/s for that midplane).
In the example I was reporting:
     davcio       :   1083.30s CPU   1083.30s WALL (      38 calls)

I was running on just 128 nodes (512 cores in VN mode), so I had only one I/O node (1 midplane = 4 x 128 nodes, for the non-BG/P-ers). The total size of the .wfc files was around 9200 MB, which cannot be written in less than 9200/350 = 26.3 s, according to the figures the sysadmins gave me.
In my case the timings give 1083.30 s / 38 = 28.5 s per call, which is close to that lower bound, i.e. to the maximum attainable throughput.
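
The estimate above can be reproduced with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the linear scaling with the number of I/O nodes is an assumption that still has to be verified on the machine.

```python
def min_write_time(size_mb, n_io_nodes=1, mb_per_s_per_io_node=350.0):
    """Lower bound (in seconds) on the time needed to write size_mb.

    Assumes the aggregate bandwidth is n_io_nodes * mb_per_s_per_io_node,
    i.e. ideal linear scaling (an assumption, not a measurement).
    """
    return size_mb / (n_io_nodes * mb_per_s_per_io_node)


wfc_size_mb = 9200.0          # total size of the .wfc files
measured = 1083.30 / 38       # davcio wall time divided by number of calls

bound = min_write_time(wfc_size_mb)   # one I/O node for 128 compute nodes
print(f"lower bound      : {bound:.1f} s")
print(f"measured per call: {measured:.1f} s")
print(f"fraction of peak : {bound / measured:.2f}")
```

Calling min_write_time(wfc_size_mb, n_io_nodes=4) gives the corresponding bound for a partition served by 4 I/O nodes, under the same linear-scaling assumption.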
I will do more testing and will take into account Nichols' suggestion about the number of files per node. On our machine we have one rack with 16 I/O nodes per midplane, so I will check whether the I/O performance scales accordingly.

On a side note, I ran into a problem with the timing procedure. I found very different davcio timings (3 orders of magnitude apart!) for two jobs whose wavefunction sizes differed only by a factor of 2 (the jobs were run on the same rack, with the same number of processors and the same parallelization scheme).
The sysadmins replied that the I/O bandwidth measured in the fastest case is not attainable on BG/P, and that it must be attributed to an inaccurate measurement of cputime/walltime.
I'm going to investigate this anyway.
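
A simple way to flag suspicious timings is to compare the bandwidth they imply with the hardware peak. The sketch below does that; the function names are hypothetical, and the second test case uses a made-up size and time purely for illustration (it is not one of the measured values).

```python
PEAK_MB_PER_S = 350.0  # per I/O node, figure quoted by the sysadmins


def implied_bandwidth(size_mb, wall_s):
    """Bandwidth (MB/s) implied by writing size_mb in wall_s seconds."""
    return size_mb / wall_s


def is_plausible(size_mb, wall_s, n_io_nodes=1, peak=PEAK_MB_PER_S):
    """True if the implied bandwidth does not exceed the aggregate peak."""
    return implied_bandwidth(size_mb, wall_s) <= n_io_nodes * peak


# Timing reported in this thread: 9200 MB in 1083.30/38 s per call.
print(is_plausible(9200.0, 1083.30 / 38))

# HYPOTHETICAL values, only to show what an unphysical timing looks like.
print(is_plausible(4600.0, 0.03))
```

A timing that fails this check points to a problem in the clock measurement (or to data that never reached the disk, e.g. buffering), rather than to genuinely faster I/O.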

I'm not aware of anyone working on MPI I/O porting.

Thanks so far for your suggestions,

Gabriele

On 11 Apr 2011, at 20:02, Nichols A. Romero wrote:

> Sorry for not replying earlier, but I missed this e-mail due to the
> APS March Meeting.
> 
> The GPFS file system on BG/P does a poor job at handling writes to more than
> one file per node. My guess is that Gabriele was running QE in either dual
> or VN mode (2 and 4 MPI tasks per node, respectively). So on BG/P, you
> basically want to write one file per node (which GPFS is designed to handle)
> or one big file using MPI-I/O.
> 
> At ANL, we are thinking about re-writing some of the I/O using parallel I/O
> (e.g. HDF5, Parallel NetCDF). The simplest approach, though highly
> unportable, is to use the MPI I/O directly.
> 
> Has anyone on this list worked on parallel I/O with QE? Or have any strong
> opinions on this issue?
> 
> 
> On Wed, Mar 30, 2011 at 11:57 AM, Paolo Giannozzi
> <giannozz at democritos.it> wrote:
>> 
>> On Mar 30, 2011, at 11:20 , Gabriele Sclauzero wrote:
>> 
>>> Do you think that having an additional optional level of I/O
>>> (let's say that it might be called "medium")
>> 
>> I propose 'rare', 'medium', 'well done'
>> 
>>> would be too confusing for users?
>> 
>> some users get confused no matter what
>> 
>>> I could try to implement and test it.
>> 
>> ok: just follow the "io_level" variable. Try first to understand
>> what the actual behavior is (the documentation is not so
>> clear on this point) and then think what it should be, if you
>> have some clear ideas
>> 
>> P.
>> ---
>> Paolo Giannozzi, Dept of Chemistry&Physics&Environment,
>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>> Phone +39-0432-558216, fax +39-0432-558222
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Pw_forum mailing list
>> Pw_forum at pwscf.org
>> http://www.democritos.it/mailman/listinfo/pw_forum
>> 
> 
> 
> 
> -- 
> Nichols A. Romero, Ph.D.
> Argonne Leadership Computing Facility
> Argonne, IL 60490
> (630) 447-9793

§ Gabriele Sclauzero, EPFL SB ITP CSEA
   PH H2 462, Station 3, CH-1015 Lausanne
