[Pw_forum] I/O performance on BG/P systems

Nichols A. Romero naromero at gmail.com
Tue Apr 12 00:15:33 CEST 2011


Gabriele,

A couple of other technical details that I recall:
1. The BG/P at ANL has 1 I/O node per 64 compute nodes, i.e. 8 I/O nodes per midplane.
2. The bottleneck of writing multiple files per MPI task does not become
   serious until about 8+ racks.
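
For reference, the bandwidth arithmetic that comes up in Gabriele's mail
quoted below can be spelled out in a throwaway snippet (the numbers are the
ones reported in the thread; the 350 MB/s per I/O node is the figure the
sysadmins gave, not something I have measured myself):

    /* Back-of-the-envelope check of the davcio timings discussed below. */
    #include <stdio.h>

    int main(void)
    {
        const double wfc_size_mb   = 9200.0;   /* total size of the .wfc files */
        const double bw_per_ionode = 350.0;    /* MB/s per BG/P I/O node (sysadmins' figure) */
        const int    n_ionodes     = 1;        /* 128 compute nodes -> 1 I/O node on that rack */
        const double davcio_wall_s = 1083.30;  /* total davcio wall time reported */
        const int    davcio_calls  = 38;

        /* ~26.3 s minimum per write vs ~28.5 s measured per davcio call */
        printf("theoretical minimum: %.1f s\n",
               wfc_size_mb / (n_ionodes * bw_per_ionode));
        printf("measured per call  : %.1f s\n",
               davcio_wall_s / davcio_calls);
        return 0;
    }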

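Since the one-big-file route keeps coming up (see my earlier mail quoted
below), here is a minimal sketch of what a collective shared-file write
looks like with plain MPI-I/O. This is only an illustration, not QE's
actual davcio/io_level code path; the file name, buffer size and data
layout are made up.

    /* Each MPI task writes its contiguous slice of a distributed array
     * into one shared file with a collective MPI-I/O call. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const MPI_Offset nlocal = 1 << 20;             /* doubles per task (example size) */
        double *buf = malloc(nlocal * sizeof(double));
        for (MPI_Offset i = 0; i < nlocal; i++)
            buf[i] = (double)rank;                     /* dummy wavefunction-like data */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "wfc_all.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Rank-dependent offset into the shared file; the collective call
         * lets the MPI-I/O layer aggregate traffic through the I/O nodes. */
        MPI_Offset offset = (MPI_Offset)rank * nlocal * (MPI_Offset)sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, (int)nlocal, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

HDF5 and Parallel NetCDF essentially wrap this kind of call in a portable,
self-describing file format, which is the main argument for using them
instead of raw MPI-I/O.
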
On Mon, Apr 11, 2011 at 3:50 PM, Gabriele Sclauzero <sclauzer at sissa.it> wrote:
> Dear Paolo and Nichols,
>     as a follow-up, I had a brief meeting with the sysadmin of our local
> BG/P. It looks like the timings I was reporting actually correspond to the
> maximum I/O throughput of that specific rack, which depends on the number
> of I/O nodes present on the rack itself (in that case, 4 I/O nodes per
> midplane, each of them capable of 350 MB/s, corresponding to 1.7 GB/s for
> that midplane).
> In the example I was reporting:
>      davcio       :   1083.30s CPU   1083.30s WALL (      38 calls)
> I was running on just 128 nodes (512 cores in VN mode), therefore I had
> only one I/O node (1 midplane = 4 x 128 nodes, for the non-BG/P-ers). Now,
> the total size of the .wfc files was around 9200 MB, which cannot be
> written in less than 9200/350 = 26.3 s, according to the figures that the
> sysadmins gave me. In my case the timings give 1083.30s/38 = 28.5 s per
> call, i.e. the achieved bandwidth is close to the theoretical maximum.
> I will perform more testing and I will take into consideration Nichols'
> suggestion about the number of files per node. On our machine we have one
> rack with 16 I/O nodes per midplane; I will try to see whether the I/O
> performance scales accordingly.
> As a side note, I ran into a problem with the timing procedure: I found
> very different davcio timings (i.e. 3 orders of magnitude!) for two jobs
> in which the size of the wavefunctions differed only by a factor of 2 (the
> jobs were executed on the same rack, with the same number of processors
> and the same parallelization scheme).
> The sysadmins replied that the I/O bandwidth measured in the faster case
> is not attainable on BG/P, and that the discrepancy should be attributed
> to an inaccurate measurement of cputime/walltime.
> I'm going to investigate this anyway.
> I'm not aware of anyone working on MPI I/O porting.
> Thanks so far for your suggestions,
>
>
> Gabriele
>
>
> On 11 Apr 2011, at 20:02, Nichols A. Romero wrote:
>
> Sorry for not replying earlier, but I missed this e-mail due to the
> APS March Meeting.
>
> The GPFS file system on BG/P does a poor job of handling writes to more
> than one file per node. My guess is that Gabriele was running QE in either
> dual or VN mode (2 and 4 MPI tasks per node, respectively). So on BG/P you
> basically want to write one file per node (which GPFS is designed to
> handle) or one big file using MPI-I/O.
>
> At ANL, we are thinking about rewriting some of the I/O using parallel I/O
> libraries (e.g. HDF5, Parallel NetCDF). The simplest approach, though
> highly unportable, is to use MPI I/O directly.
>
> Has anyone on this list worked on parallel I/O with QE? Or does anyone
> have strong opinions on this issue?
>
>
> On Wed, Mar 30, 2011 at 11:57 AM, Paolo Giannozzi
> <giannozz at democritos.it> wrote:
>
> On Mar 30, 2011, at 11:20, Gabriele Sclauzero wrote:
>
>> Do you think that having an additional optional level of I/O
>> (let's say that it might be called "medium")
>
> I propose 'rare', 'medium', 'well done'
>
>> would be too confusing for users?
>
> some users get confused no matter what
>
>> I could try to implement and test it.
>
> ok: just follow the "io_level" variable. Try first to understand
> what the actual behavior is (the documentation is not so
> clear on this point) and then think what it should be, if you
> have some clear ideas
>
> P.
>
> ---
> Paolo Giannozzi, Dept of Chemistry&Physics&Environment,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>
> § Gabriele Sclauzero, EPFL SB ITP CSEA
>    PH H2 462, Station 3, CH-1015 Lausanne



-- 
Nichols A. Romero, Ph.D.
Argonne Leadership Computing Facility
Argonne, IL 60490
(630) 447-9793


