[Pw_forum] I/O performance on BG/P systems
sclauzer at sissa.it
Mon Apr 11 22:50:48 CEST 2011
Dear Paolo and Nichols,
As a follow-up, I had a brief meeting with the sysadmin of our local BG/P. It looks like the timings I was reporting actually correspond to the maximum I/O throughput of that specific rack, which depends on the number of I/O nodes present on the rack itself (in that case, 4 I/O nodes per midplane, each of them capable of 350 MB/s, corresponding to 1.4 GB/s for that midplane).
In the example I was reporting:
davcio : 1083.30s CPU 1083.30s WALL ( 38 calls)
I was running on just 128 nodes (512 cores in VN mode), so I had only one I/O node (1 midplane = 4 x 128 nodes, for the non-BG/P-ers). Now, the total size of the .wfc files was around 9200 MB, which cannot be written in less than 9200/350 = 26.3 sec, according to the figures that the sysadmins gave me.
In my case the timings give 1083.30s/38 = 28.5s per call, which is close to that lower bound, i.e. close to the maximum throughput of a single I/O node.
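The estimate above can be reproduced with a short Python sketch. It only restates the arithmetic of this message: the function names and default arguments are hypothetical, the 350 MB/s per I/O node is the figure quoted by the sysadmins, and it assumes (as the estimate does) that each davcio call moves the full 9200 MB.

```python
def io_nodes_for(compute_nodes, io_nodes_per_midplane=4, nodes_per_midplane=512):
    """I/O nodes serving a partition, assuming they are shared evenly
    across the compute nodes of a midplane (hypothetical helper)."""
    return max(1, compute_nodes * io_nodes_per_midplane // nodes_per_midplane)


def min_write_time_s(size_mb, io_nodes, mb_per_s_per_io_node=350.0):
    """Lower bound on the time needed to write size_mb megabytes."""
    return size_mb / (io_nodes * mb_per_s_per_io_node)


n_io = io_nodes_for(128)                 # 128 nodes on a 4-I/O-node midplane
bound = min_write_time_s(9200.0, n_io)   # 9200 MB of .wfc files
measured = 1083.30 / 38                  # davcio wall time per call

print(f"I/O nodes: {n_io}")
print(f"lower bound: {bound:.1f} s, measured: {measured:.1f} s")
print(f"effective bandwidth: {9200.0 / measured:.0f} MB/s")
```

On a midplane with 16 I/O nodes, the same 128-node partition would get io_nodes_for(128, io_nodes_per_midplane=16) I/O nodes, so the bound would drop by the same factor if the I/O really scales.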
I will perform more testing and I will take into consideration Nichols's suggestion about the number of files per node. On our machine we have one rack with 16 I/O nodes per midplane, so I will try to see if the I/O performance scales accordingly.
As a side note, I ran into a problem with the timing procedure. I found very different davcio timings (i.e. 3 orders of magnitude!) for two jobs where the size of the wavefunctions differed only by a factor of 2 (the jobs were executed on the same rack, with the same number of processors and the same parallelization scheme).
The sysadmins replied that the I/O bandwidth measured in the fastest case is not attainable on BG/P, and should be attributed to an inaccurate measurement of cputime/walltime.
I'm going to investigate this anyway.
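As a generic illustration of why the choice of clock matters here (this is plain Python, not the timing routine used in QE, and all names are hypothetical): a process that is blocked waiting for I/O accumulates wall time but almost no CPU time, so a timer that reads the wrong clock can report an implausibly high bandwidth. In the sketch below a sleep stands in for a blocking write.

```python
import time


def timed(fn):
    """Run fn once and return (cpu_seconds, wall_seconds)."""
    wall0 = time.perf_counter()   # wall clock
    cpu0 = time.process_time()    # CPU time of this process only
    fn()
    return time.process_time() - cpu0, time.perf_counter() - wall0


def blocking_wait():
    # Stand-in for a write that blocks on the file system (assumption).
    time.sleep(0.2)


cpu_s, wall_s = timed(blocking_wait)
print(f"cpu: {cpu_s:.3f} s, wall: {wall_s:.3f} s")
```

A bandwidth computed from cpu_s instead of wall_s would be overestimated by the ratio of the two.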
I'm not aware of anyone working on MPI I/O porting.
Thanks so far for your suggestions,
On 11 Apr 2011, at 20:02, Nichols A. Romero wrote:
> Sorry for not replying earlier, but I missed this e-mail due to the
> APS March Meeting.
> The GPFS file system on BG/P does a poor job at handling writes to more than
> one file per node. My guess is that Gabriele was running QE in either dual
> or VN mode (2 and 4 MPI tasks per node, respectively). So on BG/P, you
> basically want to write one file per node (which GPFS is designed to handle)
> or one big file using MPI-I/O.
> At ANL, we are thinking about re-writing some of the I/O using parallel
> I/O (e.g. HDF5, Parallel NetCDF). The simplest approach, though highly
> unportable, is to use MPI-I/O directly.
> Has anyone on this list worked on parallel I/O with QE? Or have any strong
> opinions on this issue?
> On Wed, Mar 30, 2011 at 11:57 AM, Paolo Giannozzi
> <giannozz at democritos.it> wrote:
>> On Mar 30, 2011, at 11:20 , Gabriele Sclauzero wrote:
>>> Do you think that having an additional optional level of I/O
>>> (let's say that it might be called "medium")
>> I propose 'rare', 'medium', 'well done'
>>> would be too confusing for users?
>> some users get confused no matter what
>>> I could try to implement and test it.
>> ok: just follow the "io_level" variable. Try first to understand
>> what the actual behavior is (the documentation is not so
>> clear on this point) and then think what it should be, if you
>> have some clear ideas
>> Paolo Giannozzi, Dept of Chemistry&Physics&Environment,
>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>> Phone +39-0432-558216, fax +39-0432-558222
>> Pw_forum mailing list
>> Pw_forum at pwscf.org
> Nichols A. Romero, Ph.D.
> Argonne Leadership Computing Facility
> Argonne, IL 60490
> (630) 447-9793
Gabriele Sclauzero, EPFL SB ITP CSEA
PH H2 462, Station 3, CH-1015 Lausanne