[Pw_forum] Question of a system administrator about pw.x usage

Filippo Spiga spiga.filippo at gmail.com
Fri Sep 21 22:38:04 CEST 2012


Dear Silas,

Unfortunately, there is little you can do to improve the performance of an NFS-based distributed file system. Try to keep I/O to a minimum (set disk_io = 'none' in the input file).
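
As a rough illustration, the small Python helper below prints the kind of &CONTROL fragment this refers to. It is only a sketch: disk_io and outdir are pw.x input variables, but the prefix "mysystem" and the scratch path are hypothetical placeholders, and the accepted values of disk_io should be checked against the input documentation for the installed version.

```python
def control_namelist(prefix, outdir, disk_io="none"):
    """Return a &CONTROL namelist fragment for a pw.x input file.

    The prefix and outdir values are supplied by the caller; the ones
    used in the example below are hypothetical placeholders.
    """
    lines = [
        "&CONTROL",
        f"  prefix  = '{prefix}'",
        f"  outdir  = '{outdir}'",
        f"  disk_io = '{disk_io}'",
        "/",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job name and node-local scratch path.
    print(control_namelist("mysystem", "/scratch/local/myjob"), end="")
```

The user would merge these lines into the &CONTROL section of her existing input file rather than replace it.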

I remember that you can tell PW to write each wfc.XX file to the local scratch (am I wrong? has something changed?) by simply pointing the output directory to the local scratch (though it might be necessary to specify something else…). Then, at the end, an epilogue script (this is the PBS/TORQUE name; other resource managers have the same capability, but the file might be named differently) can collect all the files in a central location for post-processing. You do not reduce the amount of data you move, but by using rcp or other tricks you can definitely reduce the overall performance loss caused by moving data over the network. The catch is that you have to write the epilogue script yourself. It can be bash, csh or even python. A scripting exercise :-)
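
To give an idea of the collection step, here is a minimal Python sketch. It assumes the script is run once on each node that holds files, with the node-local scratch directory and the central destination passed as two arguments. How and where the resource manager invokes an epilogue, and which arguments it passes, differ between installations, so the command-line handling here is a hypothetical convention and not the PBS/TORQUE interface. It uses a plain copy; rcp or another transfer tool could be substituted.

```python
import shutil
import sys
from pathlib import Path


def collect_outputs(local_scratch, central_dir, pattern="*"):
    """Copy files matching `pattern` from local_scratch into central_dir.

    The directory layout below local_scratch is preserved.
    Returns the list of destination paths that were written.
    """
    src = Path(local_scratch)
    dst = Path(central_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.rglob(pattern)):
        if path.is_file():
            target = dst / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # keeps timestamps and permissions
            copied.append(target)
    return copied


# Hypothetical usage: collect.py <local_scratch> <central_dir>
if __name__ == "__main__" and len(sys.argv) == 3:
    for written in collect_outputs(sys.argv[1], sys.argv[2]):
        print(written)
```

The copy happens once, after the run has finished, so the NFS server sees one sequential transfer per node instead of many small writes during the calculation.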

Let's wait for other suggestions from the true experts…

HTH

F.

On Sep 21, 2012, at 7:10 PM, Silas Silva <silas.silva at ufabc.edu.br> wrote:
> Hello all,
> 
> I'm a system administrator of the High Performance Computing Center at
> Universidade Federal do ABC - Brazil (http://hpc.ufabc.edu.br/).  I'm
> not familiar with the internals of scientific research and the tools you
> use, but we have run into problems regarding the pw.x usage of one of
> the users we support.
> 
> She has a simple job submission file that runs pw.x, with an input and an
> output file.  After hours of execution, the system load increases
> absurdly.  We can see it is I/O stuff, but we could not discover why it
> happens later in the execution of the program nor how to fix it.
> 
> In this cluster, we use NFS for both "distributed scratch" and home
> folders (we know we should use a modern parallel file system, but it is
> not possible for the moment), but each node has a big local scratch
> partition.
> 
> Some questions:
> 
> 1. Why does I/O happen later in the execution of pw.x?
> 
> 2. Documentation (here: http://www.quantum-espresso.org/wp-content/uploads/Doc/user_guide/node18.html#SECTION00043100000000000000)
>   is not clear about the "distributed" or "collected" work.  Although
>   it has some tips, I still wonder which configuration we should
>   suggest to our user.  What can be "parallelized"?  What may remain
>   in one place?
> 
> 3. Is there any flag or configuration we can pass to pw.x to see what it
>   is doing?  Any debug flag?
> 
> 4. Are the variables you place in the input file documented anywhere?
> 
> Thank you very much.
> 
> -- 
> Silas Silva
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum

--
Mr. Filippo SPIGA, M.Sc., Ph.D. Candidate 
CADMOS - Chair of Numerical Algorithms and HPC (ANCHP)
École Polytechnique Fédérale de Lausanne (EPFL)
http://anchp.epfl.ch ~ http://filippospiga.me ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
