[Pw_forum] wfc files: heavy I/O, handling for restarts

Michael Sternberg sternberg at anl.gov
Mon Sep 5 05:22:31 CEST 2011


Dear fellow users and developers,

What's the current wisdom regarding wfc updates hammering a networked file system?


Details:

I have trouble with what the author of the user guide <http://www.quantum-espresso.org/user_guide/node18.html> knowingly calls "excessive I/O": I see users running some 20 .. 40 pw.x processes which concurrently write large wfc files, and those writes choke my Lustre file system. So, count me in as an "angry system manager". I will be throwing more hardware at this problem shortly, but I feel there is room for improvement in other ways.

The problem arises because pw.x is being run with the somewhat lazy setting of ESPRESSO_TMPDIR=".", which means all scratch files get dumped into $PBS_O_WORKDIR, typically somewhere in $HOME or on a parallel scratch file system. I wonder to what extent "a modern Parallel File System", as prescribed by the documentation, is actually needed, beyond the requirement that it provide lots of R/W bandwidth. If one MPI rank writes a file, must that file really be visible to, or even readable by, another MPI rank? It appears not: one can run pw.x just fine with $ESPRESSO_TMPDIR pointing to local scratch directories on the nodes.

The tricky bits are

	- stageout, i.e., gathering wfc (and bfgs) files from the nodes upon job termination, regular or otherwise, so as to preserve intermediate results, and

	- stagein for restarts, i.e., providing to each MPI rank exactly the required .wfc<nnn> file in its local $ESPRESSO_TMPDIR.

I did a proof-of-concept using the following code in a PBS/Torque job file:

	---------------------------------------------------
	# $TMPDIR is provided as pointing to a job-specific node-local scratch
	# directory that is created on all nodes under the same name.
	export ESPRESSO_TMPDIR=$TMPDIR

	basename="pwscf"

	# Scatter wfc restart files
	awk '{ files_for[$1] = files_for[$1] " '$basename'.wfc" NR }
	    END { for (host in files_for) print host, files_for[host] }' $PBS_NODEFILE \
	    | while read host files
	    do
	        ssh -n $host "cd $PBS_O_WORKDIR; mv $files $ESPRESSO_TMPDIR/"
	    done
	# on master host, copy .save directory as well
	rsync -a $basename.save $ESPRESSO_TMPDIR

	mpirun -x ESPRESSO_TMPDIR \
	       -np $(wc -l < $PBS_NODEFILE) \
	       -machinefile $PBS_NODEFILE \
	       pw.x -inp input.txt > output.txt

	# Gather remote files
	uniq $PBS_NODEFILE \
	    | while read host
	    do
	        ssh -n $host "rsync -a $TMPDIR/ $PBS_O_WORKDIR/"
	    done
	---------------------------------------------------

E.g., for a job with nodes=3:ppn=4, the scatter part would distribute the existing files pwscf.wfc{1..12} as follows:

	ssh -n n340 'cd /home/stern/test/quantum-espresso/restart_test/run8; mv pwscf.wfc5 pwscf.wfc6 pwscf.wfc7 pwscf.wfc8 /tmp/191405.mds01.carboncluster/'
	ssh -n n342 'cd /home/stern/test/quantum-espresso/restart_test/run8; mv pwscf.wfc9 pwscf.wfc10 pwscf.wfc11 pwscf.wfc12 /tmp/191405.mds01.carboncluster/'
	ssh -n n339 'cd /home/stern/test/quantum-espresso/restart_test/run8; mv pwscf.wfc1 pwscf.wfc2 pwscf.wfc3 pwscf.wfc4 /tmp/191405.mds01.carboncluster/'
	rsync -a pwscf.save /tmp/191405.mds01.carboncluster

(I chose "mv" rather than "cp" for the proof of concept, to be sure that only one instance of each wfc file exists at any time.)

Now, this is of course cumbersome code to repeat in production job scripts, but the scatter and gather bits could be isolated into utility scripts callable by a single line. Torque provides for Prologue & Epilogue Scripts <http://www.adaptivecomputing.com/resources/docs/torque/a.gprologueepilogue.php>, but those have rather restrictive runtime environments.
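
For illustration, a single wrapper along the following lines might do. This is only a sketch; the name qe_stage is made up, and it assumes the same conventions as the proof of concept above: $PBS_NODEFILE, $PBS_O_WORKDIR, a job-wide $TMPDIR created under the same name on every node (the value the job exports as ESPRESSO_TMPDIR), and password-less ssh between the nodes.

	---------------------------------------------------
	#!/bin/sh
	# qe_stage {in|out} [basename] -- hypothetical wrapper around the
	# scatter/gather code above, so a production job file only needs
	#     qe_stage in  pwscf     # before mpirun
	#     qe_stage out           # after mpirun (and/or from a trap)
	# Assumes $PBS_NODEFILE, $PBS_O_WORKDIR, a job-wide $TMPDIR that exists
	# under the same name on every node, and password-less ssh.
	mode=$1
	basename=${2:-pwscf}

	case $mode in
	in)
	    # stagein: rank n (line n of $PBS_NODEFILE) gets $basename.wfc<n>;
	    # on a first run with no wfc files, mv merely complains.
	    awk '{ files_for[$1] = files_for[$1] " '$basename'.wfc" NR }
	        END { for (host in files_for) print host, files_for[host] }' $PBS_NODEFILE \
	        | while read host files
	        do
	            ssh -n $host "cd $PBS_O_WORKDIR && mv $files $TMPDIR/"
	        done
	    # the master host also needs the .save directory
	    rsync -a $basename.save $TMPDIR/
	    ;;
	out)
	    # stageout: pull every node-local scratch directory back
	    uniq $PBS_NODEFILE \
	        | while read host
	        do
	            ssh -n $host "rsync -a $TMPDIR/ $PBS_O_WORKDIR/"
	        done
	    ;;
	*)
	    echo "usage: qe_stage {in|out} [basename]" >&2
	    exit 2
	    ;;
	esac
	---------------------------------------------------

Calling "qe_stage out" from a signal trap in the job script would also cover the "termination, regular or otherwise" case mentioned above.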

Is this something to pursue further?

To avoid the stagein file name+number juggling, could the fopen() functions for wfc files (and others) perhaps be wrapped such that if a file is not found in $ESPRESSO_TMPDIR it is read instead from "." but written to $ESPRESSO_TMPDIR?  The stageout is somewhat simpler and in fact not specific to pw.x at all.
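
Short of touching the code, that read-fallback could at least be approximated from the job script: before mpirun, copy each expected wfc file from "." into the node-local scratch directory unless it is already there, and let pw.x read and write only in $ESPRESSO_TMPDIR. Again just a sketch, assuming $basename and the other conventions of the job file above:

	---------------------------------------------------
	# Sketch only: approximate the proposed "read from '.', write to
	# $ESPRESSO_TMPDIR" fallback outside the code.  One ssh per rank,
	# so not optimized, but good enough for a handful of nodes; on a
	# fresh run with no wfc files in ".", cp merely complains.
	awk '{ print $1, "'$basename'.wfc" NR }' $PBS_NODEFILE \
	    | while read host file
	    do
	        ssh -n $host \
	            "cd $PBS_O_WORKDIR && { [ -e $TMPDIR/$file ] || cp -p $file $TMPDIR/; }"
	    done
	---------------------------------------------------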


With best regards,
Michael

