[Pw_forum] wfc files: heavy I/O, handling for restarts
S. K. S.
sks.jnc at gmail.com
Mon Sep 5 21:05:02 CEST 2011
>>>>>>>> # Scatter wfc restart files
awk '{ files_for[$1] = files_for[$1] " '$basename'.wfc" NR }
     END { for (host in files_for) print host, files_for[host] }' \
    $PBS_NODEFILE \
    | while read host files
      do
          ssh -n $host "cd $PBS_O_WORKDIR; mv $files $ESPRESSO_TMPDIR/"
      done

# on master host, copy .save directory as well
rsync -a $basename.save $ESPRESSO_TMPDIR

mpirun -x ESPRESSO_TMPDIR \
       -np $(wc -l < $PBS_NODEFILE) \
       -machinefile $PBS_NODEFILE \
       pw.x -inp input.txt > output.txt

# Gather remote files
uniq $PBS_NODEFILE \
    | while read host
      do
          ssh -n $host "rsync -a $TMPDIR/ $PBS_O_WORKDIR/"
      done
------------------------------
E.g. for a job with nodes=3:ppn=4 the scatter part would distribute the
existing files pwscf.wfc{1..12} as follows: <<<<<<
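As a quick check of that mapping, the grouping done by the awk step can be reproduced outside the batch system with a hypothetical 12-entry nodefile (host names node1..node3 and the file nodefile.demo are placeholders standing in for the real $PBS_NODEFILE contents):

basename=pwscf
printf '%s\n' node1 node1 node1 node1 \
              node2 node2 node2 node2 \
              node3 node3 node3 node3 > nodefile.demo
awk '{ files_for[$1] = files_for[$1] " '$basename'.wfc" NR }
     END { for (host in files_for) print host, files_for[host] }' nodefile.demo
# prints, for each host and in arbitrary order, the wfc files assigned to it:
#   node1  pwscf.wfc1 pwscf.wfc2 pwscf.wfc3 pwscf.wfc4
#   node2  pwscf.wfc5 pwscf.wfc6 pwscf.wfc7 pwscf.wfc8
#   node3  pwscf.wfc9 pwscf.wfc10 pwscf.wfc11 pwscf.wfc12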
Yes, this does look cumbersome. It becomes even more painful when one cannot
know a priori on which nodes a job will run, in particular when that is decided
entirely by the automatic scheduler according to which nodes happen to be free.
In such a situation a restarted job may land on a completely different set of
nodes, and the phonon calculation cannot find the files it needs to restart,
which makes restarting a phonon calculation much more difficult.
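A small defence at the job-script level (just a sketch, reusing the variable
names of the quoted script and assuming $ESPRESSO_TMPDIR is the per-node local
scratch) is to make sure the gather step always runs when the batch script
finishes, so that the next restart only has to re-scatter the wfc files from
the shared $PBS_O_WORKDIR, whatever nodes it happens to get:

# pull the per-node wfc files back to the shared work directory
gather_wfc() {
    uniq $PBS_NODEFILE | while read host
    do
        ssh -n $host "rsync -a $ESPRESSO_TMPDIR/ $PBS_O_WORKDIR/"
    done
}
trap gather_wfc EXIT    # run the gather whenever the batch script exits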
There also seems to be a more serious problem in recent versions of QE. In
versions before QE 4.2, the codes used to replicate the necessary files to the
local disks of all the nodes, so at least a phonon calculation could run
smoothly instead of crashing. In the recent versions, however, phonon
calculations simply stop, complaining that the .wfc files distributed on one
node are not visible from another node.
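As a stop-gap, that old behaviour can be imitated by hand in the job script:
before launching ph.x, copy every .wfc file and the .save directory from the
shared work directory to the local scratch of every node. A rough sketch,
again reusing the variable names of the quoted script and assuming passwordless
ssh/rsync between the allocated nodes (as the quoted script already does):

# replicate the wavefunctions and the .save directory on every node
uniq $PBS_NODEFILE | while read host
do
    ssh -n $host "mkdir -p $ESPRESSO_TMPDIR"
    rsync -a $PBS_O_WORKDIR/$basename.wfc* $PBS_O_WORKDIR/$basename.save \
          $host:$ESPRESSO_TMPDIR/
done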
If a quick remedy for this problem is not easy, then at least, for the time
being, it would be better to keep the earlier option of replicating the same
.wfc files on all the nodes working in version 4.3.1. Another, better option
would be to implement the "wf_collect" trick in the phonon code as well, since
it is already there for pw.x.
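For reference, on the pw.x side the collected-wavefunction mode is just an
input flag: with wf_collect = .true. in the &CONTROL namelist, pw.x collects
the wavefunctions into outdir/prefix.save at the end of the run instead of
leaving only per-process .wfc files on local scratch, so a restart no longer
cares which nodes it gets. A minimal illustrative fragment, written with the
same job-script variables as above (calculation and the remaining namelists
and cards stay as in the existing input):

cat > input.txt <<EOF
&CONTROL
  calculation = 'scf'
  prefix      = '$basename'
  outdir      = '$ESPRESSO_TMPDIR'
  wf_collect  = .true.
/
EOF
# (append the &SYSTEM, &ELECTRONS namelists and the usual cards as before)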
Thanks and regards,
Saha SK
R&D Assistant
JNCASR
Bangalore 560064