[Pw_forum] wfc files: heavy I/O, handling for restarts

S. K. S. sks.jnc at gmail.com
Mon Sep 5 21:05:02 CEST 2011


>>>>>>>> # Scatter wfc restart files
       awk '{ files_for[$1] = files_for[$1] " '$basename'.wfc" NR }
           END { for (host in files_for) print host, files_for[host] }' $PBS_NODEFILE \
           | while read host files
           do
               ssh -n $host "cd $PBS_O_WORKDIR; mv $files $ESPRESSO_TMPDIR/"
           done
       # on master host, copy .save directory as well
       rsync -a $basename.save $ESPRESSO_TMPDIR


       mpirun  -x ESPRESSO_TMPDIR \
               -np $(wc -l < $PBS_NODEFILE) \
               -machinefile  $PBS_NODEFILE  \
               pw.x -inp input.txt > output.txt

       # Gather remote files
       uniq $PBS_NODEFILE \
           | while read host
           do
               ssh -n $host "rsync -a $TMPDIR/ $PBS_O_WORKDIR/"
           done
       ------------------------------

E.g. for a job with nodes=3:ppn=4 the scatter part would distribute the
existing files pwscf.wfc{1..12} as follows: <<<<<<
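
The mapping produced by the awk step can be checked offline, without a
cluster. This is a minimal sketch, assuming a hypothetical node file in
which the made-up hosts node1, node2 and node3 each appear four consecutive
times (one line per core), and passing the basename with -v instead of
splicing it into the awk program:

```shell
# Hypothetical node file for nodes=3:ppn=4: each host listed once per core.
basename=pwscf
nodefile=$(mktemp)
for host in node1 node2 node3; do
    for core in 1 2 3 4; do echo "$host"; done
done > "$nodefile"

# Same idea as the scatter step: the line number NR selects the wfc file,
# and all files belonging to one host are collected on one output line.
scatter_map=$(awk -v base="$basename" '
    { files_for[$1] = files_for[$1] " " base ".wfc" NR }
    END { for (host in files_for) print host files_for[host] }' "$nodefile" | sort)

echo "$scatter_map"
rm -f "$nodefile"
```

Under these assumptions each host is assigned four consecutive files:
pwscf.wfc1 to pwscf.wfc4 for node1, pwscf.wfc5 to pwscf.wfc8 for node2, and
pwscf.wfc9 to pwscf.wfc12 for node3. The sort is only there because awk
does not guarantee the order of a "for (host in ...)" loop.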

Yes, this does look cumbersome. It becomes more painful when one cannot
know a priori which nodes the job will run on, particularly when that is
decided entirely by the queue scheduler according to which nodes are free.
In that situation a restarted job may land on a completely different set of
nodes, and the phonon calculation cannot find the files it needs to
restart. Restarting a phonon calculation then becomes even more difficult.


There seems to be a more serious problem in recent versions of QE. In
versions before QE 4.2, the codes used to replicate the same necessary
files on the local disks of all the nodes, so at least a phonon
calculation could run smoothly instead of crashing. In recent versions,
phonon calculations simply stop, complaining that the distributed .wfc
files on one node are not visible from another node.
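
The older replicate-everywhere behaviour could in principle be imitated by
hand in the job script. The following is only a sketch: replicate_cmds is a
hypothetical helper, the host names and paths are made up, the source
directory is assumed to sit on a file system visible from every node, and
whether ph.x then restarts cleanly has not been verified. It prints the
commands instead of running them:

```shell
# replicate_cmds NODEFILE BASENAME SRC_DIR DEST_DIR   (hypothetical helper)
# Prints one rsync command per distinct host; nothing is executed.
replicate_cmds() {
    nf=$1; base=$2; src=$3; dest=$4
    sort -u "$nf" | while read -r host
    do
        # Every host receives the full set of wfc files, not only its share.
        echo "ssh -n $host \"rsync -a $src/$base.wfc* $dest/\""
    done
}

# Example with a hypothetical two-host node file and made-up paths.
nodefile=$(mktemp)
printf 'node1\nnode1\nnode2\nnode2\n' > "$nodefile"
replicate_plan=$(replicate_cmds "$nodefile" pwscf /home/user/run /scratch/tmp)
echo "$replicate_plan"
rm -f "$nodefile"
```

Once the printed plan looks right it can be piped to sh. The cost is disk
space and copy time: every node holds all the wfc files, not just its own.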

If a quick remedy for this problem is not easy, then for the time being it
would be better to keep the earlier option of replicating the same .wfc
files on all the nodes working in version 4.3.1. A better option would be
to implement the "WF_COLLECT" trick in the phonon code as well, as is
already done for pw.x.
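
On the pw.x side the option is a single line in the &control namelist. This
is a minimal sketch, assuming a hypothetical fragment file name and prefix.
A real input also needs the remaining namelists and cards:

```shell
# Write only the &control part of a pw.x input (hypothetical file name).
# wf_collect = .true. asks pw.x to collect the wavefunctions from all
# processors into the .save directory instead of leaving per-process files.
cat > control_fragment.in <<'EOF'
&control
    calculation = 'scf'
    prefix      = 'pwscf'
    wf_collect  = .true.
/
EOF
```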

Thanks and regards,
Saha SK
R&D Assistant
JNCASR
Bangalore 560064