[QE-users] PBS script and tmp storage
Paolo Giannozzi
p.giannozzi at gmail.com
Fri Feb 21 08:18:01 CET 2020
What you describe no longer seems to happen in the development
version. There have been a few changes since then, and now all operations on
a file are done only by the processor that reads or writes it. Note, however,
that there might still be problems with k-point parallelization. Basically:
I/O on non-parallel file systems is not guaranteed to work.
Paolo
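
Since I/O is only expected to be reliable when outdir is visible to every process, one quick check before submitting a job is to look at the filesystem type behind outdir. The following is a minimal sketch, assuming GNU coreutils `stat` is available; the function names `fs_kind` and `outdir_fs_type` are hypothetical, and the list of network filesystem names is an assumption that is certainly not exhaustive. A "shared" result only means the directory is probably visible from all nodes; it does not make the filesystem a parallel one.

```shell
#!/bin/bash
# Classify a filesystem type name as "shared" (network/parallel) or "local".
# The names matched here are an assumption, not an exhaustive list.
fs_kind() {
    case "$1" in
        nfs|lustre|gpfs) echo "shared" ;;
        *) echo "local" ;;
    esac
}

# Print the filesystem type of the directory given as $1,
# in the human-readable form used by GNU coreutils stat.
outdir_fs_type() {
    stat -f -c %T "$1"
}
```

For example, `fs_kind "$(outdir_fs_type ./tmp_qe)"` would print `local` for an outdir that lives on a node's own disk, which is the situation that caused the problems described in this thread.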
On Thu, Feb 20, 2020 at 7:48 PM janardhan H.L. <janardhanhl at yahoo.com>
wrote:
> Dear prof. Giannozzi
>
> I am writing to the same thread as it may be relevant here.
>
> I am using QE 6.5 on a 3-node Linux cluster.
> The calculation itself runs normally, but something unusual happens when
> the wavefunctions are saved:
> 1) the calculation exits without the time stamps and the job done stamp;
> 2) this is caused by an MPI exit on one of the nodes, which cannot write
> to outdir;
> 3) the scf run only starts after copying the files to the slave nodes;
> without them it terminates, saying that files cannot be read;
> 4) after pointing outdir to a common path (via NFS), these errors
> disappeared.
>
> 1) My question is: if recent versions of QE collect all the wavefunctions
> on the head node, why do the slave nodes report an MPI abort when they have
> no access to the head node?
> 2) Is there any way to restart calculations without copying files to the
> slave nodes?
>
> Thanks and regards
> Janardhan
>
>
>
>
> On Thursday, 20 February, 2020, 11:15:53 pm IST, Paolo Giannozzi <
> p.giannozzi at gmail.com> wrote:
>
>
> It's a long story. By default, recent versions of QE collect both the
> wavefunctions and the charge density into a single array on a single
> processor that writes them to file. Even if you do not have a parallel file
> system, your data is no longer spread on scratch directories that are not
> visible to the other processors. This means that in principle it is
> possible to restart, with several potential caveats:
> - there is no guarantee that a batch queuing system will distribute
> processes across processors in the same way as in the previous run;
> - pseudopotential files are in principle read from the data file, so they
> may still be a source of problems;
> - if you parallelize on k-points with Nk pools, one process per pool will
> write wavefunctions, which will thus end up on Nk different processors.
>
> Paolo
>
> On Thu, Feb 20, 2020 at 4:54 PM alberto <voodoo.bender at gmail.com> wrote:
>
> Hi,
> I'm using QE for some single-point simulations.
> In particular, I'm running scf/nscf calculations.
>
> In my input block:
>
> calculation = 'nscf' ,
> restart_mode = 'from_scratch' ,
> outdir = './tmp_qe' ,
> pseudo_dir = '/home/alberto/QUANTUM_ESPRESSO/BASIS/upf_files/' ,
> prefix = 'BIS-IMID-PbI4_SR' ,
> verbosity = 'high' ,
> etot_conv_thr = 1.0D-8 ,
> forc_conv_thr = 1.0D-7 ,
> wf_collect = .true.
>
> The outdir is located in /home/alberto/, and I notice that the
> writing/reading time is very long.
>
> I would like to use the /tmp dir of one node where the job is running.
> (My cluster has some xeon nodes with 20 CPUs each.)
>
> This is my PBS script:
>
> ## Script for parallel Quantum Espresso job by Alberto
> ## Run script with 3 arguments:
> ## $1 = Name of input-file, without extension
> ## $2 = Number of MPI processes to use (nodes = np/20)
> ## $3 = Module to run
>
> if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
> echo "Usage: $0 <input_file> <np> <module>"
> exit 1
> fi
>
> if [ $2 -ge 8 ]; then
> NODES=$(($2/20))
> CPUS=20
> else
> NODES=1
> CPUS=$2
> fi
>
> cat<<EOF>$1.job
> #!/bin/bash
> #PBS -l nodes=xeon1:ppn=$CPUS:xeon20+xeon2:ppn=$CPUS:xeon20+xeon3:ppn=$CPUS:xeon20+xeon4:ppn=$CPUS:xeon20+xeon5:ppn=$CPUS:xeon20+xeon6:ppn=$CPUS:xeon20
> #PBS -l walltime=9999:00:00
> #PBS -N $1
> #PBS -e $1.err
> #PBS -o $1.sum
> #PBS -j oe
> job=$1 # Name of input file, no extension
> project=\$PBS_O_WORKDIR
> cd \$project
> cat \$PBS_NODEFILE > \$PBS_O_WORKDIR/nodes.txt
>
> export OMP_NUM_THREADS=$(($2/40))
> time /opt/openmpi-1.4.5/bin/mpirun -machinefile \$PBS_NODEFILE -np $2 /opt/qe-6.4.1/bin/$3 -ntg $(($2/60)) -npool $(($2/60)) < $1.inp > $1.out
> EOF
>
> qsub $1.job
>
> How could I use the /tmp directory and keep the nscf calculation from
> stopping because no files are found? The files really are present,
> but they are spread over different nodes.
>
> regards
>
> Alberto
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
> users mailing list users at lists.quantum-espresso.org
> https://lists.quantum-espresso.org/mailman/listinfo/users
>
>
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
>
--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
Phone +39-0432-558216, fax +39-0432-558222
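
On the original question about using /tmp as outdir: one common pattern is to copy the previous results into node-local scratch before the run and copy them back afterwards. The following is only a sketch under a strong assumption, namely that all MPI processes of the job run on ONE node, so that a single /tmp is seen by every process; on several nodes the caveats listed in this thread still apply. The helper names `stage_in` and `stage_out` are hypothetical, and `cp -a` is assumed to be GNU cp.

```shell
#!/bin/bash
# Sketch: stage a QE outdir to node-local scratch and back.
# Assumption: every MPI process of the job runs on the same node.

# Copy the contents of the permanent outdir ($1) into the scratch dir ($2).
# If the permanent outdir does not exist yet, only the scratch dir is created.
stage_in() {
    mkdir -p "$2"
    if [ -d "$1" ]; then
        cp -a "$1/." "$2/"
    fi
}

# Copy everything from the scratch dir ($1) back to the permanent outdir ($2).
stage_out() {
    mkdir -p "$2"
    cp -a "$1/." "$2/"
}

# Possible use inside the generated job script (illustrative only):
#   SCRATCH=/tmp/$PBS_JOBID
#   stage_in  "$PBS_O_WORKDIR/tmp_qe" "$SCRATCH"
#   ... run the scf/nscf step with outdir pointing at $SCRATCH ...
#   stage_out "$SCRATCH" "$PBS_O_WORKDIR/tmp_qe"
```

With this layout the nscf step finds the scf data in the same scratch directory, and the slow home directory is touched only twice per job instead of during every read and write.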