[Pw_forum] submits jobs to node already in use

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Wed May 6 15:44:15 CEST 2009


On Wed, 2009-05-06 at 06:01 -0700, Jonas Baltrusaitis wrote:
> Axel, I mentioned PBS so qsub was implied. 

jonas,

please always be specific and never imply when posting messages
to a mailing list. i have seen people doing extremely nonsensical
things (in good faith), so it is impossible to tell up front whether
something is implied or not. more importantly, in these issues
details matter... a _lot_. if you leave them out, you restrict
the accuracy of the answer you get. you _do_ want to get an accurate
answer, don't you?

> Machinefile is where it's supposed to be mpiexec -n 4 -machinefile $PBS_NODEFILE
> 

> So I am not sure what's happening, if I submit say cpmd job it just
>  goes to a different node, whereas pwscf to the same which already has
>  a job running

i'd have to see the qsub script and know the qsub command line. but as
a matter of principle, the node assignment is done by the resource
manager and/or the MPI frontend command, _never_ by the application
itself. i very much dislike the use of machine files to begin with.
you should check whether the MPI library that you are using actually
supports the TM job launch mechanism of torque/pbs. in that case you
don't have to specify any machine file at all, _and_ have the added
benefit that the resource manager will watch, and if necessary kill,
all MPI tasks.
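
as a rough illustration of that check, and only under the assumption
that the MPI library is Open MPI (you did not say which one you use):
its ompi_info tool lists the compiled-in components, and a "tm" entry
indicates torque/pbs launch support. the helper name below is made up
for this sketch.

```shell
# sketch only: assumes Open MPI, whose `ompi_info` prints lines such as
#   "MCA ras: tm (MCA v2.0, API v2.0, Component v1.3.2)"
# when TM support was compiled in. other MPI libraries need other checks.
# has_tm_support is a hypothetical helper; it reads ompi_info output on stdin
# and succeeds if a tm launch/allocation component is listed.
has_tm_support() {
    grep -Eq 'MCA (plm|pls|ras): tm( |$)'
}

# usage (requires Open MPI to be installed):
#   ompi_info | has_tm_support && echo "TM launch available"
```

if the check succeeds, mpiexec can be started inside the job without
any -machinefile argument and takes the node list from the resource
manager.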

two possible explanations are: you have a typo in your script, or
you have multiple MPI implementations installed and you are launching
an application compiled against one of them with the launch script
of the other.
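
to look into the second possibility, something like the following
sketch may help. it lists every mpiexec/mpirun that can be found through
$PATH, in search order, so you can see whether more than one MPI
installation is visible to the job. the function name is made up for
this example.

```shell
# sketch: print every mpiexec/mpirun executable found in the directories
# listed in $PATH, in the order the shell would search them.
# list_mpi_launchers is a hypothetical helper name.
list_mpi_launchers() {
    local IFS=:
    local dir name
    for dir in $PATH; do
        for name in mpiexec mpirun; do
            if [ -x "$dir/$name" ]; then
                printf '%s\n' "$dir/$name"
            fi
        done
    done
    return 0
}

# usage inside the job script:
#   list_mpi_launchers
# and, for a dynamically linked executable, compare with the MPI library
# it was built against (the path is a placeholder):
#   ldd /path/to/pw.x | grep -i mpi
```

if the first launcher in the list belongs to a different MPI
installation than the library reported by ldd, that mismatch would be
worth fixing before anything else.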

> If I submit another pwscf with a simple script, not the one I adopted
>  from Examples 2 and 9, it submits it to an independent node. It must
>  be within my script, I guess Rocks forum was right after all. I'll
>  have to look through it carefully

i disagree. the shell scripts that you write have nothing to do with
the application itself. the job scripts shipped with Q-E are meant for
interactive use. if you adapt them for batch use, you are on your own.
in particular, the -machinefile or equivalent flag is effectively
something that works on good faith, not on control. ask any sysadmin
at a national center what pains they go through to confine parallel
jobs to the nodes that have been allocated to them.
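
since the machine file is only taken on good faith, it helps to have the
job print what it was actually given. a minimal sketch, assuming the
usual torque/pbs behavior of writing one line per assigned core into the
file named by $PBS_NODEFILE (the function name is made up):

```shell
# sketch: print "host count" for each distinct host in a PBS-style node
# file, i.e. how many slots the resource manager assigned on each host.
# summarize_nodefile is a hypothetical helper name.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# usage near the top of the job script, before calling mpiexec:
#   summarize_nodefile "$PBS_NODEFILE"
```

comparing this output between a job that lands on a free node and one
that lands on a busy node should show whether the allocation or the
launch step is at fault.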

cheers,
    axel.

> Jonas
> 
> 
> --- On Tue, 5/5/09, Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu> wrote:
> 
> > From: Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu>
> > Subject: Re: [Pw_forum] submits jobs to node already in use
> > To: jasius_1 at yahoo.com, "PWSCF Forum" <pw_forum at pwscf.org>
> > Date: Tuesday, May 5, 2009, 8:29 PM
> > On Tue, 2009-05-05 at 17:30 -0700, Jonas Baltrusaitis wrote:
> > > Funny story, if I submit any job with mpiexec -n 4 -machinefile
> > 
> > where is the machinefile??
> > 
> > >  /../pw.x -rmpool 0 -nodes 1 -procs 4 via PBS to a 4 core processor I
> > 
> > this must be a specialty of how your machine is set up. if you submit
> > to a PBS queue, you don't use mpiexec, but rather use qsub.
> > mpiexec (or mpirun) is being called from within the job script
> > that you submit to qsub.
> > 
> > >  can't see the exact number of cores running, e.g. with qstat -n I see
> > >  only    compute-0-1/0, instead of usual compute-0-1/0, compute-0-1/1,
> > >  compute-0-1/2, compute-0-1/4.
> > 
> > again. this has nothing at all to do with Q-E. there is no code in
> > Q-E that knows anything about any batch system (and it should not).
> > 
> > > It gets even worse if I want to submit any other pwscf job to my 16
> > >  node cluster. It get's submitted to exactly the same node, e.g. I run
> > >  8 processes on a 4 core node and obviously no performance
> > > 
> > > I inquired at Rocks forum, they claim it must be related with how Q-E
> > >  submits jobs. Has anybody seen that before?
> > 
> > that is wrong advice.
> > 
> > cheers,
> >    axel.
> > 
> > 
> > > 
> > > Jonas Baltrusaitis
> > > University of Iowa

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.



