[Pw_forum] why my pw.x run with low efficiency?

vega vegalew at hotmail.com
Mon Sep 22 05:31:52 CEST 2008


Dear sir,

Thank you so much for your response. I really appreciate your help.

> than the wall time. here it looks as if the job is either
> swapping like crazy or the communication is stalling. on that

I believe it. So do you think 10G InfiniBand is good enough for my job?
By the way, there are also two other parallel jobs running on the cluster: one is
LAMMPS, a classical MD code, the other is VASP. Together they are using
32 CPUs. My job is using 39 nodes with two CPUs per node, i.e. 78 MPI tasks.

>vega mentioned that openmpi didn't work because
> of "lack of memory". i suspect that this is due to incorrect
> setup of the infiniband fabric and user limits. ulimit -a
> should produce on the compute nodes something like this.
> particularly the "max locked memory" entry is very important,
> and not setting it high enough will result in severely
> degraded performance (cf. ofed documentation).

I really admire your experience and intuition about parallel jobs.
The max locked memory limit on my machines is only 4 kB. I have asked my
system administrator to raise it to 1024000 kB. My physical memory
is 2 GB. Do you think setting it to 2048000 would be better?
I'll also try OpenMPI again later, using the flag --mca btl_openib_use_srq 1
with mpirun, or setting it in etc/openmpi-mca-params.conf as you suggested.
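
For reference, this is roughly what I will ask him to put into
/etc/security/limits.conf on the compute nodes, and how I plan to check it
and launch the job afterwards (only a sketch; the file locations, values and
input/output names are my assumptions, not something already tested here):

# /etc/security/limits.conf on every compute node (values in kB)
*    soft    memlock    1024000
*    hard    memlock    1024000

# after logging in to a compute node again, verify the new limit
ulimit -l

# then launch pw.x through OpenMPI with SRQ enabled, e.g. 39 nodes x 2 CPUs
mpirun -np 78 --mca btl_openib_use_srq 1 pw.x < pw.in > pw.out

# alternatively, put this line into etc/openmpi-mca-params.conf
# inside the OpenMPI installation directory:
#   btl_openib_use_srq = 1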

By the way, I'm using MPICH2 now, not MPICH.

> to see it). also one can see how much swapping is going on by
> simply logging into the node and monitoring whether kswapd
> processes actually consume a lot of time.

I checked it as you suggested. kswapd did not consume a lot of time,
so I think the problem was probably caused by the network.
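
In case it helps, this is roughly how I checked it on one of the busy compute
nodes (just my way of looking at it, using the standard procps tools):

# cumulative CPU time used by the kernel swap daemon(s); it stayed near zero here
ps -eo comm,time | grep kswapd

# in top, press 'f' and enable the DATA field, then compare DATA (data set size)
# with RES (resident set size) for each pw.x process, as you described
top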

> in the (somewhat unlikely) case that one is running out of
> memory, it may help to not use all processors per node. i don't

Your words remind me of a previous job: when I ran it with one thread per core,
the job failed, but when I reduced the number of threads it finished smoothly.
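
Next time I will also try leaving one CPU per node idle, for example like this
(assuming OpenMPI understands the -npernode option on our installation;
otherwise a hostfile with one slot per node should do the same):

# one MPI task on each of the 39 nodes instead of two
mpirun -np 39 -npernode 1 --mca btl_openib_use_srq 1 pw.x < pw.in > pw.out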

> finally, it may just be that the input at hand doesn't scale
> to that many cpus, so in case of seeing performance problems
> it always pays off to check whether running with less processors
> or nodes is actually faster. bigger is not always better!!

> limits of QE in both cases. i suggest to look up that e-mail
> from the mailing list archives and do a similar assessment
> with special consideration of the additional concerns listed
> in this e-mail.
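
I will look up that e-mail and repeat the assessment on my own input. My plan
is to time the same run with different task counts, with and without k-point
pools, roughly like this (only a sketch; the pool sizes and file names are
placeholders, and the number of pools has to divide the number of tasks):

mpirun -np 16 pw.x          < pw.in > pw.np16.out
mpirun -np 16 pw.x -npool 4 < pw.in > pw.np16.pool4.out
mpirun -np 32 pw.x -npool 4 < pw.in > pw.np32.pool4.out
# then compare the PWSCF CPU and wall times printed at the end of each output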

Thank you so much for sharing your knowledge and experience with me.

best wishes,

vega

=================================================================================
Vega Lew (weijia liu)
PH.D Candidate in Chemical Engineering
State Key Laboratory of Materials-oriented Chemical Engineering
College of Chemistry and Chemical Engineering
Nanjing University of Technology, 210009, Nanjing, Jiangsu, China

--------------------------------------------------
From: "Axel Kohlmeyer" <akohlmey at cmm.chem.upenn.edu>
Sent: Monday, September 22, 2008 6:08 AM
To: "Paolo Giannozzi" <giannozz at democritos.it>
Cc: "PWSCF Forum" <pw_forum at pwscf.org>
Subject: Re: [Pw_forum] why my pw.x run with low efficiency?

> On Sun, 21 Sep 2008, Paolo Giannozzi wrote:
>
> PG>
> PG> On Sep 19, 2008, at 18:54 , vega wrote:
> PG>
> PG> >      PWSCF        :     0d   14h46m CPU time,        2d   18h 4m
> PG> > wall time
> PG>
> PG> I am quite sure Axel Kohlmeyer has already answered 2n+1 times
> PG> to the same or similar question. Please look in particular for
> PG> OMP_NUM_THREADS in the archives of the mailing list
>
> sorry paolo,
> but that is only the case when the CPU time is much _higher_
> than the wall time. here it looks as if the job is either
> swapping like crazy or the communication is stalling. on that
> note, it would be nice to also see the wall time spent in the
> individual routines, as the cpu time is usually a somewhat
> inadequate descriptor, except for serial calculations.
>
> i'd have commented on the discussion earlier on, as several
> assessments are unsubstantiated or don't make much sense at all.
>
> but the only way to find out for sure what is happening,
> would be to run that very same job on a machine where i
> know for certain that the hardware and software is set up
> correctly and QE is compiled in the best possible way.
> this however is very unpractical from addis ababa where
> reasonable internet access is only available intermittently
> and at very high costs.
>
> thus only a bunch of questions regarding observations and
> explanations. vega mentioned that openmpi didn't work because
> of "lack of memory". i suspect that this is due to incorrect
> setup of the infiniband fabric and user limits. ulimit -a
> should produce on the compute nodes something like this.
> particularly the "max locked memory" entry is very important,
> and not setting it high enough will result in severely
> degraded performance (cf. ofed documentation).
>
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> max nice                        (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 71680
> max locked memory       (kbytes, -l) 4096000
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> max rt priority                 (-r) 0
> stack size              (kbytes, -s) 1024000
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 71680
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> another option to enhance OpenMPI performance for QE over
> infiniband, while reducing its memory requirements is to
> use the flag: --mca btl_openib_use_srq 1 with mpirun or
> set it for all jobs in the etc/openmpi-mca-params.conf
> within the openmpi installation directory.
>
> it was then mentioned that the job was calculated using MPICH
> instead, which brings up the question: does this MPICH compile
> actually support infiniband? as far as i know, it does not. the usual
> OFED distributions contain either OpenMPI or MVAPICH.
>
> the next step to evaluate is whether the job actually does
> scale that far, and if memory requirements are really killing
> the performance. the reported virtual memory size is a bad
> indicator, as it reports the reserved address space, but does
> not say anything about how much of it is actually used. with
> openmpi over infiniband, this tends to always be ridiculously
> high, but that has no impact on the performance (symptom: large
> usage of SWAP for the process without swap actually being used).
> the real information about lack of available physical memory
> would come from comparing the data set size with the resident
> set size (you have to change the default configuration of top
> to see it). also one can see how much swapping is going on by
> simply logging into the node and monitoring whether kswapd
> processes actually consume a lot of time.
>
> in the (somewhat unlikely) case that one is running out of
> memory, it may help to not use all processors per node. i don't
> remember what the hardware exactly was, but, e.g., on nodes with
> 2x intel quad core processors, the (absolute!) performance and
> scaling is much better when running with half the available cpu
> cores. i have posted some benchmark graphs on cp2k here:
>
> http://cp2k.googlegroups.com/web/cp2k-water-bench-cmm.png
>
> but the outcome is similar for other DFT codes (CPMD, Q-E).
> using less MPI tasks should reduce the total memory requirements
> a bit, since not all data in QE can be or is distributed amongst
> the nodes.
>
> finally, it may just be that the input at hand doesn't scale
> to that many cpus, so in case of seeing performance problems
> it always pays off to check whether running with less processors
> or nodes is actually faster. bigger is not always better!!
>
> in summary, giving a recommendation on what is causing the
> degraded performance, is impossible without knowing more details
> and particularly a more systematic performance analysis.
> i've given already an example of how to do this a while ago,
> comparing scaling behavior with using -npools and without and
> comparing gigabit to myrinet and demonstrating the scaling
> limits of QE in both cases. i suggest to look up that e-mail
> from the mailing list archives and do a similar assessment
> with special consideration of the additional concerns listed
> in this e-mail.
>
> cheers,
>    axel.
>
> PG>
> PG> Paolo
> PG> ---
> PG> Paolo Giannozzi, Dept of Physics, University of Udine
> PG> via delle Scienze 208, 33100 Udine, Italy
> PG> Phone +39-0432-558216, fax +39-0432-558222
> PG>
> PG>
> PG>
> PG> _______________________________________________
> PG> Pw_forum mailing list
> PG> Pw_forum at pwscf.org
> PG> http://www.democritos.it/mailman/listinfo/pw_forum
> PG>
>
> -- 
> =======================================================================
> Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
>   Center for Molecular Modeling   --   University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum
> 


