[Pw_forum] why my pw.x run with low efficiency?
Axel Kohlmeyer
akohlmey at cmm.chem.upenn.edu
Mon Sep 22 00:08:30 CEST 2008
On Sun, 21 Sep 2008, Paolo Giannozzi wrote:
PG>
PG> On Sep 19, 2008, at 18:54, vega wrote:
PG>
PG> > PWSCF : 0d 14h46m CPU time, 2d 18h 4m wall time
PG>
PG> I am quite sure Axel Kohlmeyer has already answered 2n+1 times
PG> to the same or similar question. Please look in particular for
PG> OMP_NUM_THREADS in the archives of the mailing list

sorry paolo,
but that is only the case when the CPU time is much _higher_
than the wall time. here it looks as if the job is either
swapping like crazy or the communication is stalling. on that
note, it would be nice to also see the wall time spent in the
individual routines, as the cpu time is usually a somewhat
inadequate descriptor, except for serial calculations.

i would have commented on this discussion earlier, since several
assessments are unsubstantiated or don't make much sense at all.
but the only way to find out for sure what is happening would be
to run that very same job on a machine where i know for certain
that the hardware and software are set up correctly and QE is
compiled in the best possible way. this, however, is very
impractical from addis ababa, where reasonable internet access is
only available intermittently and at very high cost.

thus only a bunch of questions regarding the observations and
explanations. vega mentioned that openmpi didn't work because
of "lack of memory". i suspect that this is due to an incorrect
setup of the infiniband fabric and of the user limits. ulimit -a
on the compute nodes should produce something like the listing
below. particularly the "max locked memory" entry is very
important, and not setting it high enough will result in severely
degraded performance (cf. the ofed documentation); a sketch of how
to raise that limit follows right after the listing.

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 71680
max locked memory       (kbytes, -l) 4096000
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
max rt priority                 (-r) 0
stack size              (kbytes, -s) 1024000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 71680
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
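
the usual way to raise the locked memory limit permanently on the
nodes (as far as i remember this is also what the ofed documentation
recommends) is an entry in /etc/security/limits.conf. the value below
is just the one from the listing above; adjust it to your hardware or
use "unlimited", and keep in mind that the batch system daemons have
to be restarted to pick up the new limit:

  *   soft   memlock   4096000
  *   hard   memlock   4096000
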
another option to enhance OpenMPI performance for QE over
infiniband, while reducing its memory requirements, is to use the
flag --mca btl_openib_use_srq 1 with mpirun, or to set it for all
jobs in etc/openmpi-mca-params.conf within the openmpi
installation directory.
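
for example (core count and file names below are just placeholders,
only the flag itself is taken from above):

  mpirun -np 32 --mca btl_openib_use_srq 1 pw.x < scf.in > scf.out

or, as a system-wide default, a single line in etc/openmpi-mca-params.conf:

  btl_openib_use_srq = 1
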
it was then mentioned that the job was run using MPICH instead,
which raises the question of whether this MPICH build actually
supports infiniband. as far as i know, it does not. the usual
OFED distributions ship either OpenMPI or MVAPICH.
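
a (very rough) way to check this on the compute nodes, with the
caveat that a statically linked MPI library will not show up, is to
look at what the binary is linked against, e.g.:

  ldd `which pw.x` | grep -iE 'verbs|mvapich|mpich|openmpi'

if nothing related to libibverbs shows up, the job most likely fell
back to (slow) tcp/ip instead of using infiniband.
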
the next step is to evaluate whether the job actually does scale
that far, and whether memory requirements are really killing the
performance. the reported virtual memory size is a bad indicator,
as it reports the reserved address space but does not say anything
about how much of it is actually used. with openmpi over
infiniband this tends to be ridiculously high, but that has no
impact on the performance (symptom: large SWAP usage reported for
the process without swap actually being used). the real
information about a lack of available physical memory comes from
comparing the data set size with the resident set size (you have
to change the default configuration of top to see it). one can
also see how much swapping is going on by simply logging into the
node and monitoring whether the kswapd processes actually consume
a lot of time; a couple of commands for this are sketched below.
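
a quick way to check both, assuming the standard procps tools and
that the processes are called pw.x (adjust the name as needed):

  # swap activity: the si/so columns should stay (close to) zero
  vmstat 5
  # resident (rss) vs. reserved virtual (vsz) size of the pw.x processes
  ps -C pw.x -o pid,rss,vsz,pcpu,comm

if rss stays well below the physical memory of the node and si/so
remain zero, the scary virtual memory numbers are harmless.
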
in the (somewhat unlikely) case that one is running out of
memory, it may help not to use all processors per node. i don't
remember exactly what the hardware was, but, e.g., on nodes with
2x intel quad core processors, the (absolute!) performance and
scaling are much better when running with half the available cpu
cores. i have posted some benchmark graphs on cp2k here:
http://cp2k.googlegroups.com/web/cp2k-water-bench-cmm.png
but the outcome is similar for other DFT codes (CPMD, Q-E).
using fewer MPI tasks should reduce the total memory requirements
a bit, since not all data in QE can be or is distributed amongst
the nodes. one way to set this up is sketched below.
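
with openmpi this can be done, e.g., via a hostfile that advertises
only part of the cores on each node (node names and slot counts
below are made up for illustration; batch systems usually offer
their own way to request this):

  # file 'hosts': use only 4 of the 8 cores on each node
  node01 slots=4
  node02 slots=4

  mpirun -np 8 --hostfile hosts pw.x < scf.in > scf.out
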
finally, it may just be that the input at hand doesn't scale
to that many cpus, so whenever you see performance problems
it always pays off to check whether running with fewer processors
or nodes is actually faster. bigger is not always better!!

in summary, giving a recommendation on what is causing the
degraded performance is impossible without knowing more details,
and particularly without a more systematic performance analysis.
i have already given an example of how to do this a while ago,
comparing the scaling behavior with and without -npools, comparing
gigabit ethernet to myrinet, and demonstrating the scaling limits
of QE in both cases. i suggest looking up that e-mail in the
mailing list archives and doing a similar assessment, with special
consideration of the additional concerns listed in this e-mail;
a minimal version of such a test is sketched below.
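
as a rough sketch of such a scaling test (file names, core counts,
and the number of pools are placeholders; -npools only helps when
the input has more than one k-point):

  for np in 8 16 32 ; do
    mpirun -np $np pw.x           < scf.in > scf.$np.out
    mpirun -np $np pw.x -npools 4 < scf.in > scf.$np.npools4.out
  done
  grep 'wall time' scf.*.out
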
cheers,
axel.
PG>
PG> Paolo
PG> ---
PG> Paolo Giannozzi, Dept of Physics, University of Udine
PG> via delle Scienze 208, 33100 Udine, Italy
PG> Phone +39-0432-558216, fax +39-0432-558222
PG>
--
=======================================================================
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.