[Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
Reza Behjatmanesh-Ardakani
reza_b_m_a at yahoo.com
Sun Jun 22 09:12:45 CEST 2014
Dear Axel
Thank you, that was very helpful.
As you said, some new GTX cards, such as the GTX Ti Black or GTX Ti Z, have good double-precision (DP) floating-point performance: for both, DP throughput is 1/3 of SP.
They are much cheaper than Tesla cards.
I am not sure whether the Ti Black or Ti Z has ECC; the Quadro K6000 does.
Thanks again.
With the Best Regards
Reza Behjatmanesh-Ardakani
Associate Professor of Physical Chemistry
Address:
Department of Chemistry,
School of Science,
Payame Noor University (PNU),
Ardakan,
Yazd,
Iran.
E-mails:
1- reza_b_m_a at yahoo.com (preferred),
2- behjatmanesh at pnu.ac.ir,
3- reza.b.m.a at gmail.com.
--------------------------------------------
On Sat, 6/21/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
To: "PWSCF Forum" <pw_forum at pwscf.org>
Date: Saturday, June 21, 2014, 1:50 PM
On Sat, Jun 21, 2014 at 4:20 AM, Reza Behjatmanesh-Ardakani <reza_b_m_a at yahoo.com> wrote:
> Dear Axel
> This was just a proposal. If I am right, the Terachem code can use gaming cards for GPU calculations (I have seen this in some of its authors' papers).
yes, but terachem was written from the ground up with new algorithms to avoid loss of precision. in quantum mechanics this is important, since a lot of calculations depend on comparing large numbers of equal sign and magnitude and looking at the difference. about the only part of a plane-wave DFT calculation that is "conservative" in terms of precision without a massive redesign is the FFTs: the loss of precision is fairly small when replacing double-precision FFTs with single-precision ones. for the many 3d-FFTs required, this is particularly beneficial when trying to scale out via MPI, since it cuts in half the number of bytes that need to be sent and copied around and also reduces the strain on memory bandwidth.
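to get a feel for how small that loss is, here is a toy sketch: a hand-rolled radix-2 FFT in pure python, with float32 rounding simulated via the struct module. none of this is QE code, and the signal is made up, so the numbers are only illustrative:

```python
import cmath
import random
import struct

def to_f32(z: complex) -> complex:
    """Round a complex value to single precision (simulating float32 hardware)."""
    return complex(struct.unpack('f', struct.pack('f', z.real))[0],
                   struct.unpack('f', struct.pack('f', z.imag))[0])

def fft(a, single=False):
    """Recursive radix-2 Cooley-Tukey FFT; with single=True, every
    intermediate result is rounded to float32 precision."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], single)
    odd = fft(a[1::2], single)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        if single:
            t = to_f32(t)
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
        if single:
            out[k] = to_f32(out[k])
            out[k + n // 2] = to_f32(out[k + n // 2])
    return out

# a deterministic pseudo-random test signal (stand-in for a plane-wave grid)
random.seed(0)
x = [complex(random.gauss(0.0, 1.0), 0.0) for _ in range(1024)]

ref = fft(x)                                   # double precision
lo = fft([to_f32(v) for v in x], single=True)  # simulated single precision

num = sum(abs(a - b) ** 2 for a, b in zip(ref, lo)) ** 0.5
den = sum(abs(a) ** 2 for a in ref) ** 0.5
print(f"relative error of the single-precision FFT: {num / den:.1e}")
```

the relative error comes out near the float32 rounding level, i.e. tiny compared to typical SCF convergence thresholds, while the data is half the size.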
> As you know, the main problem with GTX cards comes down to two things: one is single precision, and the other is the lack of ECC.
ECC is the lesser issue. and it is not a problem of single precision, but of poor double-precision performance due to having only a fraction of the double-precision units. another issue is the lack of RAM. you also have to distinguish between different GTX cards: a few of the most high-end consumer cards *do* have the full set of double-precision units and a large amount of RAM.
ECC is mostly relevant for people running a large number of GPUs in a supercomputer environment.
>
> It is not necessary to write a stand-alone code. We can test QE-GPU with both Tesla and/or GTX cards against QE (CPU only), and compare the outputs.
but it is pointless to run on hardware that is not competitive. you will already have a hard time getting a 2x speedup from a top-level tesla card vs. an all-CPU run on a decent machine. what would be the point of having the GPU _decelerate_ your calculation?
in general, a lot of the GPU stuff is hype and misinformation. the following is a bit old, but still worth a read:
http://www.hpcwire.com/2011/12/13/ten_ways_to_fool_the_masses_when_giving_performance_results_on_gpus/
as a consequence of a very smart and successful PR strategy, there is now the impression that *any* kind of GPU will result in a *massive* speedup. even people with a laptop GPU with 2 SMs and no memory bandwidth are now expecting 100x speedups and more. however, except for a few corner cases and applications that map very well onto GPUs (not very complex) and badly onto a CPU, you will more often get a 2x-5x speedup in a "best effort" comparison of a well-equipped host with a high-end GPU. in part, this situation has become worse with some choices made by nvidia hardware and software engineers: while 5 years back the difference between a consumer and a computing GPU was small, the consumer models have since been systematically "downgraded" (by removing previously supported management features from the driver and basing consumer cards on a simplified design that makes them mostly mid-level GPUs).
> I tested it for only one case (a rutile 3*3*2 supercell) and saw that the GTX output is similar to the CPU one.
>
> However, it needs to be tested for different cases and on different clusters to be sure that the lack of ECC and double precision has no effect on the results.
sorry, this statement doesn't make any sense. it looks to me like you need to spend some time learning what the technical implications of ECC and single-vs-double precision are (and the fact that it is the software that chooses which precision is used, not the hardware), whether a card has ECC or not. broken memory is broken memory, and if it works, it works. so there is not much to test. if you want to find out whether your GPU has broken or borderline memory, run the GPU memtest; it is much more effective at finding issues than any other application.
where ECC helps is for very long-running calculations or calculations across a very large number of GPUs, where a single bitflip can render the entire effort useless and result in a crash. in a dense cluster environment or in badly cooled desktops this is a high risk; in a well set up machine it is less of a risk, but you have to keep in mind that running without ECC makes you "blind" to those errors. i run a cluster with a pile of Tesla GPUs and we have disabled ECC, since the machines run very reliably thanks to some hacking around restrictions that nvidia engineers placed in their drivers:
https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness
we also run consumer-level GPUs, particularly in the login nodes, since they work fine for development and don't cost as outrageously much as the tesla models. for development, however, absolute performance is a lesser concern.
> As Filippo said earlier for GTX cards, the output may not be reproducible. However, I think that due to the nature of the SCF algorithm, the code can be used at least
when you have memory corruption due to bad or overheated memory, no SCF algorithm will save you. if you go back 10 years, when CPUs didn't have all that power management and automatic self-protection and memory modules in desktops were often of very low quality, people experienced a lot of problems: "signal 11" and "segmentation fault" were a common topic on many mailing lists for scientific (or other) software that caused a high CPU load.
but the indication of broken memory was usually a crash due to a segfault, or bad data corruption leading to a massive change in the numbers and often to NaNs. once you have a single NaN in your data, it will spread like a highly infectious virus and render the calculation invalid.
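the "infectious virus" behavior is easy to demonstrate in a few lines of pure python; the normalization loop below is a made-up stand-in for any iterative update, not QE code:

```python
import math

# toy normalization step, a stand-in for any iterative update
values = [1.0, 2.0, 3.0, 4.0]
values[2] = float('nan')   # one corrupted entry, e.g. from a memory bitflip

total = sum(values)                   # any reduction touching a NaN is NaN
values = [v / total for v in values]  # ...and now every entry is NaN

print(all(math.isnan(v) for v in values))   # prints: True
```

after a single pass, the one bad value has poisoned the whole array; a few more iterations and there is nothing left to recover.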
a well set up consumer-level GPU will run as reliably as a tesla or better; you just cannot tell, since the nvidia tools will not show you. the main issue is performance and available memory.
> for VC-RELAX, RELAX, and SCF types of calculations with GTX cards. Of course, it should be tested. Thank you for your interest.
you are not making much sense here either. but if it makes you feel better to do those tests, don't let me discourage you. sometimes people learn best this way.
axel.
> --------------------------------------------
> On Fri, 6/20/14, Axel Kohlmeyer <akohlmey at gmail.com>
wrote:
>
> Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
> To: "PWSCF Forum" <pw_forum at pwscf.org>
> Date: Friday, June 20, 2014, 2:19 PM
>
> On Fri, Jun 20, 2014 at 4:22 AM, Reza Behjatmanesh-Ardakani <reza_b_m_a at yahoo.com> wrote:
> > Dear Filippo
> >
> > Due to the iterative nature of QE, I think the lack of ECC, and even of double-precision floating point, in gaming cards (GTX) compared to Tesla cards is not a serious problem for QE-GPU. Some authors have checked this for the AMBER molecular dynamics simulation code. See the following site:
>
> classical MD is a very different animal from what you do with QE. errors in some properties due to all-single-precision calculations are huge in classical MD. computing a force from a distance will not be much affected, but summing up the forces can already be a problem. "good" classical MD codes usually employ a mixed-precision approach, where only the accuracy-insensitive parts are done in single precision. for very large systems, even double precision can show significant floating-point truncation errors. usually you are dependent on error cancellation, too, i.e. when you study a simple homogeneous system (as is quite common in those tests).
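the force-summation problem is easy to reproduce in a few lines of pure python, with float32 rounding simulated via the struct module. the magnitudes below are invented for illustration and not taken from any MD code:

```python
import struct

def f32(x: float) -> float:
    """Round to the nearest float32, simulating single-precision hardware."""
    return struct.unpack('f', struct.pack('f', x))[0]

# one large contribution plus many small ones, the typical shape of a
# per-particle force sum (magnitudes are invented for illustration)
terms = [1.0e4] + [1.0e-4] * 100_000

acc64 = 0.0            # double precision: python floats are 64-bit
for t in terms:
    acc64 += t

acc32 = 0.0            # simulated single precision: round after every add
for t in terms:
    acc32 = f32(acc32 + t)

# exact answer is 10000 + 10 = 10010; in float32 every 1e-4 term falls
# below half an ulp of the running total and is swallowed entirely
print(f"double: {acc64:.3f}  single: {acc32:.3f}")
```

the double-precision sum gives 10010 as expected, while the simulated single-precision sum stays pinned at 10000: the small contributions vanish completely. this is why mixed-precision MD codes keep the accumulation in higher precision even when the per-pair arithmetic is single precision.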
>
>
> >
> > http://www.hpcwire.com/2014/03/13/ecc-performance-price-worth-gpus
> >
> > and see the following paper:
> >
> > www.rosswalker.co.uk/papers/2014_03_ECC_AMBER_Paper_10.1002_cpe.3232.pdf
> >
> > I encourage the users of QE-GPU to test it for QE and report the differences on the site.
>
> it is a waste of time and effort. people have done DFT and HF in (partial) single precision before, and only if you write a new code from scratch and have an extremely skilled programmer will you succeed. have a look at the terachem software out of the group of todd martinez, for example.
>
> axel.
>
> > PS: to be able to compare the results for GTX and Tesla, the QE-GPU code needs to run on GTX :-)
>
>
>
--
Dr. Axel Kohlmeyer akohlmey at gmail.com
http://goo.gl/1wk0
College of Science & Technology, Temple University,
Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste.
Italy.
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum