[Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
Axel Kohlmeyer
akohlmey at gmail.com
Sun Jun 22 20:02:20 CEST 2014
On Sun, Jun 22, 2014 at 3:12 AM, Reza Behjatmanesh-Ardakani
<reza_b_m_a at yahoo.com> wrote:
> Dear Axel
> Thank you. It was very helpful for me.
> As you said, some new GTX cards have good DP floating-point
> performance, such as the GTX Ti Black or GTX Ti Z, for both of which
> DP is 1/3 of SP.
> They are much cheaper than Tesla cards.
> I am not sure that Ti Black or Ti Z has ECC.
no, they don't.
> Quadro K6000 has it.
well, the quadro is practically a tesla with all graphics features
enabled. ...at a price.
> Thanks again.
>
> With the Best Regards
>
> Reza Behjatmanesh-Ardakani
> Associate Professor of Physical Chemistry
> Address:
> Department of Chemistry,
> School of Science,
> Payame Noor University (PNU),
> Ardakan,
> Yazd,
> Iran.
> E-mails:
> 1- reza_b_m_a at yahoo.com (preferred),
> 2- behjatmanesh at pnu.ac.ir,
> 3- reza.b.m.a at gmail.com.
>
> --------------------------------------------
> On Sat, 6/21/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
>
> Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
> To: "PWSCF Forum" <pw_forum at pwscf.org>
> Date: Saturday, June 21, 2014, 1:50 PM
>
> On Sat, Jun 21, 2014 at 4:20 AM, Reza Behjatmanesh-Ardakani
> <reza_b_m_a at yahoo.com> wrote:
> > Dear Axel
> > This was just a proposal. If I am right, the Terachem code can use
> > gaming cards for GPU calculations (I saw this in some of its
> > authors' papers).
>
> yes, but terachem was written from the ground up with new algorithms
> to avoid loss of precision. in quantum mechanics this is important,
> since a lot of calculations depend on comparing large numbers of
> equal sign and magnitude and looking at the difference. about the
> only part of a plane-wave DFT calculation that is "conservative" in
> terms of precision without a massive redesign is the FFTs. the loss
> of precision is fairly small when replacing double-precision FFTs
> with single-precision ones. for the many 3d-FFTs required, this is
> particularly beneficial when trying to scale out via MPI, as it
> halves the number of bytes that need to be sent and copied around
> and also reduces the strain on memory bandwidth.
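The FFT precision point is easy to check with a small NumPy sketch (a toy illustration, not QE code; truncating the spectrum to complex64 merely emulates a single-precision transform, since NumPy's own FFT computes in double precision):

```python
# Toy illustration (not QE code): compare the round-trip error of a
# 3-D FFT when the spectrum is kept in double precision vs. truncated
# to single precision. The single-precision error is larger, but still
# small relative to the O(1) data -- which is why FFTs tolerate reduced
# precision fairly well.
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((32, 32, 32))

spec64 = np.fft.fftn(x)                 # double-precision spectrum
spec32 = spec64.astype(np.complex64)    # single precision: half the bytes to ship over MPI

err64 = np.max(np.abs(np.fft.ifftn(spec64).real - x))
err32 = np.max(np.abs(np.fft.ifftn(spec32).real - x))
print(f"double: {err64:.1e}  single: {err32:.1e}")
```

The complex64 copy is also half the size, which is the bandwidth argument: the same 3d-FFT data costs half as many bytes to transpose across MPI ranks.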
>
> > As you know, the main problem of GTX cards comes down to two
> > important things: one, single precision, and the other, lack of ECC.
>
> ECC is a lesser issue. and it is not a problem of single precision,
> but a lack of performance with double precision, due to having only
> a fraction of the double-precision units. another issue is the lack
> of RAM. also, you have to distinguish between different GTX cards.
> a few of the most high-end consumer cards *do* have the full set of
> double-precision units and a large amount of RAM.
>
> ECC is mostly relevant for people running a large number of
> GPUs in a
> supercomputer environment.
>
> >
> > It is not necessary to write a stand-alone code. We can test QE-GPU
> > with both Tesla and/or GTX cards and QE (CPU only), and compare the
> > outputs.
>
> but it is pointless to run on hardware that is not competitive.
> you'll already have a hard time getting a 2x speedup from a top-level
> tesla card vs. an all-CPU run on a decent machine. what would be the
> point of having the GPU _decelerate_ your calculation?
>
> in general, a lot of the GPU stuff is hype and misinformation. the
> following is a bit old, but still worth a read:
> http://www.hpcwire.com/2011/12/13/ten_ways_to_fool_the_masses_when_giving_performance_results_on_gpus/
>
> as a consequence of a very smart and successful PR strategy, there
> is now the impression that *any* kind of GPU will result in a
> *massive* speedup. even people with a laptop GPU with 2 SMs and no
> memory bandwidth are now expecting 100x speedups and more. however,
> except for a few corner cases and applications that map very well
> onto GPUs (not very complex) and badly onto a CPU, you will often
> get more like a 2x-5x speedup in a "best effort" comparison of a
> well-equipped host with a high-end GPU. in part, this situation has
> become worse with some choices made by nvidia hardware and software
> engineers. while 5 years back the difference between a consumer and
> a computing GPU was small, the consumer models have since been
> systematically "downgraded" (by removing previously supported
> management features from the driver and basing consumer cards on a
> simplified design that makes them mostly mid-level GPUs).
>
> > I tested it for only one case (a rutile 3*3*2 supercell), and saw
> > that the GTX output is similar to the CPU one.
> >
> > However, it needs to be tested for different cases and different
> > clusters to be sure that the lack of ECC and double precision has
> > no effect on the results.
>
> sorry, this statement doesn't make any sense. it looks to me like
> you need to spend some time learning what the technical implications
> of ECC and single-vs-double precision are (and the fact that it is
> the software that chooses which precision is used, not the hardware).
>
> a card either has ECC or it doesn't. broken memory is broken memory,
> and if it works, it works. so there is not much to test. if you want
> to find out whether your GPU has broken or borderline memory, run
> the GPU memtest. it is much more effective at finding issues than
> any other application.
>
> where ECC helps is for very long-running calculations or
> calculations across a very large number of GPUs, when a single
> bitflip can render the entire effort useless and result in a crash.
> in a dense cluster environment or badly cooled desktops, this is a
> high risk. in a well set-up machine it is less of a risk, but you
> have to keep in mind that running without ECC makes you "blind" to
> those errors. i run a cluster with a pile of Tesla GPUs and we have
> disabled ECC, since the machines run very reliably thanks to some
> hacking around restrictions that nvidia engineers placed in their
> drivers.
> https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness
>
> we also run consumer-level GPUs, particularly in the login nodes,
> since they work fine for development and don't cost as outrageously
> much as the tesla models. however, for development, absolute
> performance is a lesser concern.
>
> > As Filippo said formerly for GTX cards, the output may not be
> > reproducible. However, I think that due to the nature of the SCF
> > algorithm, the code can be used at least
>
> when you have memory corruption due to bad or overheated memory, no
> SCF algorithm will save you. if you go back 10 years, when CPUs
> didn't have all that power management and automatic self-protection
> and memory modules in desktops were often of very low quality,
> people experienced a lot of problems. "signal 11" and "segmentation
> fault" were a common topic on many mailing lists for scientific (or
> other) software that caused a high CPU load.
>
> but the indication of broken memory was usually a crash due to a
> segfault, or bad data corruption leading to a massive change in the
> numbers, and often NaNs. once you have a single NaN in your data, it
> will spread like a highly infectious virus and render the
> calculation invalid.
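That spreading behavior can be demonstrated in a few lines (a toy sketch, not from any real code): one corrupted element poisons every reduction and matrix operation that touches it.

```python
# Toy sketch: a single NaN, as produced e.g. by a bad bit in GPU RAM,
# propagates through sums and matrix products until the whole result
# is invalid.
import numpy as np

forces = np.ones((1000, 3))
forces[123, 1] = np.nan             # one corrupted element

print(forces.sum(axis=0))           # the y-component of the total is now nan
print(np.isnan(forces @ forces.T).any())   # and a matmul spreads it further: True
```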
>
> a well set-up consumer-level GPU will run as reliably as a tesla or
> better, only you cannot tell, since the nvidia tools will not show
> you. the main issues are performance and available memory.
>
> > for VC-RELAX, RELAX, and SCF types of calculations with GTX cards.
> > Of course, it should be tested. Thank you for your interest.
>
> you are not making much sense here either. but if it makes you feel
> better to do those tests, don't let me discourage you. sometimes
> people learn best this way.
>
> axel.
>
>
> > With the Best Regards
> >
> > Reza Behjatmanesh-Ardakani
> > Associate Professor of Physical Chemistry
> >
> > --------------------------------------------
> > On Fri, 6/20/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
> >
> > Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
> > To: "PWSCF Forum" <pw_forum at pwscf.org>
> > Date: Friday, June 20, 2014, 2:19 PM
> >
> > On Fri, Jun 20, 2014 at 4:22 AM, Reza Behjatmanesh-Ardakani
> > <reza_b_m_a at yahoo.com> wrote:
> > > Dear Filippo
> > >
> > > Due to the nature of QE, which is iterative, I think the lack of
> > > ECC and even of double-precision floating point in gaming cards
> > > (GTX) compared to tesla cards is not a serious problem for QE-GPU.
> > > Some authors have checked this for the AMBER molecular dynamics
> > > simulation code. See the following site:
> >
> > classical MD is a very different animal from what you do with QE.
> > errors due to single precision are huge for some properties in
> > classical MD with all-single-precision calculations. computing a
> > force from a distance will not be much affected, but summing up
> > the forces can already be a problem. "good" classical MD codes
> > usually employ a mixed-precision approach, where only the
> > accuracy-insensitive parts are done in single precision. for very
> > large systems, even double precision can show significant
> > floating-point truncation errors. usually you are dependent on
> > error cancellation, too, i.e. when you study a simple homogeneous
> > system (as is quite common in those tests).
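The force-summation problem is easy to see in a toy sketch (not from AMBER or any MD code): a small contribution added to a large running sum is simply absorbed in single precision, while a double-precision accumulator retains it.

```python
# Toy sketch of why mixed-precision MD codes keep reductions in double
# precision: in float32, adding 1.0 to 1.0e8 is a no-op, because the
# spacing between adjacent float32 values at 1e8 is 8.
import numpy as np

f32 = np.float32
lost = f32(1.0e8) + f32(1.0) - f32(1.0e8)   # the 1.0 is absorbed: 0.0
kept = 1.0e8 + 1.0 - 1.0e8                  # float64 keeps it: 1.0
print(lost, kept)   # 0.0 1.0
```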
> >
> >
> > >
> > > http://www.hpcwire.com/2014/03/13/ecc-performance-price-worth-gpus
> > >
> > >
> > > and see the following paper:
> > >
> > >
> > >
> > > www.rosswalker.co.uk/papers/2014_03_ECC_AMBER_Paper_10.1002_cpe.3232.pdf
> > >
> > >
> > >
> > > I encourage the users of QE-GPU to test it for QE, and report
> > > the difference on the site.
> >
> > it is a waste of time and effort. people have done DFT and HF in
> > (partial) single precision before, and you will only succeed if
> > you write a new code from scratch and have an extremely skilled
> > programmer. have a look at the terachem software out of the group
> > of todd martinez, for example.
> >
> > axel.
> >
> > > PS: to be able to compare the results for GTX and Tesla, the
> > > QE-GPU code needs to run on GTX :-)
> > >
> > >
> > > With the Best Regards
> > >
> > > Reza Behjatmanesh-Ardakani
> > >
> > > _______________________________________________
> > > Pw_forum mailing list
> > > Pw_forum at pwscf.org
> > > http://pwscf.org/mailman/listinfo/pw_forum
--
Dr. Axel Kohlmeyer akohlmey at gmail.com http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.