[Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)

Axel Kohlmeyer akohlmey at gmail.com
Sat Jun 21 11:20:02 CEST 2014


On Sat, Jun 21, 2014 at 4:20 AM, Reza Behjatmanesh-Ardakani
<reza_b_m_a at yahoo.com> wrote:
> Dear Axel
> This was just a proposal. If I am right, Terachem code can use gaming cards for GPU calculations (I saw some of its authors' papers).

yes, but terachem was written from the ground up with new algorithms to
avoid loss of precision. in quantum mechanics this is important, since
a lot of calculations depend on taking the difference of large numbers
of equal sign and similar magnitude. about the only part of a plane
wave DFT calculation that is "tolerant" in terms of precision without a
massive redesign are the FFTs: the loss of precision is fairly small
when replacing double precision FFTs with single precision ones. for
the many 3d-FFTs required, this is particularly beneficial when trying
to scale out via MPI, as it cuts in half the number of bytes that have
to be sent and copied around and also reduces the strain on memory
bandwidth.
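
to make both points concrete, here is a tiny numpy sketch (nothing to do
with the QE sources, just an illustration of the two effects; numpy's own
FFT always computes in double precision, so the single precision transform
is emulated by rounding the spectrum to complex64):

import numpy as np

# (1) the difference of two large numbers of equal sign survives in
#     double precision but is lost entirely in single precision.
a, b = 1.0e8 + 1.0e-2, 1.0e8
print(a - b)                              # ~0.01 in double precision
print(np.float32(a) - np.float32(b))      # 0.0: the difference is gone

# (2) a 3d-FFT round trip loses comparatively little accuracy when the
#     transform is stored in single precision.
rho = np.random.default_rng(0).standard_normal((64, 64, 64))
spec = np.fft.fftn(rho)
err64 = np.abs(np.fft.ifftn(spec) - rho).max()
err32 = np.abs(np.fft.ifftn(spec.astype(np.complex64)) - rho).max()
print(err64, err32)   # ~1e-15 vs. ~1e-7, both tiny relative to rho ~ O(1)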

> As you know, the main problem of GTX cards comes back to two important things. One, single precision, and the other lack of ECC.

ECC is a lesser issue. and the problem is not single precision as
such, but poor performance with double precision, since consumer cards
have only a fraction of the double precision units. another issue is
the smaller amount of RAM. also, you have to distinguish between
different GTX cards: a few of the most high-end consumer cards *do*
have the full set of double precision units and a large amount of RAM.

ECC is mostly relevant for people running a large number of GPUs in a
supercomputer environment.

>
> It is not necessary to write a stand alone code. We can test the QE-GPU with both TESLA and/or GTX and QE (cpu only), and compare the outputs.

but it is pointless to run on hardware that is not competitive. you
will already have a hard time getting a 2x speedup from a top level
tesla card vs. an all-CPU run on a decent machine. what would be the
point of having the GPU _decelerate_ your calculation?

in general, a lot of the GPU stuff is hype and misinformation. the
following is a bit old, but still worth a read:
http://www.hpcwire.com/2011/12/13/ten_ways_to_fool_the_masses_when_giving_performance_results_on_gpus/

as a consequence of a very smart and successful PR strategy, there is
now the impression that *any* kind of GPU will result in a *massive*
speedup. even people with a laptop GPU that has 2 SMs and next to no
memory bandwidth are now expecting 100x speedups and more. however,
except for a few corner cases and applications that map very well onto
GPUs (not very complex) and badly onto a CPU, you will more often get
something like a 2x-5x speedup in a "best effort" comparison of a well
equipped host with a high-end GPU. in part, this situation has become
worse through choices made by nvidia hardware and software engineers:
while 5 years back the difference between a consumer and a compute GPU
was small, the consumer models have since been systematically
"downgraded" (by removing previously supported management features
from the driver and basing consumer cards on a simplified design that
makes most of them mid-level GPUs).

> I tested it for only one case (rutile 3*3*2 supercell), and saw that the GTX output is similar to the CPU one.
>
> However, It is needed to test for different cases and different clusters to be sure that the lack of ECC and double precision has no effect on results.

sorry, this statement doesn't make any sense. it looks to me like you
need to spend some time learning what the technical implications of
ECC and single-vs-double precision are (and the fact that it is the
software that chooses which precision is used, not the hardware).

whether a card has ECC or not doesn't change the basic fact: broken
memory is broken memory, and if it works, it works. so there is not
much to test. if you want to find out whether your GPU has broken or
borderline memory, run the GPU memtest; it is much more effective at
finding such issues than any application.

where ECC helps is with very long running calculations or calculations
across a very large number of GPUs, where a single bitflip can render
the entire effort useless or result in a crash. in a dense cluster
environment or in badly cooled desktops, this is a high risk. in a
well set up machine it is less of a risk, but you have to keep in mind
that running without ECC makes you "blind" to those errors. i run a
cluster with a pile of Tesla GPUs and we have disabled ECC, since the
machines run very reliably thanks to some hacking around restrictions
that nvidia engineers placed in their drivers:
https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness

we also run consumer level GPUs, particularly in the login nodes,
since they work fine for development and don't cost the outrageous
amounts that the tesla models do. for development, absolute
performance is a lesser concern anyway.

> As Filippo said formerly for GTX cards, the output may be not reproducible. However, I think due to the nature of SCF algorithm, the code can be used at least

when you have memory corruption due to bad or overheated memory, no
SCF algorithm will save you. go back 10 years, when CPUs didn't have
all those power management and automatic self-protection features and
memory modules in desktops were often of very low quality, and people
experienced a lot of problems: "signal 11" and "segmentation fault"
were a common topic on many mailing lists for scientific (and other)
software that generates a high CPU load.

but the indication of broken memory was usually a crash due to a
segfault, or bad data corruption leading to massive changes in the
numbers and often to NaNs. once you have a single NaN in your data, it
will spread like a highly infectious virus and render the calculation
invalid.
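
for illustration, a small numpy sketch of that behavior (nothing QE
specific; it just shows how a single corrupted value poisons everything
downstream):

import numpy as np

rho = np.ones(8)
rho[3] = np.nan              # a single corrupted grid point
print(rho.sum())             # nan: any reduction over the data is nan
print(np.fft.fft(rho)[:3])   # all nan: the FFT mixes every input point
                             # into every output coefficient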

a well set up consumer level GPU will run as reliably as a tesla or
better, only you cannot tell, since the nvidia tools will not show
you. the main issues are performance and available memory.

> for VC-RELAX, RELAX, and SCF types of calculations with GTX cards. Of course, it should be tested. Thank you for your interest.

you are not making much sense here either. but if it makes you feel
better to do those tests, don't let me discourage you. sometimes
people learn best this way.

axel.


> With the Best Regards
>
>    Reza Behjatmanesh-Ardakani
>    Associate Professor of Physical Chemistry
>    Address:
>    Department of Chemistry,
>    School of Science,
>    Payame Noor University (PNU),
>    Ardakan,
>    Yazd,
>    Iran.
>    E-mails:
>           1- reza_b_m_a at yahoo.com (preferred),
>           2- behjatmanesh at pnu.ac.ir,
>           3- reza.b.m.a at gmail.com.
>
> --------------------------------------------
> On Fri, 6/20/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
>
>  Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
>  To: "PWSCF Forum" <pw_forum at pwscf.org>
>  Date: Friday, June 20, 2014, 2:19 PM
>
>  On Fri, Jun 20, 2014 at 4:22 AM, Reza Behjatmanesh-Ardakani
>  <reza_b_m_a at yahoo.com> wrote:
>  > Dear Filippo
>  >
>  > Due to the nature of QE which is iterative, I think lack of ECC and
>  > even double precision floating point in gaming cards (GTX) comparing
>  > to tesla cards is not serious problem for QE-GPU. Some authors have
>  > checked this for AMBER molecular dynamics simulation code. See
>  > following site:
>
>  classical MD is a very different animal from what you do with QE.
>  errors due to single precision in some properties in classical MD are
>  huge with all-single-precision calculations. computing a force from a
>  distance will not be much affected, but summing up the forces can
>  already be a problem. "good" classical MD codes usually employ a mixed
>  precision approach, where only the accuracy insensitive parts are done
>  in single precision. for very large systems, even double precision can
>  show significant floating point truncation errors. usually you depend
>  on error cancellation, too, i.e. when you study a simple homogeneous
>  system (as is quite common in those tests).
>
>
>  >
>  > http://www.hpcwire.com/2014/03/13/ecc-performance-price-worth-gpus
>  >
>  >
>  > and see the following paper:
>  >
>  >
>  >
>  www.rosswalker.co.uk/papers/2014_03_ECC_AMBER_Paper_10.1002_cpe.3232.pdf
>  >
>  >
>  >
>  > I encourage the users of QE-GPU to test it for QE, and
>  report the difference on the site.
>
>  it is a waste of time and effort. people have done DFT and HF in
>  (partial) single precision before, and only if you write a new code
>  from scratch and have an extremely skilled programmer will you
>  succeed. have a look at the terachem software out of the group of
>  todd martinez, for example.
>
>  axel.
>
>  > PS: to be able to test the results for GTX and TESLA,
>  it is needed QE-GPU code to be run on GTX :-)



-- 
Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.


