[Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)

Axel Kohlmeyer akohlmey at gmail.com
Mon Jun 23 18:17:34 CEST 2014


On Mon, Jun 23, 2014 at 11:59 AM, David Foster <davidfoster751 at yahoo.com> wrote:
> Dear Axel
> I have some questions on this topic, too. Suppose we use a GTX card with
> poor DP floating point performance for QE-GPU. Does the code in the GPU
> technology of NVIDIA (such as CUDA and ...) change the DP of QE to SP
> automatically, or does the program end with an error?

neither. if the CUDA (or OpenCL or OpenACC) kernel requests double
precision, it *will* use the available double precision units. only if
you compile for an extremely old architecture (compute capability 1.0
and older, IIRC) will the double precision instructions/data be
silently truncated to single precision.
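
for illustration, here is a minimal, self-contained CUDA sketch (not
taken from QE or QE-GPU; the kernel and all names in it are made up).
the kernel explicitly asks for double precision, and what the hardware
does with that is fixed when you compile for a given compute
capability: nvcc -arch=sm_13 or newer keeps the doubles, while
compiling for sm_10 demotes them to float, as mentioned above.

// per-thread partial dot products accumulated in double precision;
// the host does the final sum. nothing here switches to single
// precision behind your back -- the precision is whatever the kernel
// source asks for.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void partial_dot_dp(const double *a, const double *b,
                               double *partial, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    double sum = 0.0;                   // double precision accumulator
    for (int i = t; i < n; i += stride)
        sum += a[i] * b[i];
    partial[t] = sum;
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = 64, m = threads * blocks;
    double *ha = new double[n], *hb = new double[n], *hp = new double[m];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0e-8; hb[i] = 3.0; }

    double *da, *db, *dp;
    cudaMalloc(&da, n * sizeof(double));
    cudaMalloc(&db, n * sizeof(double));
    cudaMalloc(&dp, m * sizeof(double));
    cudaMemcpy(da, ha, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(double), cudaMemcpyHostToDevice);

    partial_dot_dp<<<blocks, threads>>>(da, db, dp, n);
    cudaMemcpy(hp, dp, m * sizeof(double), cudaMemcpyDeviceToHost);

    double total = 0.0;                 // final reduction on the host
    for (int i = 0; i < m; ++i) total += hp[i];
    printf("dot = %.15g\n", total);     // expect n * 3.0e-8

    cudaFree(da); cudaFree(db); cudaFree(dp);
    delete[] ha; delete[] hb; delete[] hp;
    return 0;
}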

i would suggest making a real-world test to confirm this, e.g. the
force kernels in the GPU package of the LAMMPS classical MD code can
be compiled in all-SP, mixed-precision, and all-DP modes, and then you
could run those executables on different hardware and see how the
relative performance and accuracy change.
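
as a toy-level illustration of what such a comparison probes (this is
not LAMMPS code, just a made-up, host-only example), the same
reduction done all-SP, mixed (SP terms, DP accumulator) and all-DP
shows how the all-SP result drifts once the running sum is much larger
than the individual contributions:

#include <cstdio>

int main()
{
    const int n = 10 * 1000 * 1000;     // ten million small contributions
    const float  term_s = 1.0e-4f;
    const double term_d = 1.0e-4;

    float  sum_sp = 0.0f;               // all single precision
    double sum_mp = 0.0;                // SP terms, DP accumulator ("mixed")
    double sum_dp = 0.0;                // all double precision
    for (int i = 0; i < n; ++i) {
        sum_sp += term_s;
        sum_mp += (double) term_s;
        sum_dp += term_d;
    }
    // the exact answer is 1000; the all-SP sum ends up visibly wrong,
    // while the mixed and all-DP sums agree with it to roughly
    // single/double precision.
    printf("all-SP: %.6f  mixed: %.6f  all-DP: %.6f\n",
           (double) sum_sp, sum_mp, sum_dp);
    return 0;
}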

> If it changes DP to SP automatically, users might get wrong results without noticing.
>
> How about an SP-only GPU card for a full-DP code? Does the code run on it?

all-SP GPUs are *very* old and i doubt that you can compile QE for
them. older GPU architectures require programming (complicated)
workarounds due to limitations of the architecture. since they are
also slower than newer cards, it is not worth implementing those.
current consumer cards *do* have double precision floating point
support, but usually at a DP:SP ratio of 1:10 rather than the 1:3
present in (kepler based) tesla cards (fermi even has 1:2).
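
a crude way to see that ratio on your own card is a little throughput
micro-benchmark along these lines (a rough sketch only, not a
calibrated benchmark; the kernel, block/thread counts and iteration
count are arbitrary choices of mine):

// run a chain of fused multiply-adds in the requested precision and
// keep the result alive by writing it out, so the compiler cannot
// remove the loop.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void fma_loop(T *out, int iters)
{
    T a = (T) 1.0 + (T) 1.0e-6 * (T) threadIdx.x;
    T b = (T) 1.0000001;
    T c = (T) 1.0e-7;
    for (int i = 0; i < iters; ++i)
        a = a * b + c;
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

template <typename T>
float time_kernel(int blocks, int threads, int iters)
{
    T *out;
    cudaMalloc(&out, (size_t) blocks * threads * sizeof(T));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    fma_loop<T><<<blocks, threads>>>(out, iters);   // warm-up run
    cudaEventRecord(t0);
    fma_loop<T><<<blocks, threads>>>(out, iters);   // timed run
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(out);
    return ms;
}

int main()
{
    const int blocks = 256, threads = 256, iters = 1 << 16;
    float ms_sp = time_kernel<float>(blocks, threads, iters);
    float ms_dp = time_kernel<double>(blocks, threads, iters);
    printf("SP: %.2f ms   DP: %.2f ms   DP/SP time ratio: %.1f\n",
           ms_sp, ms_dp, ms_dp / ms_sp);
    return 0;
}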

as has been explained many times before, GPU support for QE is a niche
solution. it helps to push calculations further, especially on
workstation-like machines and on large machines with GPUs, where being
able to utilize the GPUs is a prerequisite for getting access.
particularly with the recent improvements to CPUs, the increase in the
number of CPU cores, and the improvements in vector instructions, it
is currently often better to invest effort and money into a better
CPU-based solution than into including GPUs. learning how to
compile/test/use GPU acceleration would be most useful if you plan to
work on improving the GPU support and adding new features.

in general, the problem with GPUs is that they are not as general
purpose as people are made to believe. on the contrary, GPUs are very
extreme and thus require "extreme" programming that puts consideration
for the architecture first and the physics second (i.e. the opposite
of what is common in scientific computing). that doesn't mean that it
is useless, since most tricks and strategies that are a necessity on
GPUs (because they are so extreme) are also beneficial for
multi-threaded and vectorized code. in fact, i personally have mostly
abandoned GPU programming in favor of adding multi-threading and
vectorization to code, but my multi-thread programming is heavily
influenced by what i learned from trying to program GPUs, and this
makes the code much better than what i wrote before.

HTH,
    axel.


>
> Regards
>
> David Foster
>
> Ph.D. Student of Chemistry
>
> --------------------------------------------
> On Sat, 6/21/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
>
>  Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on  CPU-GPU (pw-gpu.x)
>  To: "PWSCF Forum" <pw_forum at pwscf.org>
>  Date: Saturday, June 21, 2014, 2:20 AM
>
>  On Sat, Jun 21, 2014 at 4:20 AM, Reza Behjatmanesh-Ardakani
>  <reza_b_m_a at yahoo.com> wrote:
>  > Dear Axel
>  > This was just a proposal. If I am right, the Terachem code can use
>  > gaming cards for GPU calculations (I saw some of its authors' papers).
>
>  yes, but terachem was written from the ground up with new algorithms
>  to avoid loss of precision. in quantum mechanics this is important,
>  since a lot of calculations depend on comparing large numbers of equal
>  sign and magnitude and looking at the difference. about the only part
>  of a plane wave DFT calculation that is "conservative" in terms of
>  precision without a massive redesign are the FFTs. the loss of
>  precision is fairly small when replacing double precision FFTs with
>  single precision ones. for the many 3d-FFTs required, this is
>  particularly beneficial when trying to scale out via MPI, as it cuts
>  the number of bytes that need to be sent and copied around in half and
>  also reduces the strain on memory bandwidth.
>
>  > As you know, the main problem of GTX cards comes down to two
>  > important things: one, single precision, and the other, lack of ECC.
>
>  ECC is a lesser issue. and it is not a problem of single precision,
>  but of poor performance in double precision due to having only a
>  fraction of the double precision units. another issue is the lack of
>  RAM. also, you have to distinguish between different GTX cards: a few
>  of the most high-end consumer cards *do* have the full set of double
>  precision units and a large amount of RAM.
>
>  ECC is mostly relevant for people running a large number of GPUs in a
>  supercomputer environment.
>
>  >
>  > It is not necessary to write a stand-alone code. We can test QE-GPU
>  > with both TESLA and/or GTX and QE (CPU only), and compare the outputs.
>
>  but it is pointless to run on hardware that is not competitive.
>  you'll already have a hard time getting a 2x speedup from using a
>  top-level tesla card vs. an all-CPU run on a decent machine. what
>  would be the point of having the GPU _decelerate_ your calculation?
>
>  in general, a lot of the GPU stuff is hype and misinformation. the
>  following is a bit old, but still worth a read:
>  http://www.hpcwire.com/2011/12/13/ten_ways_to_fool_the_masses_when_giving_performance_results_on_gpus/
>
>  as a consequence of a very smart and successful PR strategy, there is
>  now the impression that *any* kind of GPU will result in a *massive*
>  speedup. even people with a laptop GPU with 2 SMs and no memory
>  bandwidth are now expecting 100x speedups and more. however, except
>  for a few corner cases and applications that map very well onto GPUs
>  (not very complex) and badly onto a CPU, you will often get more like
>  a 2x-5x speedup in a "best effort" comparison of a well equipped host
>  with a high-end GPU. in part, this situation has become worse with
>  some choices made by nvidia hardware and software engineers. while 5
>  years back the difference between a consumer and a computing GPU was
>  small, the consumer models have been systematically "downgraded" (via
>  removing previously supported management features in the driver and
>  basing consumer cards on a simplified design that mostly makes them
>  mid-level GPUs).
>
>  > I tested it for only one case (a rutile 3*3*2 supercell), and saw
>  > that the GTX output is similar to the CPU one.
>  >
>  > However, it needs to be tested for different cases and different
>  > clusters to be sure that the lack of ECC and double precision has no
>  > effect on the results.
>
>  sorry, this statement doesn't make any sense. it looks to me like you
>  need to spend some time learning what the technical implications of
>  ECC and single-vs-double precision are (and the fact that it is the
>  software that chooses which precision is used, not the hardware).
>
>  whether a card has ECC or not: broken memory is broken memory, and if
>  it works, it works. so there is not much to test. if you want to find
>  out whether your GPU has broken or borderline memory, run the GPU
>  memtest. it is much more effective at finding issues than any other
>  application.
>
>  where ECC helps is for very long running calculations or calculations
>  across a very large number of GPUs, when a single bitflip can render
>  the entire effort useless and result in a crash. in a dense cluster
>  environment or badly cooled desktops, this is a high risk. in a well
>  set up machine, it is less of a risk, but you have to keep in mind
>  that running without ECC makes you "blind" to those errors. i run a
>  cluster with a pile of Tesla GPUs and we have disabled ECC, since the
>  machines run very reliably due to some hacking around restrictions
>  that nvidia engineers placed in their drivers.
>  https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness
>
>  we also run consumer level GPUs, particularly in the login nodes,
>  since they work fine for development and don't cost as outrageously
>  much as the tesla models. however, for development, absolute
>  performance is a lesser concern.
>
>  > As Filippo said earlier for GTX cards, the output may not be
>  > reproducible. However, I think that due to the nature of the SCF
>  > algorithm, the code can be used at least
>
>  when you have memory corruption due to bad/overheated memory, no SCF
>  algorithm will save you. if you go back 10 years, when CPUs didn't
>  have all that power management and automatic self-protection and
>  memory modules in desktops were often of very low quality, people
>  experienced a lot of problems. "signal 11" and "segmentation fault"
>  were a common topic on many mailing lists for scientific (or other)
>  software that caused a high CPU load.
>
>  but the indication of broken memory was usually a crash due to a
>  segfault, or bad data corruption leading to a massive change in the
>  numbers and often NaNs. once you have a single NaN in your data, it
>  will spread like a highly infectious virus and render the calculation
>  invalid.
>
>  a well set up consumer level GPU will run as reliably as a tesla or
>  better, only you cannot tell, since the nvidia tools will not show
>  you. the main issue is performance and available memory.
>
>  > for VC-RELAX, RELAX, and SCF types of calculations with GTX cards.
>  > Of course, it should be tested. Thank you for your interest.
>
>  you are not making much sense here either. but if it makes you feel
>  better to do those tests, don't let me discourage you. sometimes
>  people learn best this way.
>
>  axel.
>
>
>  > With the Best Regards
>  >
>  >    Reza Behjatmanesh-Ardakani
>  >    Associate Professor of Physical Chemistry
>  >    Address:
>  >    Department of Chemistry,
>  >    School of Science,
>  >    Payame Noor University (PNU),
>  >    Ardakan,
>  >    Yazd,
>  >    Iran.
>  >    E-mails:
>  >           1- reza_b_m_a at yahoo.com (preferred),
>  >           2- behjatmanesh at pnu.ac.ir,
>  >           3- reza.b.m.a at gmail.com.
>  >
>  > --------------------------------------------
>  > On Fri, 6/20/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
>  >
>  >  Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
>  >  To: "PWSCF Forum" <pw_forum at pwscf.org>
>  >  Date: Friday, June 20, 2014, 2:19 PM
>  >
>  >  On Fri, Jun 20, 2014 at 4:22 AM, Reza Behjatmanesh-Ardakani
>  >  <reza_b_m_a at yahoo.com> wrote:
>  >  > Dear Filippo
>  >  >
>  >  > Due to the nature of QE, which is iterative, I think the lack of
>  >  > ECC and even of double precision floating point in gaming cards
>  >  > (GTX) compared to tesla cards is not a serious problem for QE-GPU.
>  >  > Some authors have checked this for the AMBER molecular dynamics
>  >  > simulation code. See the following site:
>  >
>  >  classical MD is a very different animal than what you do with QE.
>  >  errors due to single precision in some properties of classical MD
>  >  are huge with all-single-precision calculations. computing a force
>  >  from a distance is not much affected, but summing up the forces can
>  >  already be a problem. "good" classical MD codes usually employ a
>  >  mixed precision approach, where only the accuracy-insensitive parts
>  >  are done in single precision. for very large systems, even double
>  >  precision can show significant floating point truncation errors.
>  >  usually you depend on error cancellation, too, e.g. when you study
>  >  a simple homogeneous system (as is quite common in those tests).
>  >
>  >  >
>  >  > http://www.hpcwire.com/2014/03/13/ecc-performance-price-worth-gpus
>  >  >
>  >  > and see the following paper:
>  >  >
>  >  > www.rosswalker.co.uk/papers/2014_03_ECC_AMBER_Paper_10.1002_cpe.3232.pdf
>  >  >
>  >  > I encourage the users of QE-GPU to test it for QE, and report the
>  >  > difference on the site.
>  >
>  >  it is a waste of time and effort. people have done DFT and HF in
>  >  (partial) single precision before, and only if you write a new code
>  >  from scratch and have an extremely skilled programmer will you
>  >  succeed. have a look at the terachem software out of the group of
>  >  todd martinez, for example.
>  >
>  >  axel.
>  >
>  >  > PS: to be able to test the results for GTX and TESLA, the QE-GPU
>  >  > code needs to run on GTX :-)



-- 
Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.


