[Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
Reza Behjatmanesh-Ardakani
reza_b_m_a at yahoo.com
Sun Jun 22 09:12:45 CEST 2014
Dear Axel
Thank you, that was very helpful.
As you said, some new GTX cards, such as the GTX Ti Black or GTX Ti Z, have good double-precision (DP) floating-point performance: for both, DP throughput is 1/3 of SP.
They are much cheaper than Tesla cards.
I am not sure whether the Ti Black or Ti Z has ECC; the Quadro K6000 does.
Thanks again.
With the Best Regards
Reza Behjatmanesh-Ardakani
Associate Professor of Physical Chemistry
Address:
Department of Chemistry,
School of Science,
Payame Noor University (PNU),
Ardakan,
Yazd,
Iran.
E-mails:
1- reza_b_m_a at yahoo.com (preferred),
2- behjatmanesh at pnu.ac.ir,
3- reza.b.m.a at gmail.com.
--------------------------------------------
On Sat, 6/21/14, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
To: "PWSCF Forum" <pw_forum at pwscf.org>
Date: Saturday, June 21, 2014, 1:50 PM
On Sat, Jun 21, 2014 at 4:20 AM, Reza Behjatmanesh-Ardakani <reza_b_m_a at yahoo.com> wrote:
> Dear Axel
> This was just a proposal. If I am right, the Terachem code can use gaming cards for GPU calculations (I have seen this in some of its authors' papers).
yes, but terachem was written from the ground up with new algorithms to avoid loss of precision. in quantum mechanics this is important, since a lot of calculations depend on comparing large numbers of equal sign and magnitude and looking at the difference. about the only part of a plane-wave DFT calculation that is "conservative" in terms of precision without a massive redesign is the FFTs: the loss of precision is fairly small when replacing double-precision FFTs with single-precision ones. for the many 3d-FFTs required, this is particularly beneficial when trying to scale out via MPI, since it cuts in half the number of bytes that need to be sent and copied around and also reduces the strain on memory bandwidth.
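to get a feel for how small that loss is, here is a toy sketch: a hand-rolled radix-2 FFT in pure python, with float32 rounding simulated via the struct module. none of this is QE code, and the signal is made up, so the numbers are only illustrative:

```python
import cmath
import random
import struct

def to_f32(z: complex) -> complex:
    """Round a complex value to single precision (simulating float32 hardware)."""
    return complex(struct.unpack('f', struct.pack('f', z.real))[0],
                   struct.unpack('f', struct.pack('f', z.imag))[0])

def fft(a, single=False):
    """Recursive radix-2 Cooley-Tukey FFT; with single=True, every
    intermediate result is rounded to float32 precision."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], single)
    odd = fft(a[1::2], single)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        if single:
            t = to_f32(t)
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
        if single:
            out[k] = to_f32(out[k])
            out[k + n // 2] = to_f32(out[k + n // 2])
    return out

# a deterministic pseudo-random test signal (stand-in for a plane-wave grid)
random.seed(0)
x = [complex(random.gauss(0.0, 1.0), 0.0) for _ in range(1024)]

ref = fft(x)                                   # double precision
lo = fft([to_f32(v) for v in x], single=True)  # simulated single precision

num = sum(abs(a - b) ** 2 for a, b in zip(ref, lo)) ** 0.5
den = sum(abs(a) ** 2 for a in ref) ** 0.5
print(f"relative error of the single-precision FFT: {num / den:.1e}")
```

the relative error comes out near the float32 rounding level, i.e. tiny compared to typical SCF convergence thresholds, while the data is half the size.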
> As you know, the main problem with GTX cards comes down to two things: one is single precision, and the other is the lack of ECC.
ECC is the lesser issue. and it is not a problem of single precision, but of poor double-precision performance due to having only a fraction of the double-precision units. another issue is the lack of RAM. you also have to distinguish between different GTX cards: a few of the most high-end consumer cards *do* have the full set of double-precision units and a large amount of RAM.
ECC is mostly relevant for people running a large number of GPUs in a supercomputer environment.
>
> It is not necessary to write a stand-alone code. We can test QE-GPU with both Tesla and/or GTX cards against QE (CPU only), and compare the outputs.
but it is pointless to run on hardware that is not competitive. you will already have a hard time getting a 2x speedup from a top-level tesla card vs. an all-CPU run on a decent machine. what would be the point of having the GPU _decelerate_ your calculation?
in general, a lot of the GPU stuff is hype and misinformation. the following is a bit old, but still worth a read:
http://www.hpcwire.com/2011/12/13/ten_ways_to_fool_the_masses_when_giving_performance_results_on_gpus/
as a consequence of a very smart and successful PR strategy, there is now the impression that *any* kind of GPU will result in a *massive* speedup. even people with a laptop GPU with 2 SMs and no memory bandwidth are now expecting 100x speedups and more. however, except for a few corner cases and applications that map very well onto GPUs (not very complex) and badly onto a CPU, you will more often get a 2x-5x speedup in a "best effort" comparison of a well-equipped host with a high-end GPU. in part, this situation has become worse with some choices made by nvidia hardware and software engineers: while 5 years back the difference between a consumer and a computing GPU was small, the consumer models have since been systematically "downgraded" (by removing previously supported management features from the driver and basing consumer cards on a simplified design that makes them mostly mid-level GPUs).
> I tested it for only one case (a rutile 3*3*2 supercell) and saw that the GTX output is similar to the CPU one.
>
> However, it needs to be tested for different cases and on different clusters to be sure that the lack of ECC and double precision has no effect on the results.
sorry, this statement doesn't make any sense. it looks to me like you need to spend some time learning what the technical implications of ECC and single-vs-double precision are (and the fact that it is the software that chooses which precision is used, not the hardware), whether a card has ECC or not. broken memory is broken memory, and if it works, it works. so there is not much to test. if you want to find out whether your GPU has broken or borderline memory, run the GPU memtest; it is much more effective at finding issues than any other application.
where ECC helps is for very long-running calculations or calculations across a very large number of GPUs, where a single bitflip can render the entire effort useless and result in a crash. in a dense cluster environment or in badly cooled desktops this is a high risk; in a well set up machine it is less of a risk, but you have to keep in mind that running without ECC makes you "blind" to those errors. i run a cluster with a pile of Tesla GPUs and we have disabled ECC, since the machines run very reliably thanks to some hacking around restrictions that nvidia engineers placed in their drivers:
https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness
we also run consumer-level GPUs, particularly in the login nodes, since they work fine for development and don't cost as outrageously much as the tesla models. for development, however, absolute performance is a lesser concern.
> As Filippo said earlier for GTX cards, the output may not be reproducible. However, I think that due to the nature of the SCF algorithm, the code can be used at least
when you have memory corruption due to bad or overheated memory, no SCF algorithm will save you. if you go back 10 years, when CPUs didn't have all that power management and automatic self-protection and memory modules in desktops were often of very low quality, people experienced a lot of problems: "signal 11" and "segmentation fault" were a common topic on many mailing lists for scientific (or other) software that caused a high CPU load.
but the indication of broken memory was usually a crash due to a segfault, or bad data corruption leading to a massive change in the numbers and often to NaNs. once you have a single NaN in your data, it will spread like a highly infectious virus and render the calculation invalid.
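the "infectious virus" behavior is easy to demonstrate in a few lines of pure python; the normalization loop below is a made-up stand-in for any iterative update, not QE code:

```python
import math

# toy normalization step, a stand-in for any iterative update
values = [1.0, 2.0, 3.0, 4.0]
values[2] = float('nan')   # one corrupted entry, e.g. from a memory bitflip

total = sum(values)                   # any reduction touching a NaN is NaN
values = [v / total for v in values]  # ...and now every entry is NaN

print(all(math.isnan(v) for v in values))   # prints: True
```

after a single pass, the one bad value has poisoned the whole array; a few more iterations and there is nothing left to recover.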
a well set up consumer-level GPU will run as reliably as a tesla or better; you just cannot tell, since the nvidia tools will not show you. the main issue is performance and available memory.
> for VC-RELAX, RELAX, and SCF types of calculations with GTX cards. Of course, it should be tested. Thank you for your interest.
you are not making much sense here either. but if it makes you feel better to do those tests, don't let me discourage you. sometimes people learn best this way.
axel.
> --------------------------------------------
> On Fri, 6/20/14, Axel Kohlmeyer <akohlmey at gmail.com>
wrote:
>
> Subject: Re: [Pw_forum] A "relax" input runs on CPU (pw.x) but not on CPU-GPU (pw-gpu.x)
> To: "PWSCF Forum" <pw_forum at pwscf.org>
> Date: Friday, June 20, 2014, 2:19 PM
>
> On Fri, Jun 20, 2014 at 4:22 AM, Reza Behjatmanesh-Ardakani <reza_b_m_a at yahoo.com> wrote:
> > Dear Filippo
> >
> > Due to the iterative nature of QE, I think the lack of ECC, and even of double-precision floating point, in gaming cards (GTX) compared to Tesla cards is not a serious problem for QE-GPU. Some authors have checked this for the AMBER molecular dynamics simulation code. See the following site:
>
> classical MD is a very different animal from what you do with QE. errors in some properties due to all-single-precision calculations are huge in classical MD. computing a force from a distance will not be much affected, but summing up the forces can already be a problem. "good" classical MD codes usually employ a mixed-precision approach, where only the accuracy-insensitive parts are done in single precision. for very large systems, even double precision can show significant floating-point truncation errors. usually you are dependent on error cancellation, too, i.e. when you study a simple homogeneous system (as is quite common in those tests).
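the force-summation problem is easy to reproduce in a few lines of pure python, with float32 rounding simulated via the struct module. the magnitudes below are invented for illustration and not taken from any MD code:

```python
import struct

def f32(x: float) -> float:
    """Round to the nearest float32, simulating single-precision hardware."""
    return struct.unpack('f', struct.pack('f', x))[0]

# one large contribution plus many small ones, the typical shape of a
# per-particle force sum (magnitudes are invented for illustration)
terms = [1.0e4] + [1.0e-4] * 100_000

acc64 = 0.0            # double precision: python floats are 64-bit
for t in terms:
    acc64 += t

acc32 = 0.0            # simulated single precision: round after every add
for t in terms:
    acc32 = f32(acc32 + t)

# exact answer is 10000 + 10 = 10010; in float32 every 1e-4 term falls
# below half an ulp of the running total and is swallowed entirely
print(f"double: {acc64:.3f}  single: {acc32:.3f}")
```

the double-precision sum gives 10010 as expected, while the simulated single-precision sum stays pinned at 10000: the small contributions vanish completely. this is why mixed-precision MD codes keep the accumulation in higher precision even when the per-pair arithmetic is single precision.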
>
>
> >
> > http://www.hpcwire.com/2014/03/13/ecc-performance-price-worth-gpus
> >
> > and see the following paper:
> >
> > www.rosswalker.co.uk/papers/2014_03_ECC_AMBER_Paper_10.1002_cpe.3232.pdf
> >
> > I encourage the users of QE-GPU to test it for QE and report the differences on the site.
>
> it is a waste of time and effort. people have done DFT and HF in (partial) single precision before, and only if you write a new code from scratch and have an extremely skilled programmer will you succeed. have a look at the terachem software out of the group of todd martinez, for example.
>
> axel.
>
> > PS: to be able to compare the results for GTX and Tesla, the QE-GPU code needs to run on GTX :-)
>
>
>
--
Dr. Axel Kohlmeyer akohlmey at gmail.com
http://goo.gl/1wk0
College of Science & Technology, Temple University,
Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste.
Italy.
_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum