[Q-e-developers] diagonalization with multithreads libs slow: comparison between Jade (cines) and Fermi (cineca)
Fabio Affinito
f.affinito at cineca.it
Mon Oct 1 14:12:47 CEST 2012
I take this chance to suggest that you try the ELPA libraries. At the moment ELPA has been implemented only for real matrices (the hermitian case produces some errors in the Cholesky step...).
The ELPA libraries are built on top of SCALAPACK, so to compile with ELPA you must enable both (pass --with-scalapack and --with-elpa to configure).
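Just as a minimal sketch, a build along these lines should do it (compiler and library settings omitted, adjust to your machine; only the two flags come from the point above):

  ./configure --with-scalapack --with-elpa
  make cp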
I have run a CP dynamics of 256 water molecules for 200 time steps, launched with the following setup (a sketch of the launch command is given after the list):
1024 MPI tasks
128 BG size
4 OpenMP threads per task
ndiag = 256
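As a rough illustration only (not the exact command I used), a BG/Q launch matching the setup above would look something like this; the input file name cp.in is just a placeholder, and 8 ranks per node simply follows from 1024 tasks on a 128-node block:

  runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 : ./cp.x -ndiag 256 -in cp.in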
This is the comparison of SCALAPACK only and SCALAPACK+ELPA:
SCALAPACK only:
Called by main_loop:
move_electro : 676.91s CPU 676.91s WALL ( 200 calls)
ortho : 379.86s CPU 379.86s WALL ( 201 calls)
updatc : 47.18s CPU 47.18s WALL ( 201 calls)
strucf : 0.25s CPU 0.25s WALL ( 1 calls)
calbec : 7.86s CPU 7.86s WALL ( 202 calls)
rsg : 244.54s CPU 244.54s WALL ( 201 calls)
ELPA + SCALAPACK:
Called by main_loop:
move_electro : 680.77s CPU 680.77s WALL ( 200 calls)
ortho : 279.69s CPU 279.69s WALL ( 201 calls)
updatc : 46.45s CPU 46.46s WALL ( 201 calls)
strucf : 0.25s CPU 0.25s WALL ( 1 calls)
calbec : 7.65s CPU 7.65s WALL ( 202 calls)
rsg : 105.16s CPU 105.16s WALL ( 201 calls)
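In other words, ELPA cuts ortho from 379.86 s to 279.69 s and rsg from 244.54 s to 105.16 s (roughly a factor of 2.3), while move_electro is essentially unchanged.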
It would be very nice to have other data...
Ciao,
Fabio
Fabio Affinito, PhD
SuperComputing Applications and Innovation Department
CINECA - via Magnanelli, 6/3, 40033 Casalecchio di Reno (Bologna) - ITALY
Tel: +39 051 6171794 Fax: +39 051 6132198
----- Original Message -----
> From: "Ivan Girotto" <igirotto at ictp.it>
> To: "General discussion list for Quantum ESPRESSO developers" <q-e-developers at qe-forge.org>
> Sent: Monday, October 1, 2012 2:03:05 PM
> Subject: Re: [Q-e-developers] diagonalization with multithreads libs slow: comparison between Jade (cines) and Fermi (cineca)
>
>
> Hi Layla,
>
> I have never tried with 1 thread, as it's not recommended on BG/Q: use
> at least 2 threads per MPI process.
> On the other hand, I have some doubts about the way you are running the
> jobs. How big is the BG size? How many processes are you running per
> node in each of the two cases?
>
> I have some doubts about the question itself too. You are saying that
> you see a slowdown comparing 4 threads vs. 1 thread, but in the
> table below you report only data with 4 threads for BG/Q.
> Have you perhaps swapped the headers of the two columns Threads and
> Ndiag?
>
> It's expected that the SGI architecture based on Intel processors is
> faster than BG/Q.
>
> Ivan
>
> On 01/10/2012 13:17, Layla Martin-Samos wrote:
>
> Dear all, I have made some test calculations on Fermi and Jade for a
> 107-atom system, 70 Ry wavefunction cutoff, 285 occupied bands and 1
> k-point. What the results seem to show is that the multithreaded
> libraries considerably slow down the diagonalization time (diaghg is
> called 33 times in all the jobs and the final results are identical).
> The version compiled by CINECA gives identical timing and results to
> 5.0.1. Note that Jade in sequential mode is faster than BG/Q. I am
> continuing some other tests on Jade; unfortunately the runs spend a
> long time in the queue, the machine is full and even for a 10-minute
> job with 32 cores you wait more than 3 hours. I attach the two
> make.sys files for Jade.
>
>
> computer  mpi process  threads  ndiag  complex/gamma_only  time for diaghg  version  libs
>
> bgq       128          4        1      complex (cdiaghg)   69.28 s          5.0.1    threads
> bgq       128          4        1      complex (cdiaghg)   69.14 s          4.3.2    threads
>
> jade      32           1        1      complex (cdiaghg)   27.44 s          4.3.2    sequential
> jade      32           1        1      complex (cdiaghg)   > 10 min         4.3.2    threads
> jade      32           1        1      complex (cdiaghg)   > 10 min         5.0.1    threads
>
> bgq       128          4        4      complex (cdiaghg)   310.52 s         5.0.1    threads
>
> bgq       128          4        4      gamma (rdiaghg)     73.87 s          5.0.1    threads
> bgq       128          4        4      gamma (rdiaghg)     73.71 s          4.3.2    threads
>
> bgq       128          4        1      gamma (rdiaghg)     CRASH 2 it       5.0.1    threads
> bgq       128          4        1      gamma (rdiaghg)     CRASH 2 it       4.3.2    threads
>
>
> Did anyone observe a similar behavior?
>
> cheers
>
> Layla
>
> --
>
> Ivan Girotto - igirotto at ictp.it
> High Performance Computing Specialist
> Information & Communication Technology Section
> The Abdus Salam International Centre for Theoretical Physics - www.ictp.it
> Strada Costiera, 11 - 34151 Trieste - IT
> Tel +39.040.2240.484
> Fax +39.040.2240.249
> _______________________________________________
> Q-e-developers mailing list
> Q-e-developers at qe-forge.org
> http://qe-forge.org/mailman/listinfo/q-e-developers
>