OK, thanks.

It is as Paolo suggested: some parts of the code are not parallelized over k-points. That said, I don't see why these parts show a much larger timing when going from the serial run to the run with k-point parallelization. For instance, I would expect newd to take more or less the same time, but it takes about three times longer. The timing of sum_band is also considerably higher. Maybe there is communication overhead between the MPI processes, or the memory-contention problem mentioned earlier (the volume of data moving between the cores and main memory increases considerably when going from 1 to 6 pools).
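Just to put rough numbers on this, a back-of-the-envelope Amdahl-type estimate (very approximate, assuming that only c_bands is distributed over the pools and that everything else would keep its serial timing):

\[
  p \approx \frac{t_{\mathrm{c\_bands}}}{t_{\mathrm{PWSCF}}}
    = \frac{209.73\ \mathrm{s}}{436.35\ \mathrm{s}} \approx 0.48
\]
\[
  S(N_{\mathrm{pool}}) = \frac{1}{(1-p) + p/N_{\mathrm{pool}}}
  \quad\Longrightarrow\quad
  S(6) \lesssim \frac{1}{0.52 + 0.48/6} \approx 1.7
\]

So even if newd and sum_band had kept their serial timings, 6 pools could give at most a ~1.7x overall speedup here; since those routines instead slow down by roughly a factor of 3, the pool run ends up slower than the serial one (10m57s vs 7m16s CPU).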
GS


On 15 Feb 2011, at 11:37, Davide Sangalli wrote:

> Dear Paolo and Gabriele,
> thanks a lot for all your comments.
>
> For Gabriele, in case you are still interested, I post the details of my
> calculations.
>
> Best regards and thank you again,
> Davide
>
> ****************************************************************
> TEST 1: Serial run
>
>      init_run     :  24.83s CPU  25.13s WALL (     1 calls)
>      electrons    :  349.01s CPU  351.40s WALL (     1 calls)
>      forces       :  17.99s CPU  18.04s WALL (     1 calls)
>      stress       :  44.14s CPU  44.30s WALL (     1 calls)
>
>      Called by init_run:
>      wfcinit      :  10.50s CPU  10.64s WALL (     1 calls)
>      potinit      :  1.93s CPU  1.97s WALL (     1 calls)
>
>      Called by electrons:
>      c_bands      :  209.73s CPU  211.25s WALL (    10 calls)
>      sum_band     :  65.96s CPU  66.35s WALL (    10 calls)
>      v_of_rho     :  8.64s CPU  8.82s WALL (    11 calls)
>      newd         :  70.57s CPU  70.81s WALL (    11 calls)
>      mix_rho      :  0.79s CPU  0.79s WALL (    10 calls)
>
>      Called by c_bands:
>      init_us_2    :  1.45s CPU  1.46s WALL (   138 calls)
>      cegterg      :  205.73s CPU  206.86s WALL (    60 calls)
>
>      Called by *egterg:
>      h_psi        :  119.93s CPU  119.97s WALL (   217 calls)
>      s_psi        :  24.87s CPU  24.88s WALL (   217 calls)
>      g_psi        :  1.04s CPU  1.03s WALL (   151 calls)
>      cdiaghg      :  3.98s CPU  4.07s WALL (   211 calls)
>
>      Called by h_psi:
>      add_vuspsi   :  24.87s CPU  24.87s WALL (   217 calls)
>
>      General routines
>      calbec       :  39.51s CPU  39.52s WALL (   289 calls)
>      cft3s        :  64.52s CPU  65.52s WALL ( 22216 calls)
>      interpolate  :  0.79s CPU  0.79s WALL (    21 calls)
>      davcio       :  0.01s CPU  0.63s WALL (   198 calls)
>
>      Parallel routines
>
>      PWSCF        :  7m16.35s CPU time,  7m19.59s WALL time
>
> ****************************************************************
> TEST 1: kpts parallelization
>
>      init_run     :  29.99s CPU  30.29s WALL (     1 calls)
>      electrons    :  441.37s CPU  453.52s WALL (     1 calls)
>      forces       :  51.92s CPU  52.91s WALL (     1 calls)
>      stress       :  133.94s CPU  137.38s WALL (     1 calls)
>
>      Called by init_run:
>      wfcinit      :  2.64s CPU  2.68s WALL (     1 calls)
>      potinit      :  1.92s CPU  2.02s WALL (     1 calls)
>
>      Called by electrons:
>      c_bands      :  40.54s CPU  42.66s WALL (    10 calls)
>      sum_band     :  177.87s CPU  182.15s WALL (    10 calls)
>      v_of_rho     :  11.17s CPU  11.74s WALL (    11 calls)
>      newd         :  228.49s CPU  229.61s WALL (    11 calls)
>      mix_rho      :  2.67s CPU  2.68s WALL (    10 calls)
>
>      Called by c_bands:
>      init_us_2    :  0.64s CPU  0.68s WALL (    21 calls)
>      cegterg      :  39.15s CPU  40.36s WALL (    10 calls)
>
>      Called by *egterg:
>      h_psi        :  34.15s CPU  34.19s WALL (    37 calls)
>      s_psi        :  1.64s CPU  1.64s WALL (    37 calls)
>      g_psi        :  0.22s CPU  0.22s WALL (    26 calls)
>      cdiaghg      :  0.48s CPU  0.48s WALL (    36 calls)
>
>      Called by h_psi:
>      add_vuspsi   :  1.67s CPU  1.67s WALL (    37 calls)
>
>      General routines
>      calbec       :  2.83s CPU  2.83s WALL (    49 calls)
>      cft3s        :  25.51s CPU  25.77s WALL (  3904 calls)
>      interpolate  :  1.57s CPU  1.58s WALL (    21 calls)
>      davcio       :  0.00s CPU  0.09s WALL (    10 calls)
>
>      Parallel routines
>
>      PWSCF        :  10m57.44s CPU time,  11m14.40s WALL time
>
> ****************************************************************
> TEST 1: FFT parallelization
>
>      init_run     :  7.12s CPU  8.04s WALL (     1 calls)
>      electrons    :  71.85s CPU  77.28s WALL (     1 calls)
>      forces       :  8.49s CPU  8.68s WALL (     1 calls)
>      stress       :  21.95s CPU  22.46s WALL (     1 calls)
>
>      Called by init_run:
>      wfcinit      :  1.61s CPU  2.06s WALL (     1 calls)
>      potinit      :  0.74s CPU  0.79s WALL (     1 calls)
>
>      Called by electrons:
>      c_bands      :  35.48s CPU  38.71s WALL (    11 calls)
>      sum_band     :  16.47s CPU  17.71s WALL (    11 calls)
>      v_of_rho     :  2.59s CPU  2.75s WALL (    12 calls)
>      newd         :  18.12s CPU  18.81s WALL (    12 calls)
>      mix_rho      :  0.42s CPU  0.44s WALL (    11 calls)
>
>      Called by c_bands:
>      init_us_2    :  0.65s CPU  0.66s WALL (   150 calls)
>      cegterg      :  34.41s CPU  37.31s WALL (    66 calls)
>
>      Called by *egterg:
>      h_psi        :  23.01s CPU  25.34s WALL (   239 calls)
>      s_psi        :  1.95s CPU  1.94s WALL (   239 calls)
>      g_psi        :  0.23s CPU  0.23s WALL (   167 calls)
>      cdiaghg      :  2.90s CPU  3.18s WALL (   233 calls)
>
>      Called by h_psi:
>      add_vuspsi   :  1.91s CPU  1.91s WALL (   239 calls)
>
>      General routines
>      calbec       :  3.54s CPU  3.81s WALL (   317 calls)
>      cft3s        :  12.24s CPU  15.25s WALL ( 24298 calls)
>      interpolate  :  0.35s CPU  0.37s WALL (    23 calls)
>      davcio       :  0.00s CPU  0.54s WALL (   216 calls)
>
>      Parallel routines
>      fft_scatter  :  4.34s CPU  6.95s WALL ( 24298 calls)
>
>      PWSCF        :  1m49.61s CPU time,  1m56.75s WALL time
>
>
> On 02/14/2011 06:22 PM, Paolo Giannozzi wrote:
>> Also notice that parallelization on k-points has (in principle)
>> a linear speedup on the diagonalization of H and related operations
>> depending on the number of k-points, but not for other operations
>> depending upon the charge density such as calculation of V[n(r)].
>> The latter are typically small in comparison with the former, but
>> it depends a lot upon the specific system. FFT parallelization
>> distributes both calculations (and yes, it distributes most memory,
>> I stand by my statement)
>>
>> P.
>
> Davide Sangalli
> MDM Lab, IMM, CNR
> Agrate (MI), Italy
> _______________________________________________
> Pw_forum mailing list
> Pw_forum@pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum
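P.S. To see the per-routine ratios at a glance, here is a small, purely illustrative Python snippet; the numbers are simply copied from the serial and 6-pool reports quoted above, nothing is computed by the code beyond the ratios:

# Per-routine CPU timings (seconds) copied from the two reports quoted above:
# serial run vs. run with 6 k-point pools.
serial = {"c_bands": 209.73, "sum_band": 65.96, "v_of_rho": 8.64,
          "newd": 70.57, "mix_rho": 0.79}
pools6 = {"c_bands": 40.54, "sum_band": 177.87, "v_of_rho": 11.17,
          "newd": 228.49, "mix_rho": 2.67}

for routine, t_serial in serial.items():
    t_pool = pools6[routine]
    # speedup > 1 means the routine got faster with pools; < 1 means slower
    print(f"{routine:10s} {t_serial:8.2f}s -> {t_pool:8.2f}s "
          f"(speedup {t_serial / t_pool:5.2f}x)")

It shows c_bands scaling by about 5.2x, while newd, sum_band and the other density-related routines actually get slower, which is what drags the total above the serial timing.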
<span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div><span class="Apple-style-span" style="color: rgb(126, 126, 126); font-size: 16px; font-style: italic; "><br class="Apple-interchange-newline">§ Gabriele Sclauzero, EPFL SB ITP CSEA</span></div><div><font class="Apple-style-span" color="#7E7E7E"><i> PH H2 462, Station 3, CH-1015 Lausanne</i></font></div></span>