[Pw_forum] ph.x: Avoiding the recalculation of the band structure in distributed phonon dispersion jobs

Andrea Dal Corso dalcorso at sissa.it
Tue Feb 26 15:18:35 CET 2013


On Tue, 2013-02-26 at 11:00 +0000, Karttunen Antti wrote:
> Dear Andrea,
> 
> I tried your new only_init keyword by running the new GRID_example/run_example_3 and with my own test jobs. It worked great, but only after I sorted out one really strange issue. The example was crashing for apparently random (q,irr) combinations right in the beginning of ph.x. The error occurred at subroutine check_directory_phsave and it was coming from the following call:
> 
> CALL iotk_open_read(iunout, FILE = TRIM(filename1), &
>                                         BINARY = .FALSE., IERR = ierr )
> which leads to:
> IF (ierr /= 0) CALL errore('check_directory_phsave','opening file',1)
> 
> ierr from iotk was always 2.
> 
> I did try to see which file caused the problem, and it was always dynmat.1.0.xml (maybe because it's the first file that is read in the loop). Now, the strange thing was that this occurred randomly: for about 20% of the (q,irr) combinations within the example. And the crashing combinations changed between runs. Furthermore, my own tests showed that even if some (q,irr) combination crashed, ph.x would run successfully if I re-ran the same input several times (something like 5 times). I'm running the tests on a local filesystem, so it's not an NFS issue. I guess it could be a compiler issue with iotk (I used ifort 12.1.5), but at least compiling iotk without any optimization flags did not help. I'm really puzzled by the (apparently) random nature of the crashes; I could not figure out what the non-deterministic factor is here (I sure hope it's not my hard drive...).
> 

Thank you for your help in identifying bugs. Now I have committed a bug
fix to check_directory_phsave. 

> In any case, I could avoid the crashes by replacing the
> IF (ierr /= 0 ) GOTO 100
> with 
> IF (ierr /= 0 ) CYCLE
> in check_directory_phsave and uncommenting the CALL errore after the loop. I think this is also how things were done before code revision 9858. We were previously using revision 9772 and this problem never occurred there (maybe the problem is unrelated to only_init and just surfaced now when I changed the revision?).
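
The skip-and-continue behaviour described above can be illustrated with a small Python analogy. This is only a sketch of the control-flow pattern, not the Fortran code in check_directory_phsave; the function name scan_files is hypothetical, and the "continue" plays the role that CYCLE plays in the quoted workaround.

```python
def scan_files(paths):
    """Try to open every file; skip the ones that fail instead of stopping.

    Returns (readable, skipped). The 'continue' below is the Python
    counterpart of Fortran's CYCLE: move on to the next loop iteration.
    """
    readable, skipped = [], []
    for path in paths:
        try:
            with open(path, "r"):
                pass  # only test that the file can be opened
        except OSError:
            skipped.append(path)
            continue  # do not abort the whole scan on one failure
        readable.append(path)
    return readable, skipped
```

With this pattern a single unreadable file no longer ends the scan; any error handling can be done once, after the loop, based on the list of skipped files.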
> 
> It would be interesting to know what was causing the random iotk_open_read errors. Furthermore, maybe the whole loop over nqs and irr_iq in check_directory_phsave could be skipped for cases where start_q=last_q and start_irr=last_irr? This is the normal case for grid jobs and for large systems this could help to avoid hundreds of filesystem operations for every job (since we are going to do just one (q,irr), it's not that interesting whether the other ones have been done or not). But maybe this would have some side-effects, so at the moment I'll just use the above trick.
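
A minimal sketch of the shortcut suggested above, written in Python rather than in the Fortran of ph.x. It assumes that the files follow the pattern dynmat.<iq>.<irr>.xml (inferred from the single name dynmat.1.0.xml mentioned in this thread) and that irr runs from 0 up to the number of irreps of each q; both are assumptions, and files_to_check is a hypothetical helper.

```python
def files_to_check(nqs, irr_iq, start_q, last_q, start_irr, last_irr):
    """List the dynmat files a job would have to look at.

    nqs     -- number of q-points
    irr_iq  -- irr_iq[iq - 1] is the number of irreps at q-point iq
    File naming and the irr = 0 entry are assumptions, see the text above.
    """
    # Grid job for exactly one (q, irr): only its own file matters.
    if start_q == last_q and start_irr == last_irr:
        return [f"dynmat.{start_q}.{start_irr}.xml"]
    # General case: scan every (q, irr) combination.
    return [
        f"dynmat.{iq}.{irr}.xml"
        for iq in range(1, nqs + 1)
        for irr in range(0, irr_iq[iq - 1] + 1)
    ]
```

For a single-(q,irr) job the list has one entry, so the number of filesystem operations no longer grows with the total number of irreps.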
> 
OK, I will see if I can do something for this problem.

Andrea

> Best wishes,
> Antti
> 
> -- 
> Dr. Antti Karttunen
> Department of Chemistry
> University of Jyväskylä, Finland
> Tel: +358-50-3473475
> WWW: http://www.iki.fi/ankarttu 
> 
> 
> -----Original Message-----
> From: pw_forum-bounces at pwscf.org [mailto:pw_forum-bounces at pwscf.org] On Behalf Of Andrea Dal Corso
> Sent: Monday, February 25, 2013 7:36 PM
> To: PWSCF Forum
> Subject: Re: [Pw_forum] ph.x: Avoiding the recalculation of the band structure in distributed phonon dispersion jobs
> 
> 
> On Mon, 2013-02-11 at 18:30 +0000, Karttunen Antti wrote:
> > Dear all, 
> > 
> > We are using the start_q/last_q and start_irr/last_irr keywords to execute phonon dispersion jobs within a HPC grid service. The scheme works really nicely and we are able to run fairly large phonon dispersion calculations very efficiently. However it would be great to know if we could further increase the efficiency by avoiding the recalculation of the band structure at all irreps for every q.
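
As an illustration of the scheme described above, the following Python sketch writes a ph.x input restricted to a single (q,irrep) pair. The keywords start_q, last_q, start_irr and last_irr are the ones named in this thread; the values of prefix, outdir and fildyn are placeholders, the 4x4x4 default mirrors the grid mentioned in this thread, and the helper ph_input is hypothetical.

```python
def ph_input(q, irr, prefix="clathrate", outdir="./tmp", nq=(4, 4, 4)):
    """Return a ph.x input deck that computes only irrep `irr` of q-point `q`.

    prefix, outdir and fildyn are placeholder values for illustration.
    """
    lines = [
        f"phonons: q = {q}, irr = {irr}",  # title line
        "&inputph",
        f"  prefix = '{prefix}'",
        f"  outdir = '{outdir}'",
        "  fildyn = 'dyn'",
        "  ldisp = .true.",
        f"  nq1 = {nq[0]}, nq2 = {nq[1]}, nq3 = {nq[2]}",
        f"  start_q = {q}, last_q = {q}",
        f"  start_irr = {irr}, last_irr = {irr}",
        "/",
    ]
    return "\n".join(lines) + "\n"
```

Calling ph_input(3, 7) gives the input for the seventh irrep of the third q-point; one such input per grid job covers the whole dispersion.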
> > 
> > A concrete example: We are using a 4x4x4 q-point grid to investigate the phonon dispersions of cubic silicon clathrate (FCC structure with 34 atoms in the primitive cell), requiring the calculation of 8 q-points in total. While the number of symmetry-independent q-points is rather low, the individual q-points can contain as many as 101 irreps (558 (q,irrep) calculations in total). While in "normal" phonon dispersion calculations the band structure is solved once for every q, in the distributed phonon dispersion calculations every single (q,irrep) job calculates the band structure before doing the actual phonon calculation (except q=1). So, the band structure is "re-calculated" numerous times in the distributed scheme. The overhead is not negligible: For a single (q,irrep) job at the q-points with the lowest symmetry, the band structure calculation can typically take ~10 CPU hours of the total execution time of ~60 CPU hours (we are running the jobs in the grid with just one CPU).
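
The size of this overhead can be made concrete with a short Python calculation. The ~10 and ~60 CPU hours and the 101 irreps are the figures quoted above; combining them assumes that the q-point with 101 irreps is also one of the lowest-symmetry q-points, which the message does not state explicitly. Both function names are hypothetical.

```python
def band_fraction(band_hours, total_hours):
    """Share of one (q, irrep) job spent on the band structure."""
    return band_hours / total_hours


def hours_saved(band_hours, n_irreps):
    """CPU hours saved at one q-point if the band structure is computed
    once instead of once per irrep job."""
    return band_hours * (n_irreps - 1)


# Figures from the message: ~10 of ~60 CPU hours per job, up to 101 irreps.
fraction = band_fraction(10, 60)
saved = hours_saved(10, 101)
```

Under these assumptions roughly one sixth of each such job goes into the band structure, and reusing it would save on the order of 1000 CPU hours at a q-point with 101 irreps.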
> > 
> > For systems like this, it would be really great if we could do something like this:
> > 1) Precalculate the band structure for every q (for example, for irrep=1),
> > 2) Write the results of the band structure calculation to a file for every q 
> > 3) For all other irreps, just read the precalculated band structure from the file.
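
The three steps above amount to a two-stage job plan, sketched here in Python. The sketch only orders the (q,irrep) jobs; how the band structure is actually written and read back is left to ph.x and is not modelled. The helper two_stage_plan and the example irrep counts are hypothetical.

```python
def two_stage_plan(irreps_per_q):
    """Split (q, irrep) jobs into a preparatory and a main stage.

    irreps_per_q -- dict mapping q-point index to its number of irreps
    Stage 1: one job per q (irrep 1) that also produces the band structure.
    Stage 2: all remaining irreps, which would reuse the stage-1 results.
    """
    qs = sorted(irreps_per_q)
    stage1 = [(q, 1) for q in qs]
    stage2 = [(q, irr) for q in qs for irr in range(2, irreps_per_q[q] + 1)]
    return stage1, stage2


# Hypothetical irrep counts, for illustration only.
prep, rest = two_stage_plan({1: 3, 2: 2})
```

Stage 2 jobs may only be submitted once the stage 1 job of the same q has finished and its output files have been made available to them.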
> > 
> > We are already using a similar scheme to avoid the re-calculation of the dielectric constant for all q=1 irreps:
> > 1) Precalculate the dielectric constant for (q=1,irrep=1)
> > 2) Use data-file.1.xml with DIELECTRIC_CONSTANT and EFFECTIVE_CHARGES as the starting point for other q=1 irreps.
> > 3) With recover=.true., the re-calculation of the dielectric constant is avoided
> > 
> > However, we have not been able to devise a similar scheme to avoid the re-calculation of the band structure for q>1. I've been reading the source code, but at least based on check_initial_status.f90 it seems that reading the bands is only possible if there is a restart file available (i.e. the calculation has been interrupted). So, while the built-in logic supports restarting "normal" phonon dispersion calculations, we haven't been able to find a way to read the band structure into a single (q,irrep) job.
> > 
> 
> I thought that this procedure was already working if you copied all the
> files produced by ph.x using start_irr=1 and last_irr=1 as a preparatory
> run, but there were still some problems. I committed a script in the SVN
> version inspired by your suggestion (see GRID_example/run_example_3).
> Hopefully you can adapt it to your cases. 
> 
> HTH,
> 
> Andrea
> 
> 
> 
> 
> > We would really appreciate any comments or ideas on how to avoid the overhead from the band structure calculations in the above scenario.
> > 
> > Best wishes,
> > Antti Karttunen
> > 
-- 
Andrea Dal Corso                    Tel. 0039-040-3787428
SISSA, Via Bonomea 265              Fax. 0039-040-3787249
I-34136 Trieste (Italy)             e-mail: dalcorso at sissa.it




