[Pw_forum] segmentation fault

Iurii Timrov itimrov at sissa.it
Mon Aug 17 12:38:39 CEST 2015


Dear Vishal Gupta,

Which version of Quantum ESPRESSO do you use?

Do you use an SVN version downloaded after July 6th, 2015 (revision
r11608 or later)? If so, you may have the same problem as I do. My
problem is due to the commit of July 6th (r11608), and it occurs when I
run QE on FERMI @ CINECA (BlueGene/Q architecture, Italian HPC) with
2048 cores; there is no problem with 1024 cores.

http://qe-forge.org/gf/project/q-e/scmsvn/?action=browse&path=%2Ftrunk%2Fespresso%2FModules%2Fmp_world.f90&r1=11607&r2=11608

At the very beginning of the run, there is a message:

4466173:ibm.runjob.client.Job: terminated by signal 11
4466173:ibm.runjob.client.Job: abnormal termination by signal 11 from 
rank 2031

and the code crashes without producing any output. However, the problem
did not occur on the other HPC machines I use.

I solved the problem by reverting to revision r11607, which amounts to
changing the call in the routine Modules/mp_world.f90 back from

CALL MPI_Init_thread(MPI_THREAD_MULTIPLE, PROVIDED, ierr)

to

CALL mpi_init_thread(MPI_THREAD_FUNNELED,PROVIDED,ierr)

You may also try making this change in mp_world.f90 and testing the
code again.
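
Before (or instead of) patching, you may also want to check
independently whether the MPI library on your machine actually provides
MPI_THREAD_MULTIPLE. A minimal standalone test along these lines (just
a sketch, not part of QE) could be compiled with mpif90 and run on a
few processes:

program check_mpi_thread_level
   ! Minimal sketch: request MPI_THREAD_MULTIPLE and report the
   ! threading level that the MPI library actually provides.
   use mpi
   implicit none
   integer :: provided, ierr

   call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)

   if (provided < MPI_THREAD_MULTIPLE) then
      print *, 'MPI_THREAD_MULTIPLE not provided; level = ', provided
   else
      print *, 'MPI_THREAD_MULTIPLE is provided'
   end if

   call MPI_Finalize(ierr)
end program check_mpi_thread_level

If the provided level is lower than MPI_THREAD_MULTIPLE, that would at
least make the r11608 change a plausible suspect on that machine.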

HTH

Best regards,
Iurii Timrov


Postdoctoral Researcher
SISSA - International School for Advanced Studies
Condensed Matter Sector
Via Bonomea n. 265,
Trieste 34151, Italy


On 2015-08-14 19:32, Axel Kohlmeyer wrote:
> On Fri, Aug 14, 2015 at 1:20 PM, Vishal Gupta 
> <vishal.gupta at iitrpr.ac.in> wrote:
>> Sorry, I should've mentioned.
>> I asked them, but they said there might be something wrong with the
>> QE input file. If that were the case, the file shouldn't have been
>> running fine with 7 processors, but it is. Could there really be
>> something wrong with the input file?
> 
> sysadmins often say this, so they don't have to check it out, or when
> they don't know what they are doing. if they *know* that there is
> something wrong with the input, then they should at the very least
> tell you what it is.
> 
> but i agree that if it works with fewer processors, it should work
> with more, unless you are using some very unusual settings when
> launching the job. more likely is that you are running out of memory
> on the machine or hitting a stack size limit or something similar.
> your system manager(s) should be able to figure this out and/or
> advise you how to run so that you use less memory, or with a hybrid
> MPI plus OpenMP parallelization, or whatever else is possible on the
> specific machine.
> 
> in any case, it doesn't really sound like a QE problem.
> 
> axel.
> 
> 
>> Sorry if I am asking stupid questions, but I am a little new at this.
>> Vishal Gupta
>> B.Tech. 3rd year Mechanical
>> Indian Institute of Technology Ropar
>> Rupnagar (140001), Punjab, India.
>> Email :- vishal.gupta at iitrpr.ac.in
>> 
>> On Fri, Aug 14, 2015 at 10:32 PM, Axel Kohlmeyer <akohlmey at gmail.com> 
>> wrote:
>>> 
>>> On Fri, Aug 14, 2015 at 12:58 PM, Vishal Gupta
>>> <vishal.gupta at iitrpr.ac.in> wrote:
>>> > I've been running an SCF calculation for an fcc Ni system on a
>>> > high-performance cluster. The job runs fine with 7 or fewer
>>> > processors, but it always leads to a segmentation fault if the
>>> > number of processors exceeds 7. The job takes 4-5 days for the run.
>>> > Is there any way to increase the number of processors so that it
>>> > doesn't lead to the error?
>>> > mpirun noticed that process rank 0 with PID 6353 on node c7c exited
>>> > on signal 11 (Segmentation fault).
>>> > or excessive memory leakage.
>>> 
>>> that is really a question you should ask the system manager(s) or
>>> the user support people of the machine that you are running on.
>>> 
>>> axel.
>>> 
>>> 
>>> >
>>> > Thank You
>>> > Vishal Gupta
>>> > B.Tech. 3rd year Mechanical
>>> > Indian Institute of Technology Ropar
>>> > Rupnagar (140001), Punjab, India.
>>> > Email :- vishal.gupta at iitrpr.ac.in
>>> >
>>> 
>>> 
>>> 
>>> --
>>> Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
>>> College of Science & Technology, Temple University, Philadelphia PA, 
>>> USA
>>> International Centre for Theoretical Physics, Trieste. Italy.





