[Pw_forum] error on XD1 platform

Andrea Ferretti ferretti.andrea at unimore.it
Fri Mar 16 00:23:50 CET 2007


Hi all,

>  Dear All,
>
>   I have a problem with running PW-3.2 on Cray XD1 platform. Sometimes it works,
>   but very often it crashes. It doesn't matter which version of PGI
>   compiler is used (6.1.1, 6.1.4 or 7.0-2). The code was linked with
>   ACML library and FFTW from QE distribution.
>

I ran into problems that sound similar, also on a Cray XD1 machine 
(PGI compiler 6.0). I used both espresso-3.2 and the CVS version (from 
around one month ago). Some calculations worked, while others 
(particularly heavy ones) sometimes worked and sometimes did not. 
The crashes came either at the beginning of the calculation or 
after a number of scf iterations.

First, I discovered (or rather, someone told me about) the existence in 
espresso of a preprocessor flag, __XD1, which is meant to fix some 
trouble in the MPI parts (trouble with PGI, I guess). 
I don't understand the errors you got, but since they seem to be 
somehow related to MPI, I would try adding the flag
-D__XD1 
to DFLAGS and FDFLAGS in the make.sys file.
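
As a rough sketch of that change (assuming make.sys is processed by GNU 
make, and that DFLAGS and FDFLAGS are already defined earlier in the 
file with whatever flags configure produced for your machine), the two 
lines below could be placed right after the existing definitions:

```make
# Sketch only: append the XD1 preprocessor flag to the existing definitions.
# The flags already present in DFLAGS/FDFLAGS are site-specific and are
# deliberately left untouched here.
DFLAGS  += -D__XD1
# If your make.sys defines FDFLAGS as $(DFLAGS), this second line only
# repeats the flag, which should be harmless.
FDFLAGS += -D__XD1
```

Since this changes what the preprocessor sees, the code presumably has to 
be rebuilt from scratch afterwards (for example "make clean" followed by 
"make pw") rather than just relinked.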

In my case, however, adding __XD1 did not solve the problem. 
After some feedback from the system administrator, it seemed to be 
due to instabilities of the system itself (especially related to 
memory management).

cheers
andrea


> 
>  Dear All,
> 
>   I have a problem with running PW-3.2 on Cray XD1 platform. Sometimes it works, 
>   but very often it crashes. It doesn't matter which version of PGI 
>   compiler is used (6.1.1, 6.1.4 or 7.0-2). The code was linked with 
>   ACML library and FFTW from QE distribution.
> 
>   The problem always happens after several ionic steps or during scf 
>   cycles. For example, I see the following in the output file:
> 
> .....
> Writing output data file XXX.save
> Process 0 lost connection: exiting
> mpiexec: Error: read_rai_startup_ports: Failed to read barrier entry token from rank 1 process on c645n2.
> 
> Process 38 lost connection: exiting
>  ask 128 got 56  at line 863 in file /var/tmp/mpich-1.2.6/mpid/rai/raifma.cPProcess 16 lost connection: exiting
> 
> 

--
Andrea Ferretti
National Research Center S3, CNR-INFM  ( http://s3.infm.it )
Dipartimento di Fisica, Universita' di Modena e Reggio Emilia
Via Campi 213/A I-41100 Modena, Italy
Tel:     +39 059 2055301      Fax:     +39 059 374794
Skype:   andrea_ferretti
URL:     http://www.nanoscience.unimo.it

Please, if possible, don't send me MS Word or PowerPoint attachments
Why? See:  http://www.gnu.org/philosophy/no-word-attachments.html


