[Pw_forum] wrong record length

Mon Mar 2 22:10:16 CET 2009

On Mon, 2 Mar 2009, Marci wrote:

MV> Hi Axel,
MV> 
MV> > marton,
MV> >
MV> > are you trying to run the postprocessing on your local
MV> > machine or on the IBM machine?
MV> 
MV> on the IBM machine. I had bad experiences with postprocessing on a
MV> different machine: with the iotk package, converting binary files to
MV> text files and back is quite time consuming... (and I hate ssh-ing
MV> gigabytes of files)

just checking. actually, there are ways to make fortran read IEEE-754 
compliant binary floating point numbers on different endian hardware,
but i never checked whether iotk can handle this as well.
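
to illustrate the endianness point: a minimal python sketch (not iotk or
espresso code), assuming the IBM machine is big-endian and the other
machine is little-endian. it only shows the byte order of a single
IEEE-754 double; real fortran unformatted files also carry record
markers, which are ignored here.

```python
import struct

value = 1.0

# The same IEEE-754 double, written with the two byte orders.
big = struct.pack(">d", value)      # big-endian layout (assumed for the IBM machine)
little = struct.pack("<d", value)   # little-endian layout (assumed for x86)

# The bit pattern is identical; only the order of the 8 bytes differs.
print(big == little[::-1])

# Reading big-endian bytes as if they were little-endian gives a wrong number,
# while stating the byte order explicitly recovers the original value.
wrong = struct.unpack("<d", big)[0]
right = struct.unpack(">d", big)[0]
print(wrong, right)
```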

[...]

MV> Unfortunately, the espresso I'm using on BASSI was not compiled by
MV> myself, and now I'm scared of compiling mine because I'm not sure that
MV> it will be able to read the binary that was made with an espresso
MV> probably compiled with different compilers and/or compiler options.

there is a big difference between linux and non-linux machines.
on linux there is a zoo of compilers and math libraries and 
there are all kinds of subtle compatibility issues. on AIX 
or other "commercial" platforms, this is generally less of an
issue. the catch is that it is not as easy to replace one compiler
with another, in case the system provided compiler is broken.

MV> Yeah, I know... I should have compiled my own version of quantum
MV> espresso before making serious calculations to avoid these
MV> situations.
MV> 
MV> So... I made some changes in diropn.f90 in espresso4.0/PW and compiled
MV> my own version of espresso (with this I get the same error) to print
MV> the values below for the big run. Honestly, I do not really know much
MV> about this cluster, but I'm sure I'm using the xl fortran compiler,
MV> version 11.1.0.3, and the essl library, version 4.2.0.3.

that is fine. 

MV> 
MV> recl:  415578000
MV> DIRECT_IO_FACTOR:          8
MV> unf_recl: -970343296

bingo!  this is your problem. 8 x 415578000 = 3324624000 is larger than
2^31 - 1 = 2147483647, so unf_recl defined as integer*4 will overflow
and wrap around to a negative number.
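
the arithmetic can be checked outside of fortran. this is a small python
sketch, not the code from diropn.f90; the variable names are only
borrowed from the printout, and ctypes.c_int32 stands in for a 32-bit
signed integer that silently wraps around.

```python
import ctypes

direct_io_factor = 8

# Values of recl as printed on the two machines.
recl_ibm = 415578000
recl_home = 97079200

for recl in (recl_ibm, recl_home):
    exact = recl * direct_io_factor          # Python integers do not overflow
    wrapped = ctypes.c_int32(exact).value    # keep the low 32 bits, read as signed
    print(recl, exact, wrapped)

# The large run wraps to the negative value seen in the output,
# the small run fits and is unchanged.
ibm_wrapped = ctypes.c_int32(recl_ibm * direct_io_factor).value     # -970343296
home_wrapped = ctypes.c_int32(recl_home * direct_io_factor).value   # 776633600
```

the wrapped value is 3324624000 - 2^32 = -970343296, which is exactly
the unf_recl reported for the big run.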

MV> On my home cluster, I used a parallelized espresso-4.0.3 on system
MV> "Intel Xeon E5410 @ 2.33GHz, 16 GB RAM" with ifort 10.1.015, intel mkl
MV> libraries 10.0.1.014 and openmpi-1.2.6, with a smaller but similar
MV> system (same pseudos, same cutoff, only gamma point). As I said, there
MV> is no "wrong record length" error there, and I got the following values:
MV> 
MV> recl:   97079200
MV> DIRECT_IO_FACTOR:          8
MV> unf_recl:  776633600
MV> 
MV> If I'm right... 415578000*8 = 3324624000 which is bigger than the
MV> largest value of a signed 32 bit integer, maybe that causes the
MV> problem?

exactly. 

the interesting question is now, how to work around this problem.
you could try and declare unf_recl as integer*8 and try to recompile.
perhaps, even just removing the test for negative unf_recl might work,
but i doubt it. 
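
to get a feeling for the headroom: a rough python sketch, again not
espresso code. the helper names max_recl and fits are made up for this
example. it only checks whether recl times the factor fits into a signed
integer of a given width; whether the compiler's OPEN statement accepts
such a record length is a separate question that this does not answer.

```python
INT32_MAX = 2**31 - 1   # largest value of a signed 32-bit integer
INT64_MAX = 2**63 - 1   # largest value of a signed 64-bit integer

def max_recl(factor, int_max):
    """Largest recl for which recl * factor still fits below int_max."""
    return int_max // factor

def fits(recl, factor, int_max):
    """True if recl * factor can be stored without overflow."""
    return recl * factor <= int_max

factor = 8  # DIRECT_IO_FACTOR from the printout

print(max_recl(factor, INT32_MAX))          # limit for a 32-bit unf_recl
print(fits(97079200, factor, INT32_MAX))    # small run
print(fits(415578000, factor, INT32_MAX))   # big run, 32-bit
print(fits(415578000, factor, INT64_MAX))   # big run, 64-bit
```

with a factor of 8, a 32-bit unf_recl can hold a recl of at most
268435455; the big run is above that, the small run below.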

good luck,
   axel.

MV> Thanks for your help,
MV> Marton
MV> 

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.