[Pw_forum] wrong record length

Marci vormar at gmail.com
Mon Mar 2 21:05:14 CET 2009


Hi Axel,

> marton,
>
> are you trying to run the postprocessing on your local
> machine or on the IBM machine?

On the IBM machine. I have had bad experiences with postprocessing on a
different machine, because it means using the iotk package, and
converting binary files to text files and back is quite time
consuming... (and I hate copying gigabytes of files over ssh)

> that depends on what is causing this. it could just be that you
> have an integer overflow, due to the size of your system, or it
> could be that you try to read unformatted data on a different
> endian machine. i would suggest you insert a print statement into
> the code that prints out the values of DIRECT_IO_FACTOR and recl
> as well as unf_recl and then get back to us with the information
> about the architectures and these numbers (ideally also for the
> smaller test, where it worked).

Unfortunately, the espresso I'm using on BASSI was not compiled by
me, and now I'm scared of compiling my own because I'm not sure that
it will be able to read the binary files written by an espresso that
was probably compiled with different compilers and/or compiler options.
Yeah, I know... I should have compiled my own version of quantum
espresso before running serious calculations to avoid these
situations.

So... I made some changes to diropn.f90 in espresso4.0/PW and compiled
my own version of espresso (with this I get the same error) so that it
prints the values below for the big run. Honestly, I do not know much
about this cluster, but I'm sure I'm using the XL Fortran compiler
version 11.1.0.3 and the ESSL library version 4.2.0.3.

recl:  415578000
DIRECT_IO_FACTOR:          8
unf_recl: -970343296

On my home cluster, I used a parallelized espresso-4.0.3 on system
"Intel Xeon E5410 @ 2.33Ghz, 16 GB RAM" with ifort 10.1.015, intel mkl
libraries 10.0.1.014 and openmpi-1.2.6, and with a smaller but similar
system (same pseudos, same cutoff, only gamma point). As I said, there
is no "wrong record length" error there, and I got the following values:

recl:   97079200
DIRECT_IO_FACTOR:          8
unf_recl:  776633600

If I'm right... 415578000*8 = 3324624000, which is bigger than the
largest value of a signed 32-bit integer (2147483647). Maybe that
causes the problem?
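
To check the arithmetic, here is a small Python sketch (not part of
espresso). It assumes that unf_recl is computed as DIRECT_IO_FACTOR * recl
in a default 32-bit signed integer and that the overflow wraps around
modulo 2**32, which is the usual behaviour but is not guaranteed by the
Fortran standard. The helper name wrap_int32 is made up for this
illustration. Under those assumptions the big run gives exactly the
negative unf_recl printed above, while the small run stays in range.

```python
# Sketch: emulate a 32-bit signed multiplication that wraps on overflow.
# Assumption: unf_recl = DIRECT_IO_FACTOR * recl, stored in a 32-bit integer.

INT32_MAX = 2**31 - 1  # 2147483647, largest signed 32-bit value


def wrap_int32(value: int) -> int:
    """Reduce an exact integer to the signed 32-bit range (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


DIRECT_IO_FACTOR = 8

# recl values as printed by the two runs described in this mail
runs = {
    "BASSI, big run": 415578000,
    "home cluster, small run": 97079200,
}

for label, recl in runs.items():
    exact = recl * DIRECT_IO_FACTOR   # what the product should be
    wrapped = wrap_int32(exact)       # what a 32-bit integer would hold
    overflow = exact > INT32_MAX
    print(f"{label}: exact={exact} wrapped={wrapped} overflow={overflow}")
```

If this is indeed the cause, the record length would need to be held in
a 64-bit integer (or the record split into smaller pieces); which of
these fits diropn.f90 is something I have not checked.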

Thanks for your help,
Marton


