[Pw_forum] problem with parallel calculation
mbaris at metu.edu.tr
Mon Jul 9 07:16:36 CEST 2007
Dear Ezad,
> MPI_Recv: message truncated (rank 0, comm 9)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> Rank (0, MPI_COMM_WORLD): - main()
> how can i fix this?
>
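This indicates that your main node (rank 0) received a message whose
size differs from what it expected. "Message truncated" normally means
the incoming message was larger than the receive buffer, which should
not happen in a correctly running job. The "comm 9" in the message most
likely identifies the communicator rather than node 9. As far as I can
see, you are using LAM. It is hard to locate the exact source of this
kind of error (especially when using LAM), but here are some quick tips: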
1) Are you sure your nodes are initialized properly? Is there anything
   on your nodes (firewall etc.) that may be appending extra output?
2) Are you sure the ports LAM uses are assigned solely to LAM? (e.g.
   look in /etc/services)
3) Are all nodes executing the same code? (This matters when you are
   not using shared storage.)
4) Are all your nodes homogeneous? (i.e. same kernel, glibc and
   libraries, similar hardware)
5) One of your daemons may have died prematurely; check for CPU quotas.
   Your particular calculation may also have caused something like a
   segmentation fault on one of the nodes.
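
Tips 3 and 4 can be checked semi-automatically. The following is only a
minimal sketch in Python and has nothing to do with LAM or PWscf
themselves: the function names (local_fingerprint, find_mismatches) and
the dictionary layout are my own assumptions. The idea is to collect the
kernel release, the libc version and a checksum of the executable on
every node, and then see which values disagree.

```python
import hashlib
import platform
from collections import defaultdict


def local_fingerprint(executable_path):
    """Describe the node this runs on (hypothetical helper for tips 3 and 4).

    executable_path is the binary each node would launch, e.g. a local
    copy of pw.x; the path is supplied by the caller, not discovered.
    """
    with open(executable_path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    # libc_ver() may return empty strings on systems it cannot inspect.
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "libc": f"{libc_name} {libc_version}".strip(),
        "executable_sha256": digest,
    }


def find_mismatches(fingerprints):
    """Return {field: {value: [nodes]}} for each field that differs.

    fingerprints maps a node name to the dictionary produced by
    local_fingerprint() on that node. Fields on which all nodes agree
    are left out, so an empty result means no difference was detected.
    """
    mismatches = {}
    fields = sorted({name for fp in fingerprints.values() for name in fp})
    for field in fields:
        nodes_by_value = defaultdict(list)
        for node, fp in sorted(fingerprints.items()):
            nodes_by_value[fp.get(field)].append(node)
        if len(nodes_by_value) > 1:
            mismatches[field] = dict(nodes_by_value)
    return mismatches
```

One way to use it would be to run local_fingerprint() on each node (over
ssh, for instance), gather the resulting dictionaries on the head node
and pass them to find_mismatches(). A differing executable_sha256 points
at tip 3, a differing kernel or libc at tip 4. An empty result does not
prove the nodes are identical; it only covers the fields listed above.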
Hope this helps,
O. Baris Malcioglu
METU
Ankara