[Pw_forum] problem with parallel calculation

mbaris at metu.edu.tr mbaris at metu.edu.tr
Mon Jul 9 07:16:36 CEST 2007


Dear Ezad,

> MPI_Recv: message truncated (rank 0, comm 9)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD):  - MPI_Recv()
> Rank (0, MPI_COMM_WORLD):  - main()
> how can i fix this?
>


This indicates that your main node (rank 0) received a message of a
different size than it expected (from node 9?), which shouldn't happen.
"Message truncated" means the incoming message was larger than the
receive buffer. As far as I can see, you are using LAM. It is hard to
locate the exact source of this kind of error (especially when using
LAM), but here are some quick tips:

1) Are you sure your nodes are initialized properly? Is there anything
   on your nodes (firewall etc.) that may be appending extra output?
2) Are you sure the ports LAM uses are assigned solely to LAM? (e.g.
   look in /etc/services)
3) Are all nodes executing the same code (i.e. when you are not using
   shared storage)?
4) Are all your nodes homogeneous (i.e. same kernel, glibc, libraries,
   and similar hardware)?
5) One of your daemons may have died prematurely. Check for CPU quotas;
   your particular calculation may also have caused something like a
   segmentation fault on one of the nodes.
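
For points 3 and 4, one rough way to compare nodes is to run a small
script on each of them and diff the results. The following is only a
sketch, not a LAM tool: the function names (file_sha256,
node_fingerprint, differing_keys) are made up for illustration, and it
assumes Python 3 is available on every node. It records the kernel
release, machine type, libc version, and a checksum of the executable
you launch.

```python
import hashlib
import json
import platform
import sys


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large executables do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def node_fingerprint(executable):
    """Collect properties that should be identical on every node."""
    # libc_ver() inspects the Python interpreter binary; it returns
    # empty strings when the libc cannot be determined.
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "libc": f"{libc_name} {libc_version}".strip(),
        "executable_sha256": file_sha256(executable),
    }


def differing_keys(fingerprints):
    """Given {hostname: fingerprint}, return the keys whose values differ."""
    keys = set()
    for fingerprint in fingerprints.values():
        keys.update(fingerprint)
    return sorted(
        key
        for key in keys
        if len({fp.get(key) for fp in fingerprints.values()}) > 1
    )


if __name__ == "__main__":
    # Hypothetical usage: python fingerprint.py /path/to/your/executable
    report = {platform.node(): node_fingerprint(sys.argv[1])}
    print(json.dumps(report, indent=2))
```

Run it on every node with the path of the binary you start under LAM,
collect the JSON output, and pass the merged dictionary to
differing_keys. Any key it returns points at a node that differs from
the rest. A matching fingerprint does not rule out the other causes
above.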

Hope this helps,
O. Baris Malcioglu
METU
Ankara






