AC> Dear Forum Members,
AC>      We have encountered an Internal MPI error in a couple of jobs. The jobs
AC> are of medium size (around 60 atoms in the unit cell). After running for some
AC> time (one or two days), they were automatically killed with the following
AC> error:
AC> Fatal error in MPI_Alltoallv: Internal MPI error!, error stack:
AC> MPI_Alltoallv(407).: MPI_Alltoallv(sbuf=0x2a96197010, scnts=0x7fbfffdaf0,
AC> sdispls=0x7fbfffdb70, MPI_DOUBLE_COMPLEX, rbuf=0x2ad32ce010,
AC> rcnts=0x7fbfffdbf0, rdispls=0x7fbfffdc70, MPI_DOUBLE_COMPLEX,
AC> comm=0x84000002) failed
AC> MPI_Waitall(242)..........................: MPI_Waitall(count=64,
AC> req_array=0x3391b40, status_array=0x3391630) failed
AC> MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling
AC> an event returned by MPIDU_Sock_Wait()
AC> MPIDI_CH3I_Progress_handle_sock_event(413):
AC> MPIDU_Socki_handle_read(633)..............: connection failure
AC> (set=0,sock=8,errno=104:Connection reset by peer)

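this looks like an Ethernet timeout: "connection reset by peer"
(errno=104) means the TCP connection to another node was dropped
in the middle of the MPI_Alltoallv. have you checked whether
your switch can handle the load that the job creates?
i assume you are running on top of gigabit Ethernet.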
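To get a feel for the load an all-to-all puts on the switch, here is a rough back-of-envelope sketch. It assumes (an assumption, the log does not say so) that the count=64 in MPI_Waitall means one send and one receive request per rank, i.e. 32 processes. The 8 nodes x 4 ranks layout and all function names are hypothetical, for illustration only.

```python
# Hypothetical back-of-envelope estimate of the traffic pattern of one
# all-to-all exchange such as MPI_Alltoallv.


def ranks_from_waitall_count(count: int) -> int:
    """ASSUMPTION: one receive and one send request per rank (count = 2 * nprocs)."""
    if count % 2:
        raise ValueError("expected an even request count")
    return count // 2


def alltoall_messages(nprocs: int) -> int:
    """Point-to-point messages in one all-to-all: every rank sends to every other rank."""
    return nprocs * (nprocs - 1)


def internode_messages(nodes: int, ranks_per_node: int) -> int:
    """Messages that must cross the switch: all pairs minus pairs on the same node."""
    nprocs = nodes * ranks_per_node
    intranode = nodes * ranks_per_node * (ranks_per_node - 1)
    return alltoall_messages(nprocs) - intranode


if __name__ == "__main__":
    p = ranks_from_waitall_count(64)   # 32 under the stated assumption
    print(p, alltoall_messages(p))     # 32 992
    # Hypothetical layout: 8 nodes with 4 MPI ranks each.
    print(internode_messages(8, 4))    # 896
```

So under these assumptions a single exchange puts close to 900 simultaneous messages through the switch, and the job repeats that many times over a day or two; a cheap switch that drops packets under that kind of burst can produce exactly this sort of reset.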

have you tried the same input on a different machine 
or serially? what version of the code are you using?

please keep in mind that the more specifics you can
provide about what happened, the better the help people
can give you.


AC> Has anyone had this problem before? How can I avoid it?
AC> Thank you very much.
AC> Hanghui

