[Pw_forum] Internal MPI error
akohlmey at cmm.chem.upenn.edu
Wed Oct 3 01:22:55 CEST 2007
On Tue, 2 Oct 2007, alan chen wrote:
AC> Dear Forum Members,
AC> We have encountered an Internal MPI error in a couple of jobs. The jobs
AC> are of medium size (around 60 atoms in the unit cell). After they had run
AC> for some time (one or two days), they were automatically killed with the
AC> following error:
AC> Fatal error in MPI_Alltoallv: Internal MPI error!, error stack:
AC> MPI_Alltoallv(407).: MPI_Alltoallv(sbuf=0x2a96197010, scnts=0x7fbfffdaf0,
AC> sdispls=0x7fbfffdb70, MPI_DOUBLE_COMPLEX, rbuf=0x2ad32ce010,
AC> rcnts=0x7fbfffdbf0, rdispls=0x7fbfffdc70, MPI_DOUBLE_COMPLEX,
AC> comm=0x84000002) failed
AC> MPI_Waitall(242)..........................: MPI_Waitall(count=64,
AC> req_array=0x3391b40, status_array=0x3391630) failed
AC> MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling
AC> an event returned by MPIDU_Sock_Wait()
AC> MPIDU_Socki_handle_read(633)..............: connection failure
AC> (set=0,sock=8,errno=104:Connection reset by peer)
This looks like an ethernet timeout. Have you checked whether
your switch can handle the load that the job creates?
I assume you are running on top of gigabit ethernet.
Have you tried the same input on a different machine,
or serially? What version of the code are you using?
Please keep in mind that the more specifics you can
provide about what happened, the better the help people
can give you.
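To separate an interconnect problem from a problem in the application itself, it can help to hammer the network with the same collective that failed, outside of the real job. Below is a minimal, hypothetical MPI_Alltoallv stress test (plain MPI C, using MPI_DOUBLE rather than the complex buffers in the original trace; the message size and iteration count are arbitrary assumptions). If this loop also dies with a "connection reset by peer" after a while, the problem is almost certainly in the network fabric or switch, not in the simulation code.

```c
/* alltoallv_stress.c -- hypothetical stress test for MPI_Alltoallv.
 * Build:  mpicc -O2 alltoallv_stress.c -o alltoallv_stress
 * Run:    mpirun -np <N> ./alltoallv_stress
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Arbitrary per-pair message size (doubles); adjust to taste. */
    const int chunk = 1 << 16;

    double *sbuf = malloc((size_t)chunk * size * sizeof(double));
    double *rbuf = malloc((size_t)chunk * size * sizeof(double));
    int *scnts   = malloc(size * sizeof(int));
    int *rcnts   = malloc(size * sizeof(int));
    int *sdispls = malloc(size * sizeof(int));
    int *rdispls = malloc(size * sizeof(int));

    for (int i = 0; i < size; ++i) {
        scnts[i] = rcnts[i] = chunk;
        sdispls[i] = rdispls[i] = i * chunk;
    }
    for (int i = 0; i < chunk * size; ++i)
        sbuf[i] = (double)rank;

    /* Repeat the collective many times to stress the interconnect. */
    const int iterations = 1000;
    for (int iter = 0; iter < iterations; ++iter) {
        int err = MPI_Alltoallv(sbuf, scnts, sdispls, MPI_DOUBLE,
                                rbuf, rcnts, rdispls, MPI_DOUBLE,
                                MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            fprintf(stderr, "rank %d: MPI_Alltoallv failed at iter %d\n",
                    rank, iter);
            MPI_Abort(MPI_COMM_WORLD, err);
        }
    }

    if (rank == 0)
        printf("%d iterations of MPI_Alltoallv completed on %d ranks\n",
               iterations, size);

    free(sbuf); free(rbuf);
    free(scnts); free(rcnts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}
```

Running this across the same set of nodes for a day or so (increase `iterations` as needed) gives a much cleaner signal than waiting for the production job to fail again.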
AC> Has anyone had this problem before? How can I avoid it?
AC> Thank you very much.
Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
If you make something idiot-proof, the universe creates a better idiot.