Error handling feature - openucx/ucx GitHub Wiki

Arch overview

Most errors currently indicate the source of the problem, but it is unclear what each error implies for future operations. As a result, most errors are de facto fatal, because further use of the library after an error is not allowed.

The proposed solution is an error value convention designed to make clear the status of the subject following an error. It takes into account a possible future error-recovery feature for UCX that would perform some automatic internal error handling based on the error value.

The errors will be split into two levels, interface and endpoint:

Interface example errors:

Link failure (link is down)
Device failure (assertion failure, verbs error)

Endpoint example errors:

Remote disconnected
Timeout

Each error type will have a reserved range of error values, e.g. UCS_ERR_GENERAL_LINK_FAILURE to UCS_ERR_LAST_LINK_FAILURE. In the case of an interface failure, all underlying endpoints must fail on invocation from that moment until the problem is solved. This may be achieved by replacing the relevant function pointers with pointers to a stub that returns an error without doing anything else. Every message sent before performing a “flush” is assumed to have been sent successfully.
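The reserved-range convention can be sketched in C as follows. This is only an illustration: the two UCS_ERR_*_LINK_FAILURE names are taken from the text above, while every numeric value, the SKETCH_* names, and the device and endpoint ranges are assumptions made up for the example, not actual UCX definitions.

```c
#include <assert.h>

/* Only the two UCS_ERR_*_LINK_FAILURE names come from the proposal. Every
 * other name and every numeric value here is a hypothetical placeholder. */
typedef enum {
    SKETCH_OK                         =   0,
    UCS_ERR_GENERAL_LINK_FAILURE      = -40, /* first value of the link range */
    SKETCH_ERR_LINK_DOWN              = -41,
    UCS_ERR_LAST_LINK_FAILURE         = -59, /* last value of the link range */
    SKETCH_ERR_GENERAL_DEVICE_FAILURE = -60,
    SKETCH_ERR_LAST_DEVICE_FAILURE    = -79,
    SKETCH_ERR_GENERAL_EP_FAILURE     = -80,
    SKETCH_ERR_EP_TIMEOUT             = -81,
    SKETCH_ERR_LAST_EP_FAILURE        = -99
} sketch_status_t;

typedef enum {
    LEVEL_NONE,    /* not an error */
    LEVEL_IFACE,   /* interface-level failure: link or device */
    LEVEL_EP,      /* endpoint-level failure */
    LEVEL_UNKNOWN  /* error outside every reserved range */
} sketch_error_level_t;

/* Error codes are negative here, so a range runs from 'first' down to 'last'. */
static int sketch_in_range(sketch_status_t s, sketch_status_t first,
                           sketch_status_t last)
{
    assert(first >= last);
    return (s <= first) && (s >= last);
}

/* Map an error value to the level it affects, using only its range. */
static sketch_error_level_t sketch_error_level(sketch_status_t s)
{
    if (s == SKETCH_OK) {
        return LEVEL_NONE;
    }
    if (sketch_in_range(s, UCS_ERR_GENERAL_LINK_FAILURE,
                        UCS_ERR_LAST_LINK_FAILURE) ||
        sketch_in_range(s, SKETCH_ERR_GENERAL_DEVICE_FAILURE,
                        SKETCH_ERR_LAST_DEVICE_FAILURE)) {
        return LEVEL_IFACE;
    }
    if (sketch_in_range(s, SKETCH_ERR_GENERAL_EP_FAILURE,
                        SKETCH_ERR_LAST_EP_FAILURE)) {
        return LEVEL_EP;
    }
    return LEVEL_UNKNOWN;
}
```

With ranges like these, a caller or an internal recovery flow could decide whether an error affects a single endpoint or the whole interface without knowing every individual error code.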

Open issues

There are a few alternatives for notifying the user of errors (this may also relate to interrupt notification):

Calling a function will return a value corresponding to an error type
Add error parameter to callbacks used for completion notification
Add a separate callback for specific error types
Allow the setting of a global error handler for any error event

The suggested course of implementation is the last of these alternatives, a global error handler, but there should be a discussion to make sure all relevant needs are met.
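As a rough illustration of the global error handler alternative, the sketch below registers a single callback that is invoked for any error event. All names (sketch_set_error_handler, sketch_report_error, and so on) and the callback signature are hypothetical and chosen only for this example; they are not part of the UCX API.

```c
#include <assert.h>
#include <stddef.h>

typedef int sketch_status_t; /* stand-in for ucs_status_t; 0 means success */

/* Hypothetical handler signature: the failed subject (interface or endpoint),
 * the error value, and a user argument supplied at registration time. */
typedef void (*sketch_error_handler_t)(void *subject, sketch_status_t status,
                                       void *arg);

static sketch_error_handler_t sketch_handler     = NULL;
static void                  *sketch_handler_arg = NULL;

/* Install one global handler for any error event (NULL removes it). */
static void sketch_set_error_handler(sketch_error_handler_t cb, void *arg)
{
    sketch_handler     = cb;
    sketch_handler_arg = arg;
}

/* Called by the library wherever an error is detected. */
static void sketch_report_error(void *subject, sketch_status_t status)
{
    if ((status != 0) && (sketch_handler != NULL)) {
        sketch_handler(subject, status, sketch_handler_arg);
    }
}

/* Example user handler: count error events and remember the last status. */
typedef struct {
    int             count;
    sketch_status_t last;
} sketch_error_log_t;

static void sketch_log_error(void *subject, sketch_status_t status, void *arg)
{
    sketch_error_log_t *log = arg;

    (void)subject; /* unused in this sketch */
    assert(log != NULL);
    log->count++;
    log->last = status;
}
```

A user would call sketch_set_error_handler(sketch_log_error, &log) once at startup; the trade-off against per-callback error parameters is that the handler receives less context about which operation failed.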

Implementation plan

Below are the items planned for implementing the feature:

1. UCT error handling: make sure every error is propagated to the user. For async calls, a later call to flush will return the error. In user callbacks, a ucs_status_t argument will be added to reflect the possibility of an error.
2. UCP error handling: essentially the same as #1, but in UCP, based on the codes returned by UCT.
3. UCT internal error-setting: add something like ep_set_error(error_code), which sets most of the function pointers to a stub that returns this error_code from then on. It should not replace unrelated functions or the ep_destroy() call.
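The ep_set_error(error_code) idea from the plan above might look roughly like the following. The endpoint structure, the operation table, and all sketch_* names are assumptions for illustration and do not reflect the real UCT endpoint layout; only the behaviour (swap the data-path pointers to a stub, leave destroy alone) follows the text.

```c
#include <assert.h>
#include <stddef.h>

typedef int sketch_status_t; /* stand-in for ucs_status_t */
#define SKETCH_OK 0

typedef struct sketch_ep sketch_ep_t;

/* Hypothetical per-endpoint operation table. */
typedef struct {
    sketch_status_t (*ep_put)(sketch_ep_t *ep, const void *buf, size_t len);
    sketch_status_t (*ep_flush)(sketch_ep_t *ep);
    void            (*ep_destroy)(sketch_ep_t *ep);
} sketch_ep_ops_t;

struct sketch_ep {
    sketch_ep_ops_t ops;
    sketch_status_t error;     /* error returned by the stubs once set */
    int             destroyed; /* set by ep_destroy, for demonstration */
};

/* Normal operations (trivial placeholders for a real transport). */
static sketch_status_t sketch_put(sketch_ep_t *ep, const void *buf, size_t len)
{
    (void)ep; (void)buf; (void)len;
    return SKETCH_OK;
}

static sketch_status_t sketch_flush(sketch_ep_t *ep)
{
    (void)ep;
    return SKETCH_OK;
}

static void sketch_destroy(sketch_ep_t *ep)
{
    ep->destroyed = 1;
}

/* Stubs: return the stored error without doing anything else. */
static sketch_status_t sketch_put_failed(sketch_ep_t *ep, const void *buf,
                                         size_t len)
{
    (void)buf; (void)len;
    return ep->error;
}

static sketch_status_t sketch_flush_failed(sketch_ep_t *ep)
{
    return ep->error;
}

static void sketch_ep_init(sketch_ep_t *ep)
{
    ep->ops.ep_put     = sketch_put;
    ep->ops.ep_flush   = sketch_flush;
    ep->ops.ep_destroy = sketch_destroy;
    ep->error          = SKETCH_OK;
    ep->destroyed      = 0;
}

/* Put the endpoint into a failed state. ep_destroy is deliberately left
 * untouched so the user can still release the endpoint. */
static void sketch_ep_set_error(sketch_ep_t *ep, sketch_status_t error_code)
{
    assert((ep != NULL) && (error_code != SKETCH_OK));
    ep->error        = error_code;
    ep->ops.ep_put   = sketch_put_failed;
    ep->ops.ep_flush = sketch_flush_failed;
}
```

After sketch_ep_set_error(&ep, code), every put and flush on that endpoint returns code, while ep.ops.ep_destroy(&ep) still works. For an interface failure, the same call could be applied to each endpoint under the interface.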