Error handling - openucx/ucx GitHub Wiki

General error handling design

error reporting

  • All UCT/UCP functions will return a status code, which holds an error code when relevant.
    • An alternative or additional design is an error callback triggered from within the failing function. We are not going this way for now: the loss of context makes callbacks harder to use in low-level programming. For example, if the error is delivered only through a callback, with no return code, the caller needs setjmp/longjmp to exit the failing code path.
  • Errors are reported per-operation.
    • When an error is reported on an endpoint, that endpoint can be marked as problematic and should then be disposed of.
    • Other endpoints are unaffected.
    • It may or may not be possible to reconnect to the target process, or to use another transport/endpoint to reach that process (UCT should not do failover, UCP may).
    • When an operation reports an error, the contents of the destination buffer are undefined (that is, the local buffer for a get, the remote buffer for a put or AMO).
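
The per-operation reporting described above can be sketched as follows. This is a minimal illustration, not the UCX API: the `sketch_` types, status codes, and functions are hypothetical names, and the `reachable` flag only simulates a link failure. It shows an operation returning its error through the return code, the failing endpoint being marked, and other endpoints staying usable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes, for illustration only (not UCX definitions). */
typedef enum {
    SKETCH_OK              =  0,
    SKETCH_ERR_UNREACHABLE = -1, /* the operation could not reach the target */
    SKETCH_ERR_EP_FAILED   = -2  /* the endpoint was already marked as failed */
} sketch_status_t;

/* Hypothetical endpoint: 'reachable' simulates the link state,
 * 'failed' is the sticky "problematic" mark set by the first error. */
typedef struct {
    bool reachable;
    bool failed;
} sketch_ep_t;

/* A put-like operation that reports errors through its return code. */
sketch_status_t sketch_put(sketch_ep_t *ep, const void *buf, size_t len)
{
    assert(ep != NULL);
    (void)buf; /* the payload is not used by this simulation */
    (void)len;

    if (ep->failed) {
        return SKETCH_ERR_EP_FAILED;
    }
    if (!ep->reachable) {
        ep->failed = true; /* mark this endpoint only */
        return SKETCH_ERR_UNREACHABLE;
    }
    return SKETCH_OK;
}

/* Caller-side pattern: check every return code, keep going on the
 * other endpoints, and return the number of failed operations. */
int sketch_put_all(sketch_ep_t *eps, size_t n, const void *buf, size_t len)
{
    int failures = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_put(&eps[i], buf, len) != SKETCH_OK) {
            failures++; /* the remote buffer must be treated as undefined */
        }
    }
    return failures;
}
```

Because the error comes back as a return value, the caller keeps its local context and can decide on the spot whether to dispose of the endpoint; no setjmp/longjmp is needed.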

queuing and matching

  • UCP may fail over and try other UCT transports to complete the operation. We probably want to report the resulting performance degradation, and to give the user a way to disable failover altogether.
  • When a UCP error is returned, the endpoint is in an error state.
    • While an endpoint is in an error state, matching should stop.
    • Add a UCP function to resume matching: the upper layer has to decide whether the currently pending matching order still makes sense (if an ANY_SOURCE operation is pending, it may be necessary to interrupt everything).
    • If not, add a UCP function to shut down the endpoint and purge the pending/matching queue and the unexpected fragments of all messages relating to that endpoint.
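
The resume and shutdown paths above can be sketched with a toy matching queue. Everything here is an assumption made for illustration: the `sketch_` names, the fixed-size arrays, and the choice to keep ANY_SOURCE receives in the queue during a purge (the upper layer would still have to decide what to do with them). Unexpected fragments are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 8
#define SKETCH_MAX_EPS     4
#define SKETCH_ANY_SOURCE  (-1) /* receive not bound to a specific endpoint */

/* Hypothetical posted receive: only the source endpoint matters here. */
typedef struct {
    int ep_id;
} sketch_recv_t;

/* Hypothetical worker holding the pending queue and per-endpoint error flags. */
typedef struct {
    sketch_recv_t pending[SKETCH_MAX_PENDING];
    size_t        count;
    bool          ep_in_error[SKETCH_MAX_EPS];
} sketch_worker_t;

int sketch_post_recv(sketch_worker_t *w, int ep_id)
{
    if (w->count == SKETCH_MAX_PENDING) {
        return -1; /* queue full */
    }
    w->pending[w->count++].ep_id = ep_id;
    return 0;
}

void sketch_ep_set_error(sketch_worker_t *w, int ep_id)
{
    assert(ep_id >= 0 && ep_id < SKETCH_MAX_EPS);
    w->ep_in_error[ep_id] = true;
}

/* Matching is stopped for an endpoint while it is in error state. */
bool sketch_can_match(const sketch_worker_t *w, int ep_id)
{
    assert(ep_id >= 0 && ep_id < SKETCH_MAX_EPS);
    return !w->ep_in_error[ep_id];
}

/* Helper for the upper layer: is an ANY_SOURCE receive pending? */
bool sketch_has_any_source(const sketch_worker_t *w)
{
    for (size_t i = 0; i < w->count; i++) {
        if (w->pending[i].ep_id == SKETCH_ANY_SOURCE) {
            return true;
        }
    }
    return false;
}

/* Option 1: the upper layer decides the pending order still makes sense. */
void sketch_ep_resume(sketch_worker_t *w, int ep_id)
{
    assert(ep_id >= 0 && ep_id < SKETCH_MAX_EPS);
    w->ep_in_error[ep_id] = false;
}

/* Option 2: drop every pending receive bound to the endpoint, preserving
 * the order of the remaining entries. Returns the number purged. */
size_t sketch_ep_shutdown(sketch_worker_t *w, int ep_id)
{
    size_t kept = 0, purged = 0;
    for (size_t i = 0; i < w->count; i++) {
        if (w->pending[i].ep_id == ep_id) {
            purged++;
        } else {
            w->pending[kept++] = w->pending[i];
        }
    }
    w->count = kept;
    return purged;
}
```

In this sketch the upper layer would call `sketch_has_any_source` after an error to help choose between `sketch_ep_resume` and `sketch_ep_shutdown`.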

error types

  • Generally, the upper layer is responsible for determining whether a UCP error is a link error or a process error. However, if the transport provides introspection capabilities, more precise errors can be generated.
  • In most cases it is hard to determine whether a remote peer has actually failed or has just become disconnected (HCA error, out of credits, link, wire, or switch issue, ...). So in general, UCT functions are expected to return "UNREACHABLE"-type error codes.
  • Some UCT errors are temporary and may be corrected by other means (such as rebooting the HCA); those errors should have separate codes indicating the intended remediation. Errors from UCP should be only of the non-correctable kind.
  • Error code mockup list
    • UCT_ERR_UNREACHABLE: generic code, the target is unreachable
    • UCT_ERR_LNIC_FAILED: the local NIC has failed (uncorrectable)
    • UCT_ERR_LNIC_REBOOT: the local NIC has failed (correctable, need to re-init the transport)
    • UCT_ERR_RNIC_FAILED: the remote NIC has failed (uncorrectable)
    • UCT_ERR_RNIC_REBOOT: the remote NIC has failed (correctable, need to re-init the transport)
    • UCT_ERR_ROUTE_LOST: the switching infrastructure cannot route to the target
    • UCT_ERR_PROC_FAILED: the target process has failed (may never be returned for some transports)
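
As a sketch of how separate codes could carry the intended remediation, the mockup list can be written as a C enum with a lookup function. The code names come from the list above; the numeric values, the `sketch_` type names, and the remediation categories are assumptions made for illustration. Codes whose remediation the list does not state are reported as unspecified rather than guessed.

```c
#include <assert.h>

/* Mockup error codes from the list above; the values are arbitrary. */
typedef enum {
    UCT_ERR_UNREACHABLE = 1, /* generic: the target is unreachable */
    UCT_ERR_LNIC_FAILED,     /* local NIC failed (uncorrectable) */
    UCT_ERR_LNIC_REBOOT,     /* local NIC failed (re-init the transport) */
    UCT_ERR_RNIC_FAILED,     /* remote NIC failed (uncorrectable) */
    UCT_ERR_RNIC_REBOOT,     /* remote NIC failed (re-init the transport) */
    UCT_ERR_ROUTE_LOST,      /* the switching infrastructure cannot route */
    UCT_ERR_PROC_FAILED      /* the target process has failed */
} sketch_uct_err_t;

/* Hypothetical remediation categories derived from the list. */
typedef enum {
    SKETCH_FIX_UNSPECIFIED,      /* the list states no remediation */
    SKETCH_FIX_NONE,             /* marked uncorrectable in the list */
    SKETCH_FIX_REINIT_TRANSPORT  /* marked correctable by re-init */
} sketch_fix_t;

/* Map an error code to the remediation its list entry indicates. */
sketch_fix_t sketch_remediation(sketch_uct_err_t err)
{
    switch (err) {
    case UCT_ERR_LNIC_REBOOT:
    case UCT_ERR_RNIC_REBOOT:
        return SKETCH_FIX_REINIT_TRANSPORT;
    case UCT_ERR_LNIC_FAILED:
    case UCT_ERR_RNIC_FAILED:
        return SKETCH_FIX_NONE;
    default:
        return SKETCH_FIX_UNSPECIFIED;
    }
}
```

A UCP layer built on such codes could, for example, attempt a transport re-init only when `sketch_remediation` returns `SKETCH_FIX_REINIT_TRANSPORT`, and otherwise fall back to the generic unreachable handling.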