Every uct/ucp function will return a status code, which carries an error code when relevant.
An alternative (or additional) design is to have a callback for error cases, triggered from within the failing function. We are not going this way for now: the loss of context makes callbacks harder to use in low-level programming. For example, if an error is reported only through a callback, with no return code, the caller needs setjmp/longjmp to exit the failing code path, which is unappealing.
Errors are reported per-operation.
When an error is reported on an endpoint, that endpoint can be marked as problematic and should then be disposed of.
Other endpoints are unaffected.
It may or may not be possible to reconnect to the target process, or to use another transport/endpoint to reach that process (UCT should not do failover, UCP may).
When an operation reports an error, the contents of the destination buffer are undefined (that is, the local buffer in a get, the remote buffer in a put/amo).
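The per-operation reporting described above can be sketched as follows. This is only an illustration under stated assumptions: `mock_status_t`, `mock_ep_t` and `mock_put` are hypothetical names invented for the sketch, not part of the uct/ucp API, and the "network" is simulated with a flag.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status type; the real return type is not defined in these notes. */
typedef enum { MOCK_OK = 0, MOCK_ERR_UNREACHABLE = -1 } mock_status_t;

/* Hypothetical endpoint: 'reachable' simulates the link state. */
typedef struct {
    int reachable;
    int failed;     /* set once an operation on this endpoint reports an error */
} mock_ep_t;

/* Simulated put. On error the endpoint is marked as problematic and the
 * caller must treat the contents of dst as undefined. */
static mock_status_t mock_put(mock_ep_t *ep, void *dst, const void *src, size_t len)
{
    if (ep->failed || !ep->reachable) {
        ep->failed = 1;
        return MOCK_ERR_UNREACHABLE;
    }
    memcpy(dst, src, len);
    return MOCK_OK;
}
```

Because the error comes back as a return value, the caller can handle it at the call site, dispose of the failed endpoint, and keep using the others without any non-local jump.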
queuing and matching
UCP may fail over and try other UCT transports to complete the operation (we probably want to report a performance error in that case, and to give the user a way to disable failover altogether).
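A minimal sketch of such a failover loop, assuming transports are tried in order of preference. All names here (`mock_ucp_ep_t`, `mock_ucp_send`, the `allow_failover` flag, the two sample transports) are hypothetical and only illustrate the idea; how the performance error is actually reported is left open in these notes.

```c
#include <stddef.h>

typedef enum { MOCK_OK = 0, MOCK_ERR_UNREACHABLE = -1 } mock_status_t;

/* Hypothetical transport send entry point. */
typedef mock_status_t (*mock_send_fn)(const void *buf, size_t len);

typedef struct {
    const mock_send_fn *transports;  /* ordered by preference, best first */
    size_t              n;
    int                 allow_failover;  /* user-controlled switch */
} mock_ucp_ep_t;

/* Two sample transports for illustration: one always fails, one always works. */
static mock_status_t mock_send_broken(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return MOCK_ERR_UNREACHABLE;
}

static mock_status_t mock_send_working(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return MOCK_OK;
}

/* Try the preferred transport, then the others if failover is allowed.
 * On success *used_idx holds the index of the transport that completed the
 * operation; an index > 0 means a fallback transport was used. */
static mock_status_t mock_ucp_send(const mock_ucp_ep_t *ep, const void *buf,
                                   size_t len, size_t *used_idx)
{
    size_t limit = ep->allow_failover ? ep->n : (ep->n > 0 ? 1 : 0);
    mock_status_t st = MOCK_ERR_UNREACHABLE;
    size_t i;

    for (i = 0; i < limit; i++) {
        st = ep->transports[i](buf, len);
        if (st == MOCK_OK) {
            *used_idx = i;
            return MOCK_OK;
        }
    }
    return st;  /* last error seen; *used_idx is left untouched */
}
```

In this sketch the caller can compare `*used_idx` against 0 to detect that a slower path was taken, which is one possible way to surface the performance problem.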
When a UCP error is returned, the endpoint is in error state.
When an endpoint is in error state, matching should stop.
Add a UCP function to resume matching: the upper layer has to decide whether the currently pending matching order still makes sense (if an ANY_SOURCE operation is in the matching queue, it may be necessary to interrupt everything).
If it does not, add a UCP function to shut down the endpoint and purge the pending/matching queue and the unexpected fragments of all messages relating to that endpoint.
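The stop/resume/purge behaviour above can be sketched with a toy pending queue. Everything here is hypothetical (`mock_worker_t`, `mock_ep_error`, `mock_resume_matching`, `mock_ep_shutdown`, the fixed-size queue); in particular, this sketch assumes that purging does not resume matching by itself, so the upper layer calls the resume function explicitly.

```c
#include <stddef.h>

#define MOCK_QUEUE_MAX 16

typedef struct {
    int ep_id;  /* endpoint the pending receive relates to */
    int tag;
} mock_pending_t;

typedef struct {
    mock_pending_t entries[MOCK_QUEUE_MAX];
    size_t         count;
    int            matching_stopped;  /* set when an endpoint enters error state */
} mock_worker_t;

/* Queue a pending receive; returns 0 on success, -1 if the queue is full. */
static int mock_post_recv(mock_worker_t *w, int ep_id, int tag)
{
    if (w->count == MOCK_QUEUE_MAX) {
        return -1;
    }
    w->entries[w->count].ep_id = ep_id;
    w->entries[w->count].tag   = tag;
    w->count++;
    return 0;
}

/* An endpoint reported an error: stop matching until the upper layer decides. */
static void mock_ep_error(mock_worker_t *w)
{
    w->matching_stopped = 1;
}

/* The upper layer decided the pending matching order still makes sense. */
static void mock_resume_matching(mock_worker_t *w)
{
    w->matching_stopped = 0;
}

/* Returns 1 if a pending entry matched (and was removed, preserving order),
 * 0 if nothing matched, -1 if matching is currently stopped. */
static int mock_match(mock_worker_t *w, int ep_id, int tag)
{
    size_t i;

    if (w->matching_stopped) {
        return -1;
    }
    for (i = 0; i < w->count; i++) {
        if (w->entries[i].ep_id == ep_id && w->entries[i].tag == tag) {
            for (; i + 1 < w->count; i++) {
                w->entries[i] = w->entries[i + 1];
            }
            w->count--;
            return 1;
        }
    }
    return 0;
}

/* Purge every pending entry relating to ep_id; returns the number purged. */
static size_t mock_ep_shutdown(mock_worker_t *w, int ep_id)
{
    size_t in, out = 0, purged;

    for (in = 0; in < w->count; in++) {
        if (w->entries[in].ep_id != ep_id) {
            w->entries[out++] = w->entries[in];
        }
    }
    purged   = w->count - out;
    w->count = out;
    return purged;
}
```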
error types
Generally, the upper layer is responsible for determining whether a UCP error is a link error or a process error. However, if the transport provides some introspection capabilities, more precise errors can be generated.
In general, it is hard to determine whether a remote peer has actually failed or has just become disconnected (HCA error, out of credits, link-wire switch issue, ...). UCT functions are therefore generally expected to return "UNREACHABLE" error codes.
Some UCT errors are temporary and may be corrected by other means (such as rebooting the HCA); those errors should have separate codes indicating the intended remediation. Errors from UCP should be only of the non-correctable kind.
Error code mockup list
UCT_ERR_UNREACHABLE: generic code, the target is unreachable
UCT_ERR_LNIC_FAILED: the local NIC has failed (uncorrectable)
UCT_ERR_LNIC_REBOOT: the local NIC has failed (correctable, need to re-init the transport)
UCT_ERR_RNIC_FAILED: the remote NIC has failed (uncorrectable)
UCT_ERR_RNIC_REBOOT: the remote NIC has failed (correctable, need to re-init the transport)
UCT_ERR_ROUTE_LOST: the switching infrastructure cannot route to the target
UCT_ERR_PROC_FAILED: the target process has failed (may never be returned for some transports)
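As a sketch, the mockup list could map to an enum plus small helpers that encode the intended remediation. The `UCT_ERR_*` names come from the list above; the numeric values, the `MOCK_UCT_OK` success value, the `mock_uct_status_t` type and both helper functions are assumptions made for illustration only.

```c
#include <stdbool.h>

typedef enum {
    MOCK_UCT_OK         =  0,  /* assumed success value, not in the list */
    UCT_ERR_UNREACHABLE = -1,  /* generic: the target is unreachable */
    UCT_ERR_LNIC_FAILED = -2,  /* local NIC failed, uncorrectable */
    UCT_ERR_LNIC_REBOOT = -3,  /* local NIC failed, re-init the transport */
    UCT_ERR_RNIC_FAILED = -4,  /* remote NIC failed, uncorrectable */
    UCT_ERR_RNIC_REBOOT = -5,  /* remote NIC failed, re-init the transport */
    UCT_ERR_ROUTE_LOST  = -6,  /* switching infrastructure cannot route */
    UCT_ERR_PROC_FAILED = -7   /* target process failed */
} mock_uct_status_t;

/* True only for the codes the list explicitly marks as correctable
 * by re-initializing the transport. */
static bool mock_uct_err_is_correctable(mock_uct_status_t st)
{
    switch (st) {
    case UCT_ERR_LNIC_REBOOT:
    case UCT_ERR_RNIC_REBOOT:
        return true;
    default:
        return false;
    }
}

/* True when the failure is attributed to the local NIC. */
static bool mock_uct_err_is_local(mock_uct_status_t st)
{
    return st == UCT_ERR_LNIC_FAILED || st == UCT_ERR_LNIC_REBOOT;
}
```

An upper layer could use such helpers to choose between re-initializing the transport and giving up on the endpoint.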