Checkpoint and restart design

This is a preliminary discussion about checkpoint and restart support in the UCX project. Some key points from the design call of Jun-3-2015:

There are two types of checkpoint protocols:

  • Coordinated
  • Uncoordinated (messages can be cached so they can be sent later)

It seems that in both cases we need a confirmation that a message has been delivered to the remote memory. This means we need an implementation of a strong "flush" that can provide such a guarantee (at the protocol level?). Also, it seems that we need functionality to "cancel" the state of the protocol and reset it back to its initial state, so the message can be restarted from the beginning.
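As a minimal sketch of what these two primitives could look like (all names here are hypothetical, not existing UCX calls; they only illustrate the proposed semantics):

```c
/* Sketch only: hypothetical ucx_strong_flush / ucx_protocol_reset primitives. */
#include <stdbool.h>

typedef struct endpoint endpoint_t;                 /* stand-in for a UCX endpoint */
typedef enum { STATUS_OK, STATUS_ERR } status_t;

/* Hypothetical: returns STATUS_OK only once every outstanding message on
 * `ep` is confirmed delivered to remote memory (protocol-level guarantee). */
status_t ucx_strong_flush(endpoint_t *ep);

/* Hypothetical: cancels any partially progressed protocol state on `ep`
 * and rewinds it to the initial state, so the message can be restarted
 * from the beginning after the restart. */
status_t ucx_protocol_reset(endpoint_t *ep);

/* Checkpoint preparation for one endpoint: prefer confirmed delivery,
 * otherwise rewind the protocol so the operation restarts cleanly later. */
static bool prepare_ep_for_checkpoint(endpoint_t *ep)
{
    if (ucx_strong_flush(ep) == STATUS_OK) {
        return true;                              /* everything delivered remotely */
    }
    return ucx_protocol_reset(ep) == STATUS_OK;   /* rewind; resend after restart  */
}
```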

UTK - please update

There is interest in supporting Checkpoint/Restart in UCX.

A first note: many transports are in the process of providing address virtualization/relocation services (the next OFED; I heard similar noises from the Portals people). That might ease one of the major inefficiencies faced in previous MPI C/R support, namely the need to shut down the transport before taking the checkpoint and to restart it later (with all the cost of re-establishing connections).

In contrast, we would like to have no disconnect during the checkpoint, and to delay cleanup/connection re-establishment to the restart phase.

Checkpoints in a distributed system are more than an ad-hoc collection of process checkpoints. There needs to be some synchronization for the checkpoint set to be consistent (i.e. one from which a restart makes sense). Most application checkpoints (as well as most system-level checkpoints for MPI) use coordinated checkpointing to build a coherent recovery set. Message logging is an alternative, but it is not as often deployed in production. (Read this if you want to dive deeper: http://dl.acm.org/citation.cfm?id=1869385)

Coordinated checkpointing comes in two kinds, blocking and non-blocking (more info here: http://dl.acm.org/citation.cfm?id=1188587). In blocking CR, the network is silenced. Once all processes know that the network is silent, they checkpoint. In non-blocking CR, messages are colored (black or white) so that the receiver can detect messages that are crossing the checkpoint line. These in-transit messages are added to the checkpoint and their delivery is delayed (this is the Chandy-Lamport algorithm). Non-blocking CR has lower overhead in theory, but there are implementation caveats that can make it more expensive in practice. It could be worth investigating what we can do in UCX to make non-blocking CR better.
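A rough receiver-side illustration of the coloring idea, assuming illustrative types (message_t) and helpers (log_to_checkpoint, deliver); this is not UCX code:

```c
/* Sketch only: Chandy-Lamport style handling of colored messages at the receiver. */
typedef enum { WHITE, BLACK } color_t;   /* WHITE = before the checkpoint line */

typedef struct {
    color_t color;                       /* color of the sender at send time   */
    /* ... payload ... */
} message_t;

static color_t local_color = WHITE;      /* flips to BLACK once we checkpoint  */

void log_to_checkpoint(const message_t *m);  /* record an in-transit message   */
void deliver(const message_t *m);            /* normal delivery to the user    */

void on_receive(const message_t *m)
{
    if (local_color == BLACK && m->color == WHITE) {
        /* The message crossed the checkpoint line: add it to the checkpoint
         * and delay its delivery until after the checkpoint completes. */
        log_to_checkpoint(m);
    } else {
        deliver(m);
    }
}
```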

In more technical terms, in a coordinated checkpoint a process can checkpoint safely when it knows that it is not going to receive more messages from a process that has not checkpointed yet. The checkpoint procedure is collective and is, at macro-scale, a barrier over the processes that are currently connected. A sequence of "quiet_and_remote_notify_to_all()" at all origins can get us there; alternatively, the process that wants to checkpoint can issue "stop_sending_and_drain_from_all()", which also has the effect of ensuring that nobody else is injecting new messages (then a "resume_sending_from_all()" is needed). The latter also better supports uncoordinated (message-logging-based) CR.
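A sketch of the two variants, reusing the hypothetical names from the paragraph above; none of these calls exists in UCX, they only outline the control flow:

```c
/* Sketch only: two ways to quiesce before a coordinated checkpoint. */
void quiet_and_remote_notify_to_all(void);   /* flush my sends, notify each target   */
void wait_until_all_peers_notified_me(void); /* all incoming channels reported empty */
void stop_sending_and_drain_from_all(void);  /* stop peers injecting, drain incoming */
void resume_sending_from_all(void);          /* re-enable injection afterwards       */
void take_local_checkpoint(void);

/* Variant A: every origin flushes and notifies; checkpoint once every
 * process connected to us has reported an empty channel. */
void coordinated_ckpt_notify_variant(void)
{
    quiet_and_remote_notify_to_all();
    wait_until_all_peers_notified_me();
    take_local_checkpoint();
}

/* Variant B: the checkpointing process itself stops and drains incoming
 * traffic; this also maps better onto uncoordinated (message-logging) CR. */
void coordinated_ckpt_drain_variant(void)
{
    stop_sending_and_drain_from_all();
    take_local_checkpoint();
    resume_sending_from_all();
}
```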

During the restart, the process reloaded from the checkpoint needs to clean up the stale state of the communication library (rkeys that are no longer current, open connections that are not actually "open" anymore, dangling half-received or half-sent messages, etc.). Note that not all messages can be discarded: some messages have been completely sent (from the point of view of the sender), and even though they are still "pending" in the library, they need to be kept and delivered at the target. At the UCT level, since the layer is (almost?) stateless, it should be easy to make it quiet, and there is not much stale state to clean up during the restart. However, it is not as easy for UCP, which has advanced rendezvous protocols that may need to be interrupted to take the checkpoint (this is a mandatory requirement to ensure progress with MPI: you can post an isend and postpone posting the wait until much later, possibly dependent upon receiving another message from somebody else; it should be the same with UCX messages). The receiver may then hold interrupted half-rendezvous messages, for which it received the eager part but did not send the RDV packet. It may be possible to interrupt this protocol at a well-defined point, in which case these operations can simply be continued, or it may be necessary to restart the full rendezvous protocol from the beginning.
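One possible way to structure the restart-time cleanup, purely as a sketch under the assumptions above; the classification enum and helpers are illustrative, not UCX types:

```c
/* Sketch only: classify pending operations found after reloading a checkpoint. */
typedef enum {
    OP_SENDER_COMPLETE,   /* fully sent from the sender's view: must still
                             be kept and delivered at the target           */
    OP_RNDV_INTERRUPTED,  /* eager part received, RDV packet not sent yet:
                             either continue or restart the rendezvous     */
    OP_STALE              /* stale rkeys, dead connections, half state:
                             discard and re-establish lazily               */
} op_class_t;

typedef struct pending_op {
    op_class_t         cls;
    struct pending_op *next;
} pending_op_t;

void redeliver(pending_op_t *op);                /* keep and deliver at target */
void resume_or_restart_rndv(pending_op_t *op);   /* continue or redo the RDV   */
void discard(pending_op_t *op);                  /* drop stale state           */

void restart_cleanup(pending_op_t *ops)
{
    for (pending_op_t *op = ops; op != NULL; op = op->next) {
        switch (op->cls) {
        case OP_SENDER_COMPLETE:  redeliver(op);              break;
        case OP_RNDV_INTERRUPTED: resume_or_restart_rndv(op); break;
        case OP_STALE:            discard(op);                break;
        }
    }
}
```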

Last, we want to remember that we need to support both system-level and user-directed checkpointing. In user-directed checkpointing, the state of the communication library is not implicitly captured, hence we need to provide some interface to dump whatever we need to reinitialize the library state at restart, in case it is not a system checkpointer that restores the process.

== Pseudo code ideas ==

  • uct_flush(ep) <-- we may want to remove the explicit ep to enable overlap between multiple endpoints
  • ucp_flush_and_notify(ep) => the target is notified that the last message has been received (channel is empty), possibly by a callback at the target (register_flush_notify_callback)

Possibly this can all be implemented with existing constructs:

  • uct_fence() (do we have this?)
  • am_send(ep, FLUSH_CB_NOTIFY_TAG)

The target counts the callback invocations; when the count reaches the number of connected processes, it is safe to take a checkpoint.
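A sketch of that counting idea; the callback name and signature are illustrative (following the register_flush_notify_callback idea above), not an existing UCX interface:

```c
/* Sketch only: count FLUSH_CB_NOTIFY_TAG notifications at the target. */
static int num_connected_peers;   /* known at (re)connection time           */
static int num_flush_notifies;    /* bumped by the notify callback below    */

/* Invoked when a peer's FLUSH_CB_NOTIFY_TAG message arrives, meaning that
 * peer's channel towards us is now empty. */
void flush_notify_cb(void *arg)
{
    (void)arg;
    num_flush_notifies++;
}

/* Safe to take the local checkpoint once every connected peer has
 * signalled an empty channel. */
int safe_to_checkpoint(void)
{
    return num_flush_notifies == num_connected_peers;
}
```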

  • uct_flush(ctx): flush a context so we can restrict the scope, yet still overlap between multiple PEs
  • ucp_flush_and_notify(ctx)

== Restart ==

If restarted from a system-level checkpoint:

  • cleanup, reinit/replacement of addresses/keys
  • uct_dump_ckpt(fd), ucp_dump_ckpt(fd), uct_reload_ckpt(fd), ucp_reload_ckpt(fd): per-component "prepare for checkpoint" hooks to serialize the important state of each component (see the sketch below).
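A sketch of how those proposed per-component hooks could be driven from a user-directed checkpointer; the four *_ckpt calls are the hypothetical names from the bullet above, not existing UCX functions:

```c
/* Sketch only: drive the proposed dump/reload hooks around a checkpoint. */
#include <fcntl.h>
#include <unistd.h>

int uct_dump_ckpt(int fd);      /* serialize UCT state needed at restart   */
int ucp_dump_ckpt(int fd);      /* serialize UCP state (rkeys, eps, ...)   */
int uct_reload_ckpt(int fd);    /* rebuild UCT state from a checkpoint     */
int ucp_reload_ckpt(int fd);    /* rebuild UCP state from a checkpoint     */

/* User-directed checkpoint: dump the library state the application cannot
 * capture by itself. */
int dump_comm_state(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    int rc = (uct_dump_ckpt(fd) == 0 && ucp_dump_ckpt(fd) == 0) ? 0 : -1;
    close(fd);
    return rc;
}

/* Restart path: reload components in the same order they were dumped. */
int reload_comm_state(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    int rc = (uct_reload_ckpt(fd) == 0 && ucp_reload_ckpt(fd) == 0) ? 0 : -1;
    close(fd);
    return rc;
}
```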