Trick Checkpoint Restart - nasa/gunns GitHub Wiki

For background, see Trick's guide on checkpoint/restart knowledge and best practices.

This page describes how we make GUNNS compatible with Trick's checkpoint/restart capability.

  • The checkpointed-ness of a variable is controlled with the trick_chkpnt_io() field in the Trick comment in the variable declaration. For example:
double mVelocity; /**< (m/s) trick_chkpnt_io(**) This is not checkpointed, by overriding the I/O setting.
                                                This is how we usually NOT checkpoint. */
  • The trick_chkpnt_io() field is optional. When used, it goes after the units field and before the comment field.
  • If the trick_chkpnt_io() field is not specified, then the checkpointed-ness of a variable matches the Trick I/O field. For example:
double mPosition; /**<      (m)   This is checkpointed, because of the I/O setting.
                                  This is how we usually checkpoint. */
double mVelocity; /**< (**) (m/s) This is not checkpointed, because of the I/O setting. */
  • The trick_chkpnt_io() field overrides the Trick I/O field for determining checkpointed-ness. For example:
double mVelocity; /**< (**) (m/s) trick_chkpnt_io(*io) This is checkpointed, by overriding the I/O setting. */
  • For any pointers, the trick_chkpnt_io() field only controls the checkpointed-ness of the pointer itself (the address value), not the thing that is pointed to.

    • Checkpointing the pointer is appropriate when it is re-pointed to different objects during runtime as part of your model state (for example, switching between state objects in a state machine).
    • NOTE: For pointers to Trick-managed internal memory allocations (not those using the _EXT macros), you must checkpoint the pointer. Internal Trick-managed allocations are deleted and re-allocated during a checkpoint restart, and checkpointing the pointer allows Trick to update it to the new memory location after the re-allocation. If you fail to checkpoint such a pointer, it can point to the wrong memory after a restart, which leads to unpredictable behavior and usually crashes the sim.
  • For pointers to dynamic memory allocations, the pointer's trick_chkpnt_io() field does not control the checkpointed-ness of the dynamic memory itself.

    • If Trick knows about the dynamic memory (allocated with the TS_NEW_* macros instead of new/delete, etc.) then Trick always checkpoints and restarts the dynamic array; there's no way to prevent it.
    • If the array was allocated with the internal TS_NEW_* macros (not the _EXT ones), then you must checkpoint the pointer (see the note above). This is because for internally Trick-allocated arrays, Trick deletes and re-allocates them during the checkpoint load, and any pointers to that allocation must be checkpointed so that Trick can update them to point to the new allocation address.
    • If the array was allocated with the external macros (TS_NEW_*_EXT, etc.), then Trick does not delete and re-allocate the arrays, but it does save and load the array values to and from the checkpoint file during checkpoint/restart. This also means that you don't have to checkpoint the pointer to that allocation, since the address doesn't change during checkpoint/restart.
    • The only way to not checkpoint/restart a dynamic array at all is to hide it from Trick, i.e. allocate it with new/delete instead of the Trick Memory Manager (TMM).
      • But then it's not visible in Trick View (TV), so this is a trade-off.
    • We should always use the _EXT macros instead of the internal macros. This supports 'checkpoint migration', the practice of loading a checkpoint into a different version or build of an executable than the one it was saved from. This is mainly a TS21 requirement and not needed by other projects, but we should still support this capability for backward-compatibility with TS21.
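To illustrate why a pointer to an internally allocated array has to be checkpointed, here is a toy emulation in plain C++. ToyMemoryManager and all of its methods are hypothetical names invented for this sketch; they are not Trick's API or implementation. The sketch only mimics the delete/re-allocate behavior described above, assuming the manager re-points only the pointers it has been told about.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a memory manager. This is NOT Trick's implementation.
class ToyMemoryManager
{
    public:
        // Allocate a zero-filled "internal" array and return its address.
        double* allocate(std::size_t size) {
            mSize = size;
            mBlock.reset(new double[size]());
            return mBlock.get();
        }
        // Register a pointer as "checkpointed" so restart() can update it.
        void checkpointPointer(double** pointer) { mTracked.push_back(pointer); }
        // Emulate a checkpoint load: re-allocate the array, restore its values,
        // and re-point only the checkpointed pointers.
        void restart() {
            std::unique_ptr<double[]> fresh(new double[mSize]());
            for (std::size_t i = 0; i < mSize; ++i) {
                fresh[i] = mBlock[i];
            }
            mBlock = std::move(fresh);  // the old array is freed here
            for (double** pointer : mTracked) {
                *pointer = mBlock.get();
            }
        }
        // Address of the array as it exists now.
        const double* current() const { return mBlock.get(); }
    private:
        std::size_t               mSize = 0;  // Number of elements in the array.
        std::unique_ptr<double[]> mBlock;     // The managed array.
        std::vector<double**>     mTracked;   // Pointers to update on restart.
};
```

In this sketch, a pointer passed to checkpointPointer() follows the array to its new address after restart(), while any unregistered copy of the old address is left pointing at freed memory. That is the failure mode the note above warns about.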
  • Because the Trick I/O field defaults to I/O enabled, and checkpointed-ness defaults to match the Trick I/O field, Trick checkpoints everything by default.

    • This is actually bad - it adds unneeded bloat to the checkpoint files, and more things that can break in the restart.
  • What to checkpoint? Basically, inputs and state:

    • All inputs to your model
    • All changing, persistent state - terms that can change but also need to persist from one pass to the next
      • This generally includes any output of a numerical integration, when the output is an input to the next pass's integration
      • Counters (elapsedTime += dt; FrameCounter++) are also numerical integrations
      • Any other thing calculated by your model and needed to persist for output to others, or as an input next pass
      • This can also be thought of as an input to your model, even if it is an attribute of the same class - it's an input from last pass
    • Any pointers to Trick-managed internal dynamic allocations (see above)
    • NOT constants or things that never change, e.g. config data
    • NOT continuously re-calculated output that doesn't depend on its own value from last pass
    • NOT pointers to external (using the _EXT macros) dynamic allocations, or non-Trick managed (new/delete) dynamic allocations.
  • Sim Bus is NOT checkpointed

    • Therefore you should checkpoint your intended inputs from Sim Bus
  • Restart Jobs: All GUNNS networks have a restart function that should be called from a Trick restart job, if restart is needed.

    • The restart function clears model variables that are not checkpointed but have values left over from the prior run, and re-initializes some internal terms from the checkpointed values.
    • The restart job looks like this in the Trick Sim Object:
    ("restart") network.restart(); 
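As a minimal sketch of how the checkpointing rules and a restart function fit together in a model class, consider the example below. ExampleTank and all of its members are hypothetical and are not part of GUNNS; the Trick comments follow the field order described earlier on this page.

```cpp
// Hypothetical model for illustration only; not a GUNNS class.
class ExampleTank
{
    public:
        double mFlowIn;    /**< (kg/s)                      Input to the model; checkpointed. */
        double mMass;      /**< (kg)                        Integrated state; checkpointed. */
        double mVolume;    /**< (m3)    trick_chkpnt_io(**) Config data, never changes; not checkpointed. */
        double mDensity;   /**< (kg/m3) trick_chkpnt_io(**) Re-derived from mMass; not checkpointed. */
        double mMassDelta; /**< (kg)    trick_chkpnt_io(**) Scratch term, recomputed each pass; not checkpointed. */

        ExampleTank(const double volume, const double mass)
            : mFlowIn(0.0), mMass(mass), mVolume(volume),
              mDensity(mass / volume), mMassDelta(0.0) {}

        // Normal update pass: integrate the mass, then recompute the outputs.
        void update(const double dt) {
            mMassDelta = mFlowIn * dt;
            mMass     += mMassDelta;
            mDensity   = mMass / mVolume;
        }

        // Called from a Trick restart job after the checkpoint has been loaded.
        void restart() {
            mMassDelta = 0.0;              // clear a leftover, non-checkpointed term
            mDensity   = mMass / mVolume;  // re-initialize from the checkpointed state
        }
};
```

Here only mFlowIn and mMass need to be in the checkpoint file. Everything else is either constant or rebuilt by restart() and the next update pass.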

There are a couple of Trick options that we should always use in Trick simulations that use GUNNS and need checkpoint/restart. These are set in the Trick input file:

  • Turn off Trick's reduced checkpoint option. By default, Trick enables its reduced checkpoint option. This option omits zero values from the checkpoint file, in an attempt to reduce checkpoint file size. However, we can't use this option, because any checkpointable term that has a value of zero when the checkpoint is cut will not be saved, and so will not be restored to zero at checkpoint load. Add this to the input file to turn off the reduced checkpoint option:
trick.TMM_reduced_checkpoint(False)
  • Turn on Trick's expanded arrays checkpoint option. This saves arrays in the checkpoint file with a separate assignment statement for each index in the array, e.g. model.array[2] = 12.0, rather than as a single list assignment, e.g. model.array = {10.0, 12.0, 14.0}. Expanding the array to individual assignment statements helps checkpoint migration (a TS21 requirement), because it allows Trick to ignore the extra assignment statements in a checkpoint file that exceed the current array size. Add this to the input file to turn on the expanded arrays option:
trick_mm.mm.set_expanded_arrays(True)
  • The goal of checkpoint/restart is repeatability. I call this the A-B-C-B-C test:
    • Run from time A to time B, cut a checkpoint at B, then continue on, recording the trajectory to time C
    • Restart back to the checkpoint at B, and run to C again.
    • The trajectory from checkpoint B to C should exactly match the original run from B to C.
    • We should checkpoint the bare minimum needed to achieve this match.
    • If the run from the checkpoint diverges and ends up at some other state D instead of C, then we didn't checkpoint something we needed.
    • Since our models and modeled physics are so highly connected, a divergence in your model could be caused by a missing term anywhere in the chain/loop of inputs to your model.
    • Therefore, everything has to be checkpoint-correct for repeatability to work.
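The A-B-C-B-C test can be sketched with a toy model in plain C++. ToyModel, ToyCheckpoint and the helper functions below are hypothetical names made up for this illustration; the "checkpoint" here is just a struct copy, not a Trick checkpoint file. The sketch assumes a term left out of the checkpoint simply keeps its value from time C when the checkpoint is loaded.

```cpp
#include <vector>

// Hypothetical toy model: position and velocity under constant acceleration.
struct ToyModel {
    double mPosition = 0.0;  // State, needed from one pass to the next.
    double mVelocity = 0.0;  // State, needed from one pass to the next.
    void step(const double dt) {
        mVelocity += 1.0 * dt;        // constant 1.0 m/s2 acceleration
        mPosition += mVelocity * dt;
    }
};

// Toy checkpoint that may or may not include the velocity.
struct ToyCheckpoint {
    double position    = 0.0;
    double velocity    = 0.0;
    bool   hasVelocity = false;
};

ToyCheckpoint cutCheckpoint(const ToyModel& model, const bool includeVelocity) {
    ToyCheckpoint checkpoint;
    checkpoint.position    = model.mPosition;
    checkpoint.velocity    = model.mVelocity;
    checkpoint.hasVelocity = includeVelocity;
    return checkpoint;
}

// Terms that are missing from the checkpoint keep their current values.
void loadCheckpoint(ToyModel& model, const ToyCheckpoint& checkpoint) {
    model.mPosition = checkpoint.position;
    if (checkpoint.hasVelocity) {
        model.mVelocity = checkpoint.velocity;
    }
}

// Run the model for a number of steps and record the position trajectory.
std::vector<double> record(ToyModel& model, const int steps, const double dt) {
    std::vector<double> trajectory;
    for (int i = 0; i < steps; ++i) {
        model.step(dt);
        trajectory.push_back(model.mPosition);
    }
    return trajectory;
}

// Returns true if the second B-to-C run exactly matches the first.
bool abcbcTestPasses(const bool includeVelocity) {
    const double dt = 0.1;
    ToyModel model;
    record(model, 10, dt);                                              // A to B
    const ToyCheckpoint checkpoint = cutCheckpoint(model, includeVelocity);
    const std::vector<double> first = record(model, 10, dt);            // B to C
    loadCheckpoint(model, checkpoint);                                  // back to B
    const std::vector<double> second = record(model, 10, dt);           // B to C again
    return first == second;
}
```

With both state terms in the checkpoint, abcbcTestPasses(true) reports a match. Leaving the velocity out makes the second run start from B with the velocity from time C, so abcbcTestPasses(false) reports a divergence.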

[Figure: Checkpoint_Repeatability]

  • More reading about the TMM and checkpointing from Trick:

    • Trick's guide on checkpoint/restart knowledge and best practices.
    • trick/share/doc/trick/advanced/Trick_Memory_Manager_Overview.ppt
    • trick/share/doc/trick/advanced/Trick_Checkpointing.pptx
    • Ignore the material on DMTCP checkpointing; it is obsolete and we don't use it. We only use the ASCII checkpointing.
  • TBD go into more detail about the checkpoint file itself, give examples of what stuff looks like in the file, etc.