Architecture - arxanas/git-branchless GitHub Wiki

Crate documentation:

git-branchless is implemented in Rust. The package is called git-branchless, and it is implemented in a Rust crate called branchless.

Aside: in retrospect, the crate name branchless could be confused with some kind of library for high-performance branchless programming. Unfortunately, this library only aids high-velocity software development.

This page is intended for git-branchless developers or curious users. The main concepts are implemented in branchless::core.

Event log

git-branchless watches for events in the Git repo by installing various hooks (see branchless::hooks). These hooks add events to the event log, which is an ordered sequence of events stored on disk in a SQLite database. See the Event documentation for details about the types of events which can be recorded.

At present, on startup, git-branchless loads all events into memory, and then replays them to determine the current state of the repository (see EventReplayer). This could be slow if the user has done many operations and the event log is long.

Undo implementation

The undo feature is implemented by taking recent events from the event log and then applying their inverses. For example, if a commit A was rewritten to B, then the inverse operation is to rewrite B to A.

This might not be the best implementation, since some inverses don't make sense. For example, if the user rewrites a draft commit A into its upstream version contained in the main branch, should the inverse really rewrite a main branch commit into a draft commit? That results in the case of main branch commits being obsolete.

It might be best to introduce a dedicated "undo" event type, rather than attempt to invert previous events.

Checkpoints

Not yet implemented: To avoid performance problems when the event log is long, it should be possible to add "checkpoints" to the event log. A checkpoint would be a synthetic event that contains a copy of the repository state. Rather than replay all events in the event log, we can find the most recent checkpoint, load the repository state, and replay events only from that point. In this way, we can arbitrarily bound the number of events that need to be read and replayed in the worst case.

So far, I haven't hit performance problems with a few thousand local events, so I haven't prioritized this.

Comparison with the reflog

Git has a concept called "reference logs", or "reflogs" for short. A reflog is a history of events that happened to a single Git reference. This is pretty similar to our event log. In fact, the first version of git-branchless attempted to infer the repository history from the reflog for HEAD.

So why don't we use reflogs? Unfortunately, they have a number of shortcomings:

They don't store structured data. We have to guess what the event is doing based on the message.
For rewrite events, it's particularly difficult to figure out what the "old" version of the commit was. On the other hand, this information is directly exposed in the post-rewrite hook, if we wish to record it ourselves.
It's difficult to insert our own synthetic events into the reflog. Either due to a bug in libgit2/pygit2, or possibly due to a fundamental shortcoming in Git, I was unable to insert an entry for a reference which had a message but left the reference pointing to the same object as before.
Reflogs only exist for references which currently exist. Branches may be created and deleted. Once they're deleted, the reflog is also deleted!
- Usually, we use the reflog for HEAD to undo work. Sometimes the reflog for HEAD isn't touched, such as creating and deleting a branch pointing to a non-HEAD commit. In these cases, the historical reference information is entirely unrecoverable using reflogs.
- This means the user has to rely on unergonomic solutions to restore a deleted branch.
- In contrast, our git undo command can restore deleted branches — only because we don't delete event logs for deleted references.
Logically-related events in the same reflog aren't obviously grouped.
The ordering of events between different reflogs is difficult to ascertain. A single operation can end up creating reflog entries with the same timestamp in different reflogs, and it's not clear which logically happened first.
Reflogs can be edited or cleaned up.
- The user is free to modify the reflog as they like, which could break internal invariants, although this is probably not a big concern in practice.
- Git's garbage collection may prune reflog entries. It would be a better user experience for a git undo if we could tell the user "the operation cannot be undone because Git has garbage collected necessary objects", rather than have old events mysteriously missing from the history.

Related work

The reader might also be interested in Jujutsu, an experimental Git-compatible VCS which also has an "operation log".

I'm not aware of other source control systems which also use a general-purpose event log. Please update this section if you know of another one.

Commit evolution

git-branchless implements a basic version of Mercurial's Changeset Evolution feature.

For the implementation details, there's a good technical document here: https://www.mercurial-scm.org/doc/evolution/concepts.html

Normally, when a commit is amended or rebased, the result is an entirely new Git object, which has no direct relation to the old one. By leveraging the event log and the post-rewrite hook, we can record these relationships.

These are the important situations:

One commit is rewritten into one commit (e.g. a rebase or amend): the RewriteEvent has the old OID and the new OID.
Many commits are rewritten into one commit (e.g. an interactive rebase fixup or squash): There are multiple RewriteEvents with different old OIDs and the same new old OID.
One commit is rewritten into many commits (e.g. a split operation): a git split command is not implemented at the time of this writing, so there is no corresponding sequence of RewriteEvents.

Recording these events allows us to update the smartlog with the latest version of the commit, as well as undo these operations in a principled manner.

Segmented changelog

As of https://github.com/arxanas/git-branchless/commit/f6c540fea8392223d604c4994b081b603b3df850, the commit graph is based on Eden SCM's segmented changelog data structure. See the thread at https://github.com/quark-zju/gitrevset/issues/1 for more details. Some resources: