Architecture - arxanas/git-branchless Wiki
git-branchless is implemented in Rust. The package is called git-branchless, and it is implemented in a Rust crate called branchless.
Aside: in retrospect, the crate name branchless could be confused with some kind of library for high-performance branchless programming. Unfortunately, this library only aids high-velocity software development.
This page is intended for
git-branchless developers or curious users. The main concepts are implemented in
- Event log
- Commit evolution
- Segmented changelog
git-branchless watches for events in the Git repo by installing various hooks (see branchless::hooks). These hooks add events to the event log, which is an ordered sequence of events stored on disk in a SQLite database. See the Event documentation for details about the types of events which can be recorded.
At present, on startup, git-branchless loads all events into memory, and then replays them to determine the current state of the repository (see EventReplayer). This could be slow if the user has done many operations and the event log is long.
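To make the replay idea concrete, here is a minimal sketch. The `Event` variants, the `RepoState` struct, and the `replay` function are hypothetical simplifications invented for illustration; they are not the real `Event` or `EventReplayer` types, which record more information and read from SQLite.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified event type (an assumption for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Event {
    /// A new commit was created.
    Commit { oid: String },
    /// A commit was rewritten (e.g. amended or rebased).
    Rewrite { old_oid: String, new_oid: String },
    /// A reference was created/moved (`Some`) or deleted (`None`).
    RefUpdate { name: String, new_oid: Option<String> },
}

/// Hypothetical in-memory repository state derived from the event log.
#[derive(Debug, Default, PartialEq, Eq)]
struct RepoState {
    visible_commits: HashSet<String>,
    refs: HashMap<String, String>,
}

/// Replay every event, in order, starting from an empty state.
fn replay(events: &[Event]) -> RepoState {
    let mut state = RepoState::default();
    for event in events {
        match event {
            Event::Commit { oid } => {
                state.visible_commits.insert(oid.clone());
            }
            Event::Rewrite { old_oid, new_oid } => {
                // The old version is hidden; the new version takes its place.
                state.visible_commits.remove(old_oid);
                state.visible_commits.insert(new_oid.clone());
            }
            Event::RefUpdate { name, new_oid } => match new_oid {
                Some(oid) => {
                    state.refs.insert(name.clone(), oid.clone());
                }
                None => {
                    state.refs.remove(name);
                }
            },
        }
    }
    state
}
```

The cost of `replay` grows linearly with the length of the log, which is the performance concern described above.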
The undo feature is implemented by taking recent events from the event log and then applying their inverses. For example, if a commit A was rewritten to B, then the inverse operation is to rewrite B to A.
This might not be the best implementation, since some inverses don't make sense. For example, if the user rewrites a draft commit A into its upstream version contained in the main branch, should the inverse really rewrite a main branch commit into a draft commit? That would leave a main branch commit marked as obsolete.
It might be best to introduce a dedicated "undo" event type, rather than attempt to invert previous events.
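As a rough illustration of the inverse-based approach described above (not the dedicated "undo" event idea), the sketch below inverts the most recent events in reverse order. The `Event` variants and the `invert` and `undo_events` functions are hypothetical names made up for this example; the real event types differ.

```rust
/// Hypothetical, simplified event type (an assumption for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Event {
    /// A commit was rewritten from `old_oid` to `new_oid`.
    Rewrite { old_oid: String, new_oid: String },
    /// A reference moved from `old_oid` to `new_oid`; `None` means "does not exist".
    RefUpdate {
        name: String,
        old_oid: Option<String>,
        new_oid: Option<String>,
    },
}

/// Build the inverse of a single event by swapping its "old" and "new" sides.
fn invert(event: &Event) -> Event {
    match event {
        Event::Rewrite { old_oid, new_oid } => Event::Rewrite {
            old_oid: new_oid.clone(),
            new_oid: old_oid.clone(),
        },
        Event::RefUpdate { name, old_oid, new_oid } => Event::RefUpdate {
            name: name.clone(),
            old_oid: new_oid.clone(),
            new_oid: old_oid.clone(),
        },
    }
}

/// Produce the events that would undo the last `count` events,
/// most recent first.
fn undo_events(log: &[Event], count: usize) -> Vec<Event> {
    log.iter().rev().take(count).map(invert).collect()
}
```

Note that inverting a branch deletion yields a reference update that recreates the branch, which is only possible because the log still holds the deleted reference's old target.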
Not yet implemented: To avoid performance problems when the event log is long, it should be possible to add "checkpoints" to the event log. A checkpoint would be a synthetic event that contains a copy of the repository state. Rather than replay all events in the event log, we can find the most recent checkpoint, load the repository state, and replay events only from that point. In this way, we can arbitrarily bound the number of events that need to be read and replayed in the worst case.
So far, I haven't hit performance problems with a few thousand local events, so I haven't prioritized this.
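Since checkpoints are not implemented, the following is only a sketch of how they might work. Everything in it is an assumption: the `Checkpoint` variant, the `RepoState` struct, and the `replay_from_checkpoint` function are hypothetical and do not exist in the codebase.

```rust
use std::collections::HashSet;

/// Hypothetical repository state (an assumption for this sketch).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct RepoState {
    visible_commits: HashSet<String>,
}

/// Hypothetical event type with a synthetic `Checkpoint` variant.
#[derive(Debug, Clone)]
enum Event {
    Commit { oid: String },
    Rewrite { old_oid: String, new_oid: String },
    /// Synthetic event carrying a full copy of the state at that point.
    Checkpoint { state: RepoState },
}

/// Apply one event to the state.
fn apply(state: &mut RepoState, event: &Event) {
    match event {
        Event::Commit { oid } => {
            state.visible_commits.insert(oid.clone());
        }
        Event::Rewrite { old_oid, new_oid } => {
            state.visible_commits.remove(old_oid);
            state.visible_commits.insert(new_oid.clone());
        }
        // A checkpoint replaces whatever state was accumulated so far.
        Event::Checkpoint { state: snapshot } => *state = snapshot.clone(),
    }
}

/// Replay starting at the most recent checkpoint, or at the beginning if
/// there is none. Returns the state and the number of events processed.
fn replay_from_checkpoint(events: &[Event]) -> (RepoState, usize) {
    let start = events
        .iter()
        .rposition(|event| matches!(event, Event::Checkpoint { .. }))
        .unwrap_or(0);
    let mut state = RepoState::default();
    for event in &events[start..] {
        apply(&mut state, event);
    }
    (state, events.len() - start)
}
```

In a real implementation the search for the latest checkpoint would presumably be a database query rather than a scan over events already loaded into memory; the scan here just keeps the sketch self-contained.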
Comparison with the reflog
Git has a concept called "reference logs", or "reflogs" for short. A reflog is a history of events that happened to a single Git reference. This is pretty similar to our event log. In fact, the first version of
git-branchless attempted to infer the repository history from the reflog for
So why don't we use reflogs? Unfortunately, they have a number of shortcomings:
- They don't store structured data. We have to guess what the event is doing based on the message.
- For rewrite events, it's particularly difficult to figure out what the "old" version of the commit was. On the other hand, this information is directly exposed in the post-rewrite hook, if we wish to record it ourselves.
- It's difficult to insert our own synthetic events into the reflog. Either due to a bug in pygit2, or possibly due to a fundamental shortcoming in Git, I was unable to insert an entry for a reference which had a message but left the reference pointing to the same object as before.
- Reflogs only exist for references which currently exist. Branches may be created and deleted. Once they're deleted, the reflog is also deleted!
- Usually, we use the reflog for HEAD to undo work. Sometimes the reflog for HEAD isn't touched, such as creating and deleting a branch pointing to a non-HEAD commit. In these cases, the historical reference information is entirely unrecoverable using reflogs.
- This means the user has to rely on unergonomic solutions to restore a deleted branch.
- In contrast, our git undo command can restore deleted branches, but only because we don't delete event logs for deleted references.
- Logically-related events in the same reflog aren't obviously grouped.
- The ordering of events between different reflogs is difficult to ascertain. A single operation can end up creating reflog entries with the same timestamp in different reflogs, and it's not clear which logically happened first.
- Reflogs can be edited or cleaned up.
- The user is free to modify the reflog as they like, which could break internal invariants, although this is probably not a big concern in practice.
- Git's garbage collection may prune reflog entries. It would be a better user experience for git undo if we could tell the user "the operation cannot be undone because Git has garbage collected necessary objects", rather than have old events mysteriously missing from the history.
The reader might also be interested in Jujutsu, an experimental Git-compatible VCS which also has an "operation log".
I'm not aware of other source control systems which also use a general-purpose event log. Please update this section if you know of another one.
git-branchless implements a basic version of Mercurial's Changeset Evolution feature.
For the implementation details, there's a good technical document here: https://www.mercurial-scm.org/doc/evolution/concepts.html
Normally, when a commit is amended or rebased, the result is an entirely new Git object, which has no direct relation to the old one. By leveraging the event log and the post-rewrite hook, we can record these relationships.
These are the important situations:
- One commit is rewritten into one commit (e.g. a rebase or amend): the RewriteEvent has the old OID and the new OID.
- Many commits are rewritten into one commit (e.g. an interactive rebase squash): there are multiple RewriteEvents with different old OIDs and the same new OID.
- One commit is rewritten into many commits (e.g. a git split command is not implemented at the time of this writing, so there is no corresponding sequence of
Recording these events allows us to update the smartlog with the latest version of the commit, as well as undo these operations in a principled manner.
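As a sketch of how recorded rewrites could be used to find the latest version of a commit, the function below walks the rewrite events in log order. The `RewriteEvent` struct here is a simplified stand-in with only two fields, and `latest_version` is a hypothetical helper written for this example; neither reflects the actual types or resolution logic in the codebase.

```rust
/// Simplified stand-in for a rewrite event (an assumption for this sketch).
#[derive(Debug, Clone)]
struct RewriteEvent {
    old_oid: String,
    new_oid: String,
}

/// Follow rewrites in log order, starting from `oid`, and return the OID
/// of the most recent version. Commits that were never rewritten are
/// returned unchanged.
fn latest_version<'a>(events: &'a [RewriteEvent], oid: &'a str) -> &'a str {
    let mut current = oid;
    for event in events {
        // Because events are visited in the order they happened, a chain
        // such as A -> B -> C is followed naturally, and a later rewrite
        // of B back to A (e.g. from an undo) resolves to A.
        if event.old_oid == current {
            current = event.new_oid.as_str();
        }
    }
    current
}
```

With a squash recorded as two events sharing one new OID, both of the original commits resolve to the same result, so the smartlog would show a single commit in their place.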
As of https://github.com/arxanas/git-branchless/commit/f6c540fea8392223d604c4994b081b603b3df850, the commit graph is based on Eden SCM's segmented changelog data structure. See the thread at https://github.com/quark-zju/gitrevset/issues/1 for more details. Some resources: