git internals - ghdrako/doc_snipets GitHub Wiki

Repository

Within a repository, Git maintains two primary data structures: the object store and the index. All of this repository data is stored at the root of your working directory in a hidden subdirectory named .git.

The object store is designed to be efficiently copied during a clone operation as part of the mechanism that supports a fully distributed Version Control System. The index is transitory information, is private to a repository, and can be created or modified on demand as needed.

Object Store

Git places only four types of objects in the object store: blobs, trees, commits, and tags. These four atomic objects form the foundation of Git’s higher-level data structures.

  • Blobs: Each version of a file is represented as a blob. Blob, a contraction of “binary large object,” is a term that’s commonly used in computing to refer to some variable or file that can contain any data and whose internal structure is ignored by the program. A blob is treated as being opaque. A blob holds a file’s data but does not contain any metadata about the file or even its name.

  • Trees: A tree object represents one level of directory information. It records blob identifiers, path names, and a bit of metadata for all the files in one directory. It can also recursively reference other (sub)tree objects and thus build a complete hierarchy of files and subdirectories.

  • Commits: A commit object holds metadata for each change introduced into the repository, including the author, committer, commit date, and log message. Each commit points to a tree object that captures, in one complete snapshot, the state of the repository at the time the commit was performed. The initial commit, or root commit, has no parent. Most commits have one parent, although a merge commit references more than one.

  • Tags: A tag object assigns an arbitrary yet presumably human-readable name to a specific object, usually a commit.

Index

The index stores binary data and is private to your repository. The content of the index is temporary and describes the structure of the entire repository at a specific moment in time. More specifically, it provides a cached representation of all the blob objects which reflects the current state of the project you are working on.

The information in the index is transitory: the index is a dynamic stage between your project’s working directory (the file system) and the repository’s object store (the committed history). For this reason the index is also called the “staging area” or “staging directory”.
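The staging role described above can be sketched as a small model. This is an illustration only: ToyIndex and the blob ids are hypothetical, and the real index is a binary file that also caches file modes and stat data.

```python
class ToyIndex:
    """Hypothetical stand-in for the index: path -> id of the staged blob."""

    def __init__(self):
        self.entries = {}

    def stage(self, path, blob_id):
        # Roughly what staging a file does: record which blob the path maps to.
        self.entries[path] = blob_id

    def snapshot(self):
        # What the next commit would record if it were created right now.
        return dict(sorted(self.entries.items()))


index = ToyIndex()
index.stage("README", "blob-1")
index.stage("src/main.c", "blob-2")
first = index.snapshot()

index.stage("README", "blob-3")   # stage a newer version of README
second = index.snapshot()
```

The first snapshot is unaffected by later staging, which is the sense in which the index describes the project at one specific moment.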

When Git places a file into the object store, it does so based on the hash of the data (file content) and not on the name of the file (file metadata). Git tracks content instead of files.

If two separate files have exactly the same content, whether in the same or different directories, Git only stores a single copy of that content as a blob within the object store. Git computes the hash code of each file according solely to its content, determines that the files have the same SHA1 values and thus the same content, and places the blob object in the object store indexed by that SHA1 value. Both files in the project, regardless of where they are located in the user’s directory structure, use that same object for content.
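A minimal sketch of how a blob id is derived, assuming the default SHA-1 object format: Git hashes a short header ("blob", the content length, and a NUL byte) followed by the content. The function name git_blob_id and the file paths in the comments are made up for the illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id Git would assign to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two files in different directories with identical content (hypothetical paths).
id_a = git_blob_id(b"hello world\n")   # e.g. docs/a.txt
id_b = git_blob_id(b"hello world\n")   # e.g. src/b.txt
```

Because the file name never enters the hash, both files map to the same blob, and the object store needs only one copy.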

Git treats the name of a file as a piece of data that is distinct from the contents of that file. The names of files and directories come from the underlying filesystem, but Git does not really care about the names. Git merely records each pathname and makes sure it can accurately reproduce the files and directories from their content, which is indexed by a hash value. This set of information is stored in the Git object store as a tree object.
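To show where the names live, here is a simplified sketch of how a tree object for one directory is serialized and hashed. It assumes SHA-1 ids and handles only regular files (no subdirectories, whose sort order is slightly different); tree_id is a hypothetical helper.

```python
import hashlib


def tree_id(entries):
    """entries: list of (mode, name, hex blob id) for the files of one directory."""
    body = b""
    # Entries are stored sorted by name; each is "<mode> <name>\0" + raw 20-byte id.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    header = b"tree %d\0" % len(body)
    return hashlib.sha1(header + body).hexdigest()


EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"   # id of an empty file
original = tree_id([("100644", "a.txt", EMPTY_BLOB)])
renamed = tree_id([("100644", "b.txt", EMPTY_BLOB)])
```

Renaming the file changes the tree id but not the blob id: the name is data of the tree, not of the blob.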

Git uses an efficient storage mechanism called packfiles. Git also uses zlib, a free software library that implements the DEFLATE algorithm, to compress each object before storing it in its object store.
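The compression step can be sketched with Python’s zlib module. This models a single loose object (header plus content, compressed as a whole), not the packfile format; encode_loose and decode_loose are hypothetical names.

```python
import zlib


def encode_loose(obj_type: str, data: bytes) -> bytes:
    """Build "<type> <size>\\0<data>" and compress it with zlib (DEFLATE)."""
    store = obj_type.encode() + b" " + str(len(data)).encode() + b"\0" + data
    return zlib.compress(store)


def decode_loose(raw: bytes):
    """Inverse of encode_loose: decompress, then split header and content."""
    store = zlib.decompress(raw)
    header, _, data = store.partition(b"\0")
    obj_type, size = header.split(b" ")
    if int(size) != len(data):
        raise ValueError("size in header does not match content length")
    return obj_type.decode(), data


content = b"hello world\n" * 100
raw = encode_loose("blob", content)
```

Repetitive content such as this shrinks considerably, and decoding returns the original type and bytes unchanged.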

Commits

A commit is a snapshot of content (files and directories) that the user wants to store in the local repository. For each commit, Git generates and assigns an Id computed from the contents of the commit, the user’s message, and other metadata. Commits are immutable: they cannot be modified after they are created. So when the content of a commit or its message is changed (for example, by amending it), Git creates a new commit with a new Id from scratch, and the branch points to the new commit in place of the old one.
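Why a changed message produces a new Id can be seen from a sketch of how a commit object is hashed, assuming SHA-1 ids. The author line, timestamp, and messages below are invented for the example, and commit_id is a hypothetical helper that uses the same identity for author and committer.

```python
import hashlib


def commit_id(tree, parents, who, message):
    """Hash a commit the way Git lays it out: headers, blank line, message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + who, "committer " + who, "", message]
    body = ("\n".join(lines) + "\n").encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # id of an empty tree
WHO = "A U Thor <author@example.com> 1700000000 +0000"     # made-up identity

first = commit_id(EMPTY_TREE, [], WHO, "first commit")
reworded = commit_id(EMPTY_TREE, [], WHO, "first commit, reworded")
child = commit_id(EMPTY_TREE, [first], WHO, "second commit")
```

Since the parent Id is part of the hashed text, rewriting one commit also changes the Id of every commit built on top of it.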

Internally, commits are chained together: each commit records the Id of its parent commit(s), so every commit is linked to the commit that was submitted before it. Such a chain of commits is called a branch. A commit can have multiple parents (after a merge) and multiple children (where branches diverge).

Pointers

Git uses pointers to manage the organization of the branches and commits. We can use these pointers to access a specific point in the history of commits. There are two types of pointers:

  • The first type are automatic pointers. These pointers are constructed automatically by Git and used for managing branches. The most useful pointers in this category are
    • branch pointers and
    • HEAD pointers.

Branch pointers always point to the last commit on a branch, so when a new commit is inserted into a branch, the branch pointer is automatically updated to point to the new commit.

Git uses a HEAD pointer to mark the current active branch. In contrast to the branch pointers, which are fixed on the last commit, we can reposition HEAD to any commit on any branch.

  • The second type of pointers are labels. We can create labels for commits of interest such that instead of using their Ids, we can use labels to access them.
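The behavior of the two pointer types can be modeled in a few lines. This is a toy sketch: ToyRefs and the commit ids c1, c2, c3 are hypothetical, and real Git keeps these pointers as refs inside .git.

```python
class ToyRefs:
    """Hypothetical model of Git's pointers; commit ids are plain strings."""

    def __init__(self, branch, commit_id):
        self.branches = {branch: commit_id}   # branch pointer -> last commit
        self.labels = {}                      # label -> commit it was put on
        self.head = branch                    # HEAD names the active branch

    def commit(self, new_id):
        # A new commit moves only the branch that HEAD currently names.
        self.branches[self.head] = new_id

    def label(self, name):
        # A label is pinned to the current commit and does not move afterwards.
        self.labels[name] = self.branches[self.head]


repo = ToyRefs("MASTER", "c1")
repo.label("v1.0")
repo.commit("c2")
repo.commit("c3")
```

After the two commits, the branch pointer has followed the history to c3 while the label still marks c1.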

To access a commit, we can use:

  • absolute addressing - we can switch to a commit using its Id.
  • relative addressing - the commit position relative to an already known pointer is used.

Git supports three types of relative access:

  • the depth operator “~”,
  • the horizontal parent operator “^”, and
  • the range operator “..”.

We can access the same commit with different addressing patterns: MASTER~2 is the same as MASTER^1^1.

Example

The tilde operator “~” indicates a commit relative to a specific point on the same branch, following first parents. The left side of the operator is the target pointer, and the right side is the number of commits to go back from the target:

MASTER~1: One commit before MASTER
MASTER~3: Three commits before MASTER
6ca0867~2 means two commits before the commit with Id 6ca0867

And generally

Pointer~n: n commits before Pointer

Commits can have multiple parents. This happens when two or more branches are merged. We can specify the parent we need by using the caret operator “^”.

MASTER^1: First parent of the commit that MASTER points to
MASTER^2: Second parent of the commit that MASTER points to

And in general

Pointer^n: nth parent of Pointer
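The two operators can be combined, which a small resolver over a toy history makes concrete. The graph, the commit ids, and the resolve function are assumptions made for the illustration; Git’s own revision parser supports many more forms.

```python
import re

# Toy history: commit id -> ordered list of parent ids (hypothetical ids).
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "X": ["A"],          # commit on a side branch
    "M": ["C", "X"],     # merge commit: first parent C, second parent X
}
REFS = {"MASTER": "M"}


def resolve(expr: str) -> str:
    """Resolve expressions such as MASTER~2 or MASTER^2~1 to a commit id."""
    match = re.match(r"[^~^]+", expr)
    commit = REFS.get(match.group(0), match.group(0))
    for op, digits in re.findall(r"([~^])(\d*)", expr[match.end():]):
        n = int(digits) if digits else 1
        if op == "~":
            for _ in range(n):                  # go back n first-parent steps
                commit = PARENTS[commit][0]
        elif n > 0:                             # ^0 leaves the commit unchanged
            commit = PARENTS[commit][n - 1]     # pick the nth parent
    return commit
```

In this history, resolve("MASTER~2") and resolve("MASTER^1^1") both arrive at B, while resolve("MASTER^2") selects the side branch commit X.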

The double dot “..” is a range operator. It selects the commits that are reachable from the right-hand side but not from the left-hand side.

MASTER~4..MASTER~1: Selects the commits after MASTER~4 (the fourth commit before MASTER, which is itself excluded) up to and including MASTER~1 (the commit before MASTER)
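The range can be understood as a set difference of ancestors, sketched here on a linear toy history. The commit ids c0 to c5 and both helper functions are hypothetical.

```python
def ancestors(parents, start):
    """All commits reachable from start by following parent links (inclusive)."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def commit_range(parents, exclude, include):
    """Model of exclude..include: reachable from include but not from exclude."""
    return ancestors(parents, include) - ancestors(parents, exclude)


# Linear history c0 <- c1 <- c2 <- c3 <- c4 <- c5, with MASTER at c5.
parents = {"c0": [], **{f"c{i}": [f"c{i - 1}"] for i in range(1, 6)}}

# MASTER~4 is c1 and MASTER~1 is c4, so this models MASTER~4..MASTER~1.
selected = commit_range(parents, "c1", "c4")
```

The result contains c2, c3, and c4: the left endpoint is excluded and the right endpoint is included.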

All commits, the files and directories inside the commits, and all files inside the stage area have their own unique identifiers.

git log --oneline     # show commit Ids
git ls-tree 4e9d822   # show the contents of a commit: files and directories with their unique Ids
git show 8d29b8d0     # show a file using its unique Id from the previous command
git ls-tree d7a6244   # if the Id is a directory, show its contents

git ls-files --stage  # show the Ids of the files in the staging area

Comparing

Git supports four comparing algorithms – Myers, Patience, Minimal, and Histogram – to calculate differences between files. These algorithms support comparing files, commits, and branches.

git diff File1.txt               # differences between the stage area and the workspace
git diff --staged -- File1.txt   # differences between the last commit in the local repository and the stage area
git diff MASTER -- File1.txt     # differences between the commit MASTER points to and the workspace
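Which two snapshots each command compares can be modeled with plain dictionaries mapping paths to content ids. This sketch only reports which paths differ, not the line-level diff; the ids, the second file name, and changed_paths are made up.

```python
def changed_paths(left: dict, right: dict) -> list:
    """Paths whose content id differs between two snapshots."""
    return sorted(p for p in left.keys() | right.keys()
                  if left.get(p) != right.get(p))


# Hypothetical state: File2.txt was edited and staged, File1.txt edited but not staged.
last_commit = {"File1.txt": "a1", "File2.txt": "b1"}
stage_area = {"File1.txt": "a1", "File2.txt": "b2"}
workspace = {"File1.txt": "a2", "File2.txt": "b2"}

unstaged = changed_paths(stage_area, workspace)       # like: git diff
staged = changed_paths(last_commit, stage_area)       # like: git diff --staged
since_commit = changed_paths(last_commit, workspace)  # like: git diff MASTER
```

Here the unstaged comparison reports only File1.txt, the staged comparison only File2.txt, and the comparison against the last commit reports both.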

Branches

Git branches are pointers. When a new branch is created, nothing is recreated or copied into a new location. All commits remain in their place, and a new pointer is added on the commit chain.

Only one branch can be active at a time. Git uses a pointer named HEAD to mark the current active branch. When we switch between branches, HEAD is automatically switched in the background.

Normally HEAD points to the last commit on the current active branch; however, it does not have to stay on the last commit. We can move HEAD to any commit on the current active branch. This is useful when we need to return the contents of the workspace to a specific point in time.
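A toy model shows why creating a branch is cheap and what moving HEAD means. ToyRepo, the branch name feature, and the commit ids are hypothetical; the model ignores the workspace entirely.

```python
class ToyRepo:
    """Hypothetical model: branches are names for commits, HEAD selects one."""

    def __init__(self, branches, active):
        self.branches = dict(branches)       # branch name -> last commit id
        self.head = ("branch", active)       # HEAD normally names a branch

    def current_commit(self):
        kind, value = self.head
        return self.branches[value] if kind == "branch" else value

    def create_branch(self, name):
        # Nothing is copied: one new pointer to the current commit is added.
        self.branches[name] = self.current_commit()

    def switch(self, name):
        self.head = ("branch", name)

    def move_head(self, commit_id):
        # HEAD points directly at a commit; the branch pointers stay where they are.
        self.head = ("commit", commit_id)


repo = ToyRepo({"MASTER": "c3"}, "MASTER")
repo.create_branch("feature")
repo.switch("feature")
repo.move_head("c1")
```

After these steps both branches still point to c3, and only HEAD refers to the older commit c1.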

Commands

To get a detailed, categorized list of all the commands, type git help -a in your terminal. Git commands are categorized as follows:

  • Main Porcelain Commands (High level commands for routine Git operations)
  • Ancillary Commands (Commands that help query Git’s internal data store)
  • Low-level Commands (Plumbing Commands for internal Git Operations)
  • External Commands (Commands that extend the standard Git Operations)
  • Interacting with Others Commands (Commands that act as a bridge to other version control tools)
  • Command Aliases (Custom aliases created by users to mask complex Git commands)