Internals - GradedJestRisk/git-training GitHub Wiki

A great introduction, in video. What about this?

Reference

To allow offline work in a distributed system, every repository must be able to compute identifiers on its own and still agree with all the others. Therefore, Git uses content-based identification: an object's identifier is the SHA-1 hash of its content.

In Git, every time an object has to be specified, its reference is its SHA-1 hash (40 hexadecimal characters), which can be abbreviated to an unambiguous prefix (the shortened SHA-1).
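To make the content-based reference concrete, here is a minimal Python sketch of how a BLOB's hash can be computed outside Git. The function name git_blob_hash is made up for this page; the "blob <size>" header followed by a NUL byte is the part Git adds before hashing.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 Git assigns to a BLOB holding `content`."""
    # Git hashes a small header ("blob <size>" + NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

empty = git_blob_hash(b"")
print(empty)      # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(empty[:7])  # e69de29 (shortened SHA-1)
```

The result for empty content should match what git hash-object prints for an empty file, so any two repositories agree on this reference without talking to each other.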

Save diffs or full content?

If you save diffs, you use far less space. But you have high computational complexity: to check out version N of a file, you have to apply N-1 diffs in sequence.

If you save file content, you use far more space. But you have low computational complexity: to check out version N of a file, you only have to read one file. Git saves file content and can still save space because:

  • identical file contents are not duplicated (same BLOB);
  • objects are compressed, and packfiles additionally store similar objects as deltas.
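The two space savings above can be sketched with a toy in-memory store. This is only an illustration under assumptions: the ObjectStore class is hypothetical, it keeps objects in a dict instead of .git/objects, and it ignores packfiles and deltas.

```python
import hashlib
import zlib

class ObjectStore:
    """Toy in-memory stand-in for .git/objects (hypothetical, for illustration)."""

    def __init__(self):
        self.objects = {}  # reference (hex SHA-1) -> compressed object bytes

    def put_blob(self, content: bytes) -> str:
        data = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
        ref = hashlib.sha1(data).hexdigest()
        # Same content gives the same key, so it is stored only once.
        self.objects.setdefault(ref, zlib.compress(data))
        return ref

    def get_blob(self, ref: str) -> bytes:
        data = zlib.decompress(self.objects[ref])
        _header, _, content = data.partition(b"\0")
        return content

store = ObjectStore()
text = b"same line\n" * 1000
ref_a = store.put_blob(text)  # e.g. saved as a.txt
ref_b = store.put_blob(text)  # e.g. saved as b.txt
print(ref_a == ref_b)                         # True
print(len(store.objects))                     # 1
print(len(store.objects[ref_a]) < len(text))  # True
```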

Objects

General

Objects are stored in .git/objects; run watch -n .5 tree .git to see them being created.

3 main object types:

  • BLOB;
  • tree;
  • commit.
Content:
  • BLOB is raw data (no metadata such as the filename)
  • tree is a set of entries (hash tree), one per item in the folder
    • for a file: filename + permissions + BLOB's reference
    • for a sub-folder: folder name + reference to that sub-folder's tree
  • commit (aka snapshot, highest-level object in the repo)
    • root folder's reference (a tree)
    • parent commit's reference (backward in time)
    • author and committer metadata (name, email, timestamp) and the commit message
Remarks:
  • 2 files with the same content, different filenames
    • = 2 entries in the tree, each with its own filename, both with the same reference
    • = one BLOB (stored once), pointed to by that reference
    • this is surprising, because you really have two files
  • 2 folders with the same content:
    • always have the same tree reference, no matter the filesystem, because a tree's hash depends only on its entries
    • are pointed to by commits with the same reference only if the whole history matches: same people, same sequence of commits, same timestamps and messages
    • are pointed to by commits with different references if one step in the sequence has been split into 2 commits
    • this is surprising, because everything looks the same if you only look at the final state
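Both remarks can be checked with a small Python sketch that rebuilds tree and commit hashes by hand. The helper names, the file names and the identity string are made up for illustration, and the tree encoding is simplified (files only, entries sorted by name).

```python
import hashlib

def git_hash(kind: str, body: bytes) -> str:
    # Every object is hashed as "<type> <size>" + NUL byte + body.
    header = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries):
    """entries: list of (mode, filename, blob_ref); simplified, files only."""
    body = b""
    for mode, name, ref in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(ref)
    return body

def commit_body(tree_ref, parent_ref, who, message):
    lines = ["tree " + tree_ref]
    if parent_ref:
        lines.append("parent " + parent_ref)
    lines += ["author " + who, "committer " + who, "", message, ""]
    return "\n".join(lines).encode()

# Remark 1: two filenames, one BLOB reference.
blob = git_hash("blob", b"same content\n")
final_tree = git_hash("tree", tree_body([("100644", "a.txt", blob),
                                         ("100644", "b.txt", blob)]))

# Remark 2: same final tree, reached in one commit or in two.
who = "Jane Doe <jane@example.com> 1585207825 +0100"  # hypothetical identity
one_step = git_hash("commit", commit_body(final_tree, None, who, "Add a and b"))

step1_tree = git_hash("tree", tree_body([("100644", "a.txt", blob)]))
step1 = git_hash("commit", commit_body(step1_tree, None, who, "Add a"))
two_steps = git_hash("commit", commit_body(final_tree, step1, who, "Add b"))

print(one_step != two_steps)  # True: same tree, different commit references
```

The tree reference is identical in both histories; only the commit objects differ, because a commit also hashes its parent, metadata and message.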

Example

A folder contains 2 files:

  • README.md (empty)
  • index.js, containing console.log("hello, world !");
The last commit added the index.js file.

BLOB 648dda

console.log("hello, world !");
+ 1 empty BLOB (README.md)

Tree

100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	README.md
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	index.js <= why the same? e69de29 is the hash of the empty BLOB; a non-empty index.js should reference BLOB 648dda

Commit 0bc16

commit 0bc16eaa1e1782e399b9cb069e41ff0224a99234
tree 5145704886f0914c21d7c6f93856accde5ee80a0
parent c5a6b0640bcc130a14490c31bb40129e2b224bb1
author Pierre TOP <[email protected]> 1585207825 +0100
committer Pierre TOP <[email protected]> 1585207825 +0100

    Add index.js

Show content

List:

  • see object content: git show <REF>
  • see "pretty print" object content, e.g. for a commit: git show --pretty=raw <REF>
  • see tree "pretty print": git ls-tree <REF>
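To see what these commands read under the hood, here is a hedged Python sketch that writes and reads a loose object the way Git lays it out on disk: zlib-compressed, under objects/<first 2 hex chars>/<remaining 38>. The function names are invented, the demo uses a temporary folder instead of a real .git, the trailing newline in the file content is an assumption, and packed objects are not handled.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(git_dir, kind, body):
    data = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0" + body
    ref = hashlib.sha1(data).hexdigest()
    path = os.path.join(git_dir, "objects", ref[:2], ref[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return ref

def read_loose_object(git_dir, ref):
    """Return (type, body) for a full 40-character reference."""
    path = os.path.join(git_dir, "objects", ref[:2], ref[2:])
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
    header, _, body = data.partition(b"\0")
    kind, size = header.decode("ascii").split(" ")
    assert int(size) == len(body)  # the header records the body length
    return kind, body

with tempfile.TemporaryDirectory() as git_dir:
    ref = write_loose_object(git_dir, "blob", b'console.log("hello, world !");\n')
    kind, body = read_loose_object(git_dir, ref)

print(kind, body)  # blob b'console.log("hello, world !");\n'
```

Pointing read_loose_object at a real .git folder should work for objects that have not been packed yet.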

Branch

General:

  • reference to a commit ("tip of the branch")

Misc

  • git diff shows the difference between the working directory and the staging area
  • git diff --staged shows the difference between the staging area and the repository (last commit)

A branch is a pointer to one commit: a kind of "named reference" that maps a name (string) to a reference (hash). This commit is known as the "tip of the branch".
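The name-to-hash mapping is literally a small file. A minimal sketch, assuming the branch is stored as a loose ref under refs/heads (Git may also keep it in a packed-refs file, which this ignores); read_branch and the branch name main are made up for the demo, and the hash is the commit from the example above.

```python
import os
import tempfile

def read_branch(git_dir, name):
    """Return the commit hash a branch points to (loose refs only)."""
    path = os.path.join(git_dir, "refs", "heads", name)
    with open(path) as f:
        return f.read().strip()

with tempfile.TemporaryDirectory() as git_dir:
    # Fake a repository where branch "main" points at commit 0bc16.
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
        f.write("0bc16eaa1e1782e399b9cb069e41ff0224a99234\n")
    tip = read_branch(git_dir, "main")

print(tip)  # 0bc16eaa1e1782e399b9cb069e41ff0224a99234
```

Creating a commit on the branch amounts to rewriting this one line with the new commit's hash.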

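HEAD is another pointer: it designates the commit currently checked out, usually through the current branch, so it normally points to the last commit created. git reset --hard HEAD therefore restores the working directory and the staging area to that commit, discarding all uncommitted changes. Resetting to an earlier commit (e.g. git reset --hard HEAD~1) leaves the later commits unreferenced, and Git discards unreferenced commits over time; until then you can use git reflog to get the commit hash and reset to it again, or use the HEAD@{N} notation (the reference pointed to by HEAD N steps ago).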
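The "N steps ago" lookup can be sketched by parsing the reflog text directly. This assumes the usual layout of .git/logs/HEAD (one line per move of HEAD, oldest first, starting with the old and new hashes); the function head_at, the identity and the messages in the sample are made up, and only the two hashes come from the example above.

```python
def head_at(reflog_text, n):
    """Return the hash HEAD pointed to n moves ago, like HEAD@{n}."""
    lines = reflog_text.splitlines()
    entry = lines[-1 - n]  # the last line describes the most recent move
    _old_ref, new_ref = entry.split(" ", 2)[:2]
    return new_ref

PARENT = "c5a6b0640bcc130a14490c31bb40129e2b224bb1"
TIP = "0bc16eaa1e1782e399b9cb069e41ff0224a99234"
WHO = "Jane Doe <jane@example.com>"  # hypothetical identity

# Hypothetical reflog: two commits, then a reset back to the parent.
SAMPLE = "\n".join([
    "0" * 40 + " " + PARENT + " " + WHO + " 1585207000 +0100\tcommit: Add README.md",
    PARENT + " " + TIP + " " + WHO + " 1585207825 +0100\tcommit: Add index.js",
    TIP + " " + PARENT + " " + WHO + " 1585208000 +0100\treset: moving to HEAD~1",
])

print(head_at(SAMPLE, 0) == PARENT)  # True: where HEAD is now
print(head_at(SAMPLE, 1) == TIP)     # True: the commit left behind by the reset
```

In this sample, HEAD@{1} is the commit that no branch references any more, which is exactly the hash needed to reset to it again.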

git rm <file> is the same as:

  • rm <file>
  • git add <file> (stages the deletion)