Git for Java Developers - loefberg/nitwit GitHub Wiki

Behold Git - a database with snapshots. That's about it.

If we want to be a bit more technical we can say that it is a key-value database. You put in a binary blob and get a key back. The same binary blob will always return the same key. On top of that we implement snapshots by creating a file that lists all the keys we've put in, together with their file names. Then we put that file in the database too, and the key we get back is what we call a snapshot/commit. Anyone who wants to look at your snapshot only has to look up the key, get the list, and then get the files listed in there. If we make the key the SHA-1 hash of the binary blob we get "data integrity": a blob that has been changed no longer matches its key.
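The idea above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: `TinyStore` and its methods are hypothetical names, everything lives in memory, and the key is the SHA-1 of the raw bytes with no header, so the keys will not match the ones real Git produces.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the object database; not Git's on-disk format.
class TinyStore {
    private final Map<String, byte[]> objects = new HashMap<>();

    // Store a blob and return its key: the SHA-1 of the bytes, as 40 hex characters.
    String put(byte[] blob) {
        String key = sha1Hex(blob);
        objects.put(key, blob.clone());
        return key;
    }

    // Look a blob up by key; returns null if the key is unknown.
    byte[] get(String key) {
        return objects.get(key);
    }

    // A snapshot is just another blob: one "key fileName" line per file.
    String snapshot(Map<String, String> keysByFileName) {
        StringBuilder listing = new StringBuilder();
        keysByFileName.forEach((name, key) ->
                listing.append(key).append(' ').append(name).append('\n'));
        return put(listing.toString().getBytes(StandardCharsets.UTF_8));
    }

    static String sha1Hex(byte[] data) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-1").digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        TinyStore store = new TinyStore();
        Map<String, String> files = new LinkedHashMap<>();
        files.put("hello.txt", store.put("hello".getBytes(StandardCharsets.UTF_8)));
        files.put("world.txt", store.put("world".getBytes(StandardCharsets.UTF_8)));
        String commit = store.snapshot(files);
        System.out.println("snapshot key: " + commit);
        System.out.print(new String(store.get(commit), StandardCharsets.UTF_8));
    }
}
```

Putting the same bytes in twice returns the same key, and the snapshot key changes as soon as any listed file changes, because the listing itself changes.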

Learning Git can be a bit tricky. There is a lot that does not make sense if you look at it as a VCS. I gave that up, and decided to learn Git by implementing a Git client. Let's start with the Pro Git book by Scott Chacon and Ben Straub, and directly flip to one of the last chapters: 10 Git Internals

There are two sets of git commands: porcelain and plumbing. The git checkout and all the other things you've probably heard about are "porcelain" commands. We are not interested in them now. We'll start with the "plumbing" commands.

Storing and retrieving data

We create a new git project and look in the .git/objects directory.

# git init test
# cd test
# ls .git/objects
info  pack

Not much there. Two subdirectories but no files. We do what the book suggests and run

# echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4

This command took the content from stdin (--stdin) and wrote it to the database (-w). It then printed the key to stdout: the SHA-1 hash of a short header followed by the content. If we look in the .git/objects directory:

# find .git/objects -type f
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
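To see where that key comes from, here is a small sketch that hashes the header and the content by hand. It assumes the repository uses Git's default SHA-1 object format, and that the header is the string "blob", a space, the content length in bytes, and a NUL byte, as described in the Pro Git chapter. `BlobKey` and `blobKey` are names made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class BlobKey {
    // SHA-1 over "blob <length>\0" followed by the content, as 40 hex characters.
    static String blobKey(byte[] content) {
        byte[] header = ("blob " + content.length + "\0").getBytes(StandardCharsets.UTF_8);
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(header);
            md.update(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // echo appends a newline, so the stored content is 13 bytes long.
        byte[] content = "test content\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(blobKey(content));
    }
}
```

Run against the same 13 bytes that echo produced, this should print the key that git hash-object reported above.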

To limit the number of files directly in the .git/objects directory, Git creates a subdirectory named after the first two characters of the hash. Two hex characters limit the number of subdirectories to 256 (00 through ff). Presumably there can be a lot of files in each subdirectory, but Git has a mechanism for packing files together when it becomes cluttered (pack). We can now get the content back by running

# git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4
test content

Using the plumbing commands we read and write to the database.

cat-file

Jumping down a bit in the chapter to Object Storage we learn that the object files are compressed with DEFLATE, and the format is

format = header '\0' content
header = type ' ' length

Let's implement our own cat-file:

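import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.InflaterInputStream;

public class CatFile {
    public static void main(String[] args) throws Exception {
        Path gitDir = Paths.get("temp/test/.git");
        Path objectFile = getObjectPath(gitDir, "d670460b4b4aece5915caf5c68d12f560a9fe3e4");
        byte[] uncompressed = inflate(objectFile);
        // The header ends at the first NUL byte; everything after it is the content.
        int idx = indexOf(uncompressed, 0);
        if (idx < 0) {
            throw new IllegalStateException("No NUL byte found after the object header");
        }
        // The header ("type length") is read but not used yet.
        String header = new String(uncompressed, 0, idx, StandardCharsets.UTF_8);
        String content = new String(uncompressed, idx + 1, uncompressed.length - idx - 1, StandardCharsets.UTF_8);
        // print, not println: the content is written out exactly as stored.
        System.out.print(content);
    }

    // Objects live in objects/<first two hex chars>/<remaining 38 hex chars>.
    private static Path getObjectPath(Path gitDir, String hash) {
        return gitDir.resolve("objects").resolve(hash.substring(0, 2)).resolve(hash.substring(2));
    }

    // Decompress the whole object file into memory.
    private static byte[] inflate(Path file) throws IOException {
        try (var input = new InflaterInputStream(Files.newInputStream(file))) {
            return input.readAllBytes();
        }
    }

    // Index of the first occurrence of searchElement, or -1 if it is absent.
    private static int indexOf(byte[] arr, int searchElement) {
        for (int i = 0; i < arr.length; i++) {
            if (arr[i] == searchElement) {
                return i;
            }
        }
        return -1;
    }
}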
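The cat-file above reads the header but never inspects it. As a hedged sketch of what a stricter reader could do, the class below splits the header into type and length and checks the length against the actual content size. `ObjectHeader` and `parse` are hypothetical names, and the check is an assumption about what a careful client would want, not something the book prescribes.

```java
import java.nio.charset.StandardCharsets;

class ObjectHeader {
    final String type;
    final int length;

    ObjectHeader(String type, int length) {
        this.type = type;
        this.length = length;
    }

    // Parses "type SP length NUL content" from an already inflated object.
    static ObjectHeader parse(byte[] uncompressed) {
        int nul = 0;
        while (nul < uncompressed.length && uncompressed[nul] != 0) {
            nul++;
        }
        if (nul == uncompressed.length) {
            throw new IllegalArgumentException("no NUL byte after header");
        }
        String header = new String(uncompressed, 0, nul, StandardCharsets.US_ASCII);
        int space = header.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("no space in header: " + header);
        }
        String type = header.substring(0, space);
        int length = Integer.parseInt(header.substring(space + 1));
        int actual = uncompressed.length - nul - 1;
        if (length != actual) {
            throw new IllegalArgumentException("header says " + length + " bytes, found " + actual);
        }
        return new ObjectHeader(type, length);
    }

    public static void main(String[] args) {
        byte[] object = "blob 13\0test content\n".getBytes(StandardCharsets.UTF_8);
        ObjectHeader h = parse(object);
        System.out.println(h.type + " " + h.length);
    }
}
```

With the type in hand, a client could later decide how to display the content, since not every object is a plain blob of text.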

hash-object

Writing the object to the database is the reverse of reading it, with the added step that we have to calculate the SHA-1 hash. The hash is calculated on the header and the content.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.DeflaterOutputStream;

public class HashObject {
    public static void main(String[] args) throws Exception {
        Path gitDir = Paths.get("temp/test/.git");
        byte[] content = "what is up, doc?".getBytes(StandardCharsets.UTF_8);
        // Build header + NUL + content; this is what gets hashed and stored.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(String.format("blob %d\0", content.length).getBytes(StandardCharsets.UTF_8));
        out.write(content);
        out.close();
        byte[] fileContent = out.toByteArray();
        String hash = hash(fileContent);
        deflate(fileContent, makeObjectPath(gitDir, hash));
        System.out.println(hash);
    }

    private static final char[] HEX_ARRAY = "0123456789abcdef".toCharArray();

    // SHA-1 of the given bytes as 40 lowercase hex characters.
    private static String hash(byte[] content) {
        byte[] hash;
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            hash = md.digest(content);
        } catch (NoSuchAlgorithmException ex) {
            throw new RuntimeException(ex);
        }
        char[] hexChars = new char[hash.length * 2];
        for (int j = 0; j < hash.length; j++) {
            int v = hash[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }

    // Returns objects/<first two hex chars>/<rest>, creating the subdirectory if needed.
    private static Path makeObjectPath(Path gitDir, String hash) throws IOException {
        Path parent = gitDir.resolve("objects").resolve(hash.substring(0, 2));
        Files.createDirectories(parent);
        return parent.resolve(hash.substring(2));
    }

    // Compress the bytes and write them to the object file.
    private static void deflate(byte[] content, Path file) throws IOException {
        try (var output = new DeflaterOutputStream(Files.newOutputStream(file))) {
            output.write(content);
        }
    }
}
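A quick way to convince ourselves that the two halves agree is a round trip: write a blob, read it back, and compare. The sketch below does that in a throwaway temporary directory instead of a real repository. `RoundTrip` and its helpers are hypothetical names for this example, and it repeats the write and read logic in condensed form so that it runs on its own.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class RoundTrip {
    // Creates an empty directory that plays the role of .git/objects.
    static Path tempObjectsDir() {
        try {
            return Files.createTempDirectory("objects");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Writes content as a compressed blob and returns its 40-character key.
    static String write(Path objectsDir, byte[] content) {
        try {
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            raw.write(("blob " + content.length + "\0").getBytes(StandardCharsets.UTF_8));
            raw.write(content);
            byte[] bytes = raw.toByteArray();
            String key = sha1Hex(bytes);
            Path dir = Files.createDirectories(objectsDir.resolve(key.substring(0, 2)));
            try (var out = new DeflaterOutputStream(Files.newOutputStream(dir.resolve(key.substring(2))))) {
                out.write(bytes);
            }
            return key;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Reads the object back and returns only the content after the NUL byte.
    static byte[] read(Path objectsDir, String key) {
        Path file = objectsDir.resolve(key.substring(0, 2)).resolve(key.substring(2));
        try (var in = new InflaterInputStream(Files.newInputStream(file))) {
            byte[] raw = in.readAllBytes();
            int nul = 0;
            while (raw[nul] != 0) {
                nul++;
            }
            return Arrays.copyOfRange(raw, nul + 1, raw.length);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    static String sha1Hex(byte[] data) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-1").digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Path objects = tempObjectsDir();
        byte[] content = "what is up, doc?".getBytes(StandardCharsets.UTF_8);
        String key = write(objects, content);
        System.out.println(key + " round trip ok: " + Arrays.equals(content, read(objects, key)));
    }
}
```

Pointing the same code at a real .git/objects directory and checking the result with git cat-file -p would be the next sanity check.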

We can now write data to the database and read it back. In the next step we will use this to create a filesystem.

Creating a filesystem