Sync file changes
Merkle tree
A Merkle tree is an interesting structure for this problem.
Maybe we can use this to detect changes to shards. After the file is synced, we build a Merkle tree over hashes of fixed intervals of the data and store it. If chunks later change on another node, that node recomputes its Merkle tree, and we compare the two trees from the root down, descending only into subtrees whose hashes differ, like a binary search over just a few nodes. That way we identify exactly which parts of the file (chunks) changed.
We just need to be careful with appends, where we must detect new branches and nodes, and with file shrinks, where we must detect removed nodes and branches. There will be no inserts in the middle.
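A minimal sketch of that comparison, assuming fixed-size chunks and using the blake3 crate for the node hashes. The `MerkleNode` type and all names here are illustrative, not the rfs implementation:

```rust
// Hypothetical sketch: build a Merkle tree over fixed-size chunks, then diff
// two trees top-down, descending only into subtrees whose hashes differ.

const CHUNK_SIZE: usize = 64 * 1024;

#[derive(Clone)]
enum MerkleNode {
    Leaf { index: usize, hash: blake3::Hash },
    Inner { hash: blake3::Hash, left: Box<MerkleNode>, right: Box<MerkleNode> },
}

impl MerkleNode {
    fn hash(&self) -> blake3::Hash {
        match self {
            MerkleNode::Leaf { hash, .. } | MerkleNode::Inner { hash, .. } => *hash,
        }
    }
}

/// Build a tree whose leaves are the hashes of fixed-size chunks of `data`.
fn build(data: &[u8]) -> MerkleNode {
    let leaves: Vec<MerkleNode> = data
        .chunks(CHUNK_SIZE)
        .enumerate()
        .map(|(index, chunk)| MerkleNode::Leaf { index, hash: blake3::hash(chunk) })
        .collect();
    build_level(leaves)
}

fn build_level(mut nodes: Vec<MerkleNode>) -> MerkleNode {
    while nodes.len() > 1 {
        nodes = nodes
            .chunks(2)
            .map(|pair| match pair {
                [a, b] => {
                    let mut h = blake3::Hasher::new();
                    h.update(a.hash().as_bytes());
                    h.update(b.hash().as_bytes());
                    MerkleNode::Inner {
                        hash: h.finalize(),
                        left: Box::new(a.clone()),
                        right: Box::new(b.clone()),
                    }
                }
                [a] => a.clone(), // odd node is promoted to the next level unchanged
                _ => unreachable!(),
            })
            .collect();
    }
    nodes.into_iter().next().expect("non-empty input")
}

/// Collect indices of leaf chunks whose hashes differ, skipping identical
/// subtrees entirely (the "binary search" over a few nodes described above).
fn diff(a: &MerkleNode, b: &MerkleNode, changed: &mut Vec<usize>) {
    if a.hash() == b.hash() {
        return; // identical subtree: nothing below it changed
    }
    match (a, b) {
        (MerkleNode::Inner { left: al, right: ar, .. },
         MerkleNode::Inner { left: bl, right: br, .. }) => {
            diff(al, bl, changed);
            diff(ar, br, changed);
        }
        // Shape mismatch (append/shrink) or a differing leaf: report coarsely.
        (MerkleNode::Leaf { index, .. }, _) | (_, MerkleNode::Leaf { index, .. }) => {
            changed.push(*index)
        }
    }
}
```

Appends and shrinks show up in this scheme as shape mismatches along the right edge of the two trees, which is exactly the case above that needs careful handling.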
BLAKE3
BLAKE3 is a hash function with an internal, implicit Merkle tree whose leaves are 1024-byte chunks.
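With the blake3 crate that tree stays internal; the public API only ever exposes the 32-byte root hash:

```rust
fn main() {
    // One-shot hashing: returns the root of the implicit tree.
    let hash1 = blake3::hash(b"foobarbaz");

    // Incremental hashing over the same bytes yields the same root.
    let mut hasher = blake3::Hasher::new();
    hasher.update(b"foobar");
    hasher.update(b"baz");
    assert_eq!(hasher.finalize(), hash1);
}
```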
bao
Bao is an implementation of BLAKE3 verified streaming, as described in Section 6.4 of the BLAKE3 spec. Tree hashes like BLAKE3 make it possible to verify part of a file without re-hashing the entire thing, using an encoding format that stores the bytes of the file together with all the nodes of its hash tree. Clients can stream this encoding, or do random seeks into it, while verifying that every byte they read matches the root hash. For the details of how this works, see the Bao spec.
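A rough sketch of round-tripping through the combined encoding with the bao crate. The module paths and signatures here are assumptions based on the crate's README, so verify them against the docs before relying on them:

```rust
use std::io::prelude::*;

fn main() -> std::io::Result<()> {
    // Combined mode: the encoding interleaves the file bytes with the
    // nodes of its hash tree. (API assumed from the bao README.)
    let input = b"some file contents";
    let (encoded, hash) = bao::encode::encode(input);

    // Stream-decode while verifying every byte against the root hash;
    // a corrupted encoding makes the read fail instead of returning bad data.
    let mut decoder = bao::decode::Decoder::new(&*encoded, &hash);
    let mut decoded = Vec::new();
    decoder.read_to_end(&mut decoded)?;
    assert_eq!(&decoded[..], &input[..]);
    Ok(())
}
```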
In principle BLAKE3 can do all of these things, and they are also implemented in the lower-level crate https://crates.io/crates/bao-tree, but not all of it is exposed yet. BLAKE3, or really any tree hash, is great for finding diffs, with one caveat. It can detect in-place changes: if you have a 1 GB file and modify byte 12345678, it can quickly detect that all but one block is unchanged. But if you insert or remove data, then because the chunk size is fixed at 1024 bytes, it will not find a small diff: every chunk after the edit shifts and hashes differently. For that you need either Rabin chunking, which produces chunks of variable size, or the (quite similar, in a way) rsync algorithm. I personally don't find this a big limitation, since inserting or removing bytes in the middle of a large file is a very expensive operation even locally, so it is rarely done.
A tree hash will just find that everything before the insertion point is unchanged and everything after the insertion point is "new". If you do care about that case, BLAKE3, or any fixed-size-chunk algorithm, is not for you; you need something like Rabin chunking: https://en.wikipedia.org/wiki/Rabin_fingerprint
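For illustration, a minimal content-defined chunker using a Karp-Rabin-style rolling hash (the window size, multiplier, and mask are arbitrary example values, not tuned constants). Because boundaries depend only on the bytes inside the sliding window, an insertion only disturbs the chunks around the edit, and chunk hashes elsewhere stay stable:

```rust
const WINDOW: usize = 48;        // rolling-hash window size in bytes
const MASK: u64 = (1 << 13) - 1; // expected average chunk size ~8 KiB
const PRIME: u64 = 31;           // simple multiplier; real Rabin uses a GF(2) polynomial

/// Return the byte offsets where chunks end: positions where the rolling
/// hash of the last WINDOW bytes matches MASK.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    // PRIME^(WINDOW-1), the coefficient of the byte leaving the window.
    let pow = (0..WINDOW - 1).fold(1u64, |acc, _| acc.wrapping_mul(PRIME));
    let mut hash = 0u64;
    let mut boundaries = Vec::new();
    for i in 0..data.len() {
        if i >= WINDOW {
            // Slide the window: remove the contribution of data[i - WINDOW].
            hash = hash.wrapping_sub((data[i - WINDOW] as u64).wrapping_mul(pow));
        }
        // Add the incoming byte data[i].
        hash = hash.wrapping_mul(PRIME).wrapping_add(data[i] as u64);
        // Only test for a boundary once the window is full.
        if i + 1 >= WINDOW && (hash & MASK) == MASK {
            boundaries.push(i + 1); // chunk ends after this byte
        }
    }
    boundaries
}
```

Each chunk can then be hashed with blake3 and compared by content rather than by position, which is what makes insert/remove diffs small again.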