Vocabulary Families - golololologol/LLM-Distillery GitHub Wiki
Vocabulary families are used to identify if your models are compatible for distillation or not.
They are based on the vocabulary of the model, e.g. which tokens the tokenizer has (excluding special tokens)
The code of how it gets calculated goes like this:
all_tokens = tokenizer.get_vocab().keys()
special_tokens = tokenizer.get_added_vocab().keys()
base_tokens = sorted(set(all_tokens) - set(added_tokens))
tokenizer_sha = hashlib.sha256("".join(base_tokens).encode()).hexdigest()
This gets us the sha of the tokenizer that the model uses, so now we can compare that sha to the dictionary that was painstakingly collected, to know which vocabulary family that tokenizer belongs to:
def get_vocab_family(tokenizer=None, model_path="") -> str:
tokenizer = try_load_tokenizer(model_path) if tokenizer == None else tokenizer
tokenizer_sha = get_tokenizer_sha(tokenizer)
sha_to_family = {
"154a07d332d0466ce54d5e83190930dc872c95777c493653c48d6b6b01891377": "mistral",
"88dfafd1e6cd6fc3cf71600f1c8590ec6b457263267d801636320000a6f687e3": "llama_1|2",
...(a bunch more of these...)
"cabd41803ba4aa362c59603aa9fedd80d8eab202708beccce9f4e1e0b58eaf3f": "codellama",
"c2ed819dc3c535a3a64a10d492a39baa87b9cc7aa0a2c72adecc1b31e3e1b544": "jamba"
}
vocab_family = sha_to_family.get(tokenizer_sha, "Unknown") # type: ignore
return vocab_family