Vocabulary Families - golololologol/LLM-Distillery GitHub Wiki

Vocabulary families are used to identify if your models are compatible for distillation or not.

They are based on the vocabulary of the model, e.g. which tokens the tokenizer has (excluding special tokens)

The code of how it gets calculated goes like this:

all_tokens = tokenizer.get_vocab().keys()
special_tokens = tokenizer.get_added_vocab().keys()

base_tokens = sorted(set(all_tokens) - set(added_tokens))
tokenizer_sha = hashlib.sha256("".join(base_tokens).encode()).hexdigest()

This gets us the sha of the tokenizer that the model uses, so now we can compare that sha to the dictionary that was painstakingly collected, to know which vocabulary family that tokenizer belongs to:

def get_vocab_family(tokenizer=None, model_path="") -> str:

    tokenizer = try_load_tokenizer(model_path) if tokenizer == None else tokenizer

    tokenizer_sha = get_tokenizer_sha(tokenizer)


    sha_to_family = {
        "154a07d332d0466ce54d5e83190930dc872c95777c493653c48d6b6b01891377": "mistral",
        "88dfafd1e6cd6fc3cf71600f1c8590ec6b457263267d801636320000a6f687e3": "llama_1|2",
        ...(a bunch more of these...)
        "cabd41803ba4aa362c59603aa9fedd80d8eab202708beccce9f4e1e0b58eaf3f": "codellama",
        "c2ed819dc3c535a3a64a10d492a39baa87b9cc7aa0a2c72adecc1b31e3e1b544": "jamba"
    }

  

    vocab_family = sha_to_family.get(tokenizer_sha, "Unknown") # type: ignore

    return vocab_family
⚠️ **GitHub.com Fallback** ⚠️