# Tokenizer Efficiency
For the latest tokenizer-efficiency tests, see: https://github.com/shisa-ai/shisa-v2/tree/main/eval/tokenizer-efficiency
Japanese tokenizer efficiency, measured over 50K items (~85M characters) sampled from the JA subset of the CulturaX dataset (a measurement sketch follows the table below):
LLM | Tokenizer | Vocab Size | Avg. Char/Token (higher is better) |
---|---|---|---|
Shisa 7B (AUGMXNT) | augmxnt/shisa-base-7b-v1 | 120073 | 2.31 |
OpenCALM (CyberAgent) | cyberagent/open-calm-7b | 52000 | 2.17 |
Japanese LargeLM (LINE) | line-corporation/japanese-large-lm-3.6b | 51200 | 2.14 |
CALM2-7B (CyberAgent) | cyberagent/calm2-7b | 65000 | 2.00 |
Bilingual-GPT-NeoX-4B (Rinna) | rinna/bilingual-gpt-neox-4b | 65536 | 1.88 |
Japanese StableLM Alpha (Stability AI) | novelai/nerdstash-tokenizer-v1 | 65535 | 1.85 |
Japanese-GPT-NeoX-3.6B (Rinna) | rinna/japanese-gpt-neox-3.6b | 32000 | 1.83 |
Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b | 49247 | 1.79 |
ELYZA 13B fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-13b-fast | 44581 | 1.77 |
Orion 14B (OrionStarAI) | OrionStarAI/Orion-14B-Base | 84608 | 1.71 |
llm-jp-13b (LLM-jp) | llm-jp/llm-jp-13b-v1.0 | 50570 | 1.65 |
RakutenAI-7B (Rakuten) | Rakuten/RakutenAI-7B | 48000 | 1.61 |
Swallow 7B (TokyoTech-LLM) | tokyotech-llm/Swallow-7b-hf | 43176 | 1.55 |
Japanese-Llama-2-7b-fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-7b-fast | 45043 | 1.53 |
Qwen 14B (Qwen) | Qwen/Qwen-14B | 151851 | 1.48 |
XVERSE 65B (xverse) | xverse/XVERSE-65B | 100534 | 1.10 |
weblab-10b (Matsuo Lab) | EleutherAI/gpt-neox-20b | 50254 | 1.00 |
Japanese StableLM Gamma (Stability AI) | mistralai/Mistral-7B-v0.1 | 32000 | 0.95 |
Youri 7B (Rinna) | meta-llama/Llama-2-7b-hf | 32000 | 0.88 |
DeepSeek LLM 7B (DeepSeek) | deepseek-ai/deepseek-llm-7b-base | 102400 | 0.85 |
Yi 34B (01.ai) | 01-ai/Yi-34B | 64000 | 0.83 |
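Efficiency here is simply total characters divided by total tokens over the sampled text. Below is a minimal sketch of how such a measurement can be reproduced with Hugging Face `transformers` and `datasets`, assuming streaming access to the (gated) `uonlp/CulturaX` dataset; the function name `avg_chars_per_token` and taking the first 50K streamed items are illustrative assumptions, not the exact script behind the tables.

```python
from datasets import load_dataset
from transformers import AutoTokenizer


def avg_chars_per_token(tokenizer_name: str, lang: str, n_items: int = 50_000) -> float:
    """Average characters per token for `tokenizer_name` over `n_items`
    documents streamed from the CulturaX subset for `lang` ("ja" or "en")."""
    # trust_remote_code is needed for tokenizers with custom code (e.g. Qwen)
    tok = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
    ds = load_dataset("uonlp/CulturaX", lang, split="train", streaming=True)

    chars = tokens = 0
    for i, row in enumerate(ds):
        if i >= n_items:
            break
        text = row["text"]
        chars += len(text)
        # add_special_tokens=False so BOS/EOS markers don't skew the ratio
        tokens += len(tok.encode(text, add_special_tokens=False))
    return chars / tokens
```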
We also measure English efficiency on 50K items (~177M characters) sampled from the EN subset of the CulturaX dataset, both as a sanity check and to see how the other tokenizers fare (a usage example follows the table below):
LLM | Tokenizer | Vocab Size | Avg. Char/Token (higher is better) |
---|---|---|---|
Qwen 14B (Qwen) | Qwen/Qwen-14B | 151851 | 4.47 |
weblab-10b (Matsuo Lab) | EleutherAI/gpt-neox-20b | 50254 | 4.45 |
DeepSeek LLM 7B (DeepSeek) | deepseek-ai/deepseek-llm-7b-base | 102400 | 4.33 |
Orion 14B (OrionStarAI) | OrionStarAI/Orion-14B-Base | 84608 | 4.25 |
Yi 34B (01.ai) | 01-ai/Yi-34B | 64000 | 4.19 |
Japanese StableLM Alpha (Stability AI) | novelai/nerdstash-tokenizer-v1 | 65535 | 4.15 |
Shisa 7B (AUGMXNT) | augmxnt/shisa-base-7b-v1 | 120073 | 4.12 |
CALM2-7B (CyberAgent) | cyberagent/calm2-7b | 65000 | 4.12 |
Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b | 49247 | 4.01 |
Japanese StableLM Gamma (Stability AI) | mistralai/Mistral-7B-v0.1 | 32000 | 4.01 |
Swallow 7B (TokyoTech-LLM) | tokyotech-llm/Swallow-7b-hf | 43176 | 3.86 |
ELYZA 13B fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-13b-fast | 44581 | 3.86 |
Japanese-Llama-2-7b-fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-7b-fast | 45043 | 3.86 |
Youri 7B (Rinna) | meta-llama/Llama-2-7b-hf | 32000 | 3.86 |
llm-jp-13b (LLM-jp) | llm-jp/llm-jp-13b-v1.0 | 50570 | 3.79 |
XVERSE 65B (xverse) | xverse/XVERSE-65B | 100534 | 2.96 |
OpenCALM (CyberAgent) | cyberagent/open-calm-7b | 52000 | 2.83 |
Japanese LargeLM (LINE) | line-corporation/japanese-large-lm-3.6b | 51200 | 2.49 |
Japanese-GPT-NeoX-3.6B (Rinna) | rinna/japanese-gpt-neox-3.6b | 32000 | 2.42 |
Bilingual-GPT-NeoX-4B (Rinna) | rinna/bilingual-gpt-neox-4b | 65536 | 2.42 |
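As a rough usage sketch using the helper above (actual numbers depend on which 50K items get sampled, so expect small deviations from the tables):

```python
# Approximately reproduce the Shisa rows from both tables above
print(avg_chars_per_token("augmxnt/shisa-base-7b-v1", "ja"))  # ~2.31
print(avg_chars_per_token("augmxnt/shisa-base-7b-v1", "en"))  # ~4.12
```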