# Tokenizer Efficiency
For the latest test on tokenizer efficiency, see: https://github.com/shisa-ai/shisa-v2/tree/main/eval/tokenizer-efficiency
Japanese tokenizer efficiency, measured over a sample of 50K items (~85M characters) from the JA subset of the CulturaX dataset (a minimal measurement sketch follows the tables):
| LLM | Tokenizer | Vocab Size | Avg Char/Token | 
|---|---|---|---|
| Shisa 7B (AUGMXNT) | augmxnt/shisa-base-7b-v1 | 120073 | 2.31 | 
| OpenCALM (CyberAgent) | cyberagent/open-calm-7b | 52000 | 2.17 | 
| Japanese LargeLM (LINE) | line-corporation/japanese-large-lm-3.6b | 51200 | 2.14 | 
| CALM2-7B (CyberAgent) | cyberagent/calm2-7b | 65000 | 2.00 | 
| Bilingual-GPT-NeoX-4B (Rinna) | rinna/bilingual-gpt-neox-4b | 65536 | 1.88 | 
| Japanese StableLM Alpha (Stability AI) | novelai/nerdstash-tokenizer-v1 | 65535 | 1.85 | 
| Japanese-GPT-NeoX-3.6B (Rinna) | rinna/japanese-gpt-neox-3.6b | 32000 | 1.83 | 
| Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b | 49247 | 1.79 | 
| Japanese-Llama-2-13b-fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-13b-fast | 44581 | 1.77 | 
| Orion 14B (OrionStarAI) | OrionStarAI/Orion-14B-Base | 84608 | 1.71 | 
| llm-jp-13b (LLM-jp) | llm-jp/llm-jp-13b-v1.0 | 50570 | 1.65 | 
| RakutenAI-7B | Rakuten/RakutenAI-7B | 48000 | 1.61 | 
| Swallow 7B (TokyoTech-LLM) | tokyotech-llm/Swallow-7b-hf | 43176 | 1.55 | 
| Japanese-Llama-2-7b-fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-7b-fast | 45043 | 1.53 | 
| Qwen 14B (Qwen) | Qwen/Qwen-14B | 151851 | 1.48 | 
| XVERSE 65B (xverse) | xverse/XVERSE-65B | 100534 | 1.10 | 
| weblab-10b (Matsuo Lab) | EleutherAI/gpt-neox-20b | 50254 | 1.00 | 
| Japanese StableLM Gamma (Stability AI) | mistralai/Mistral-7B-v0.1 | 32000 | 0.95 | 
| Youri 7B (Rinna) | meta-llama/Llama-2-7B | 32000 | 0.88 | 
| DeepSeek LLM 7B (DeepSeek) | deepseek-ai/deepseek-llm-7b-base | 102400 | 0.85 | 
| Yi 34B (01.ai) | 01-ai/Yi-34B | 64000 | 0.83 | 
As a sanity check (and to see how other tokenizers fare), we also test English efficiency over a sample of 50K items (~177M characters) from the EN subset of the CulturaX dataset:
| LLM | Tokenizer | Vocab Size | Avg Char/Token | 
|---|---|---|---|
| Qwen 14B (Qwen) | Qwen/Qwen-14B | 151851 | 4.47 | 
| weblab-10b (Matsuo Lab) | EleutherAI/gpt-neox-20b | 50254 | 4.45 | 
| DeepSeek LLM 7B (DeepSeek) | deepseek-ai/deepseek-llm-7b-base | 102400 | 4.33 | 
| Orion 14B (OrionStarAI) | OrionStarAI/Orion-14B-Base | 84608 | 4.25 | 
| Yi 34B (01.ai) | 01-ai/Yi-34B | 64000 | 4.19 | 
| Japanese StableLM Alpha (Stability AI) | novelai/nerdstash-tokenizer-v1 | 65535 | 4.15 | 
| Shisa 7B (AUGMXNT) | augmxnt/shisa-base-7b-v1 | 120073 | 4.12 | 
| CALM2-7B (CyberAgent) | cyberagent/calm2-7b | 65000 | 4.12 | 
| Japanese StableLM Beta JAVocab (Stability AI) | stabilityai/japanese-stablelm-base-ja_vocab-beta-7b | 49247 | 4.01 | 
| Japanese StableLM Gamma (Stability AI) | mistralai/Mistral-7B-v0.1 | 32000 | 4.01 | 
| Swallow 7B (TokyoTech-LLM) | tokyotech-llm/Swallow-7b-hf | 43176 | 3.86 | 
| Japanese-Llama-2-13b-fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-13b-fast | 44581 | 3.86 | 
| Japanese-Llama-2-7b-fast (ELYZA) | elyza/ELYZA-japanese-Llama-2-7b-fast | 45043 | 3.86 | 
| Youri 7B (Rinna) | meta-llama/Llama-2-7B | 32000 | 3.86 | 
| llm-jp-13b (LLM-jp) | llm-jp/llm-jp-13b-v1.0 | 50570 | 3.79 | 
| XVERSE 65B (xverse) | xverse/XVERSE-65B | 100534 | 2.96 | 
| OpenCALM (CyberAgent) | cyberagent/open-calm-7b | 52000 | 2.83 | 
| Japanese LargeLM (LINE) | line-corporation/japanese-large-lm-3.6b | 51200 | 2.49 | 
| Japanese-GPT-NeoX-3.6B (Rinna) | rinna/japanese-gpt-neox-3.6b | 32000 | 2.42 | 
| Bilingual-GPT-NeoX-4B (Rinna) | rinna/bilingual-gpt-neox-4b | 65536 | 2.42 |
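For reference, here is a minimal sketch of how an average chars-per-token number like those above can be computed, assuming the `transformers` and `datasets` libraries; the exact sampling and text handling in the linked eval script may differ:

```python
# Sketch: average characters per token over a CulturaX sample (assumed setup;
# the actual shisa-ai/shisa-v2 eval code may sample and normalize differently).
from datasets import load_dataset
from transformers import AutoTokenizer

N_SAMPLES = 50_000  # the tables above use 50K items per language


def avg_chars_per_token(tokenizer_name: str, lang: str = "ja") -> float:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    # CulturaX is gated on the Hub; streaming avoids downloading the full split.
    ds = load_dataset("uonlp/CulturaX", lang, split="train", streaming=True)

    total_chars = 0
    total_tokens = 0
    for i, row in enumerate(ds):
        if i >= N_SAMPLES:
            break
        text = row["text"]
        total_chars += len(text)
        total_tokens += len(tokenizer(text, add_special_tokens=False)["input_ids"])
    return total_chars / total_tokens


if __name__ == "__main__":
    print(avg_chars_per_token("augmxnt/shisa-base-7b-v1", "ja"))
```

Higher chars/token means fewer tokens are needed for the same text, which translates directly into lower inference cost and a longer effective context window for that language.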