# Evals
The Nejumi LLM Leaderboard is currently the most comprehensive leaderboard for testing Japanese LLM proficiency: https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Leaderboard-Evaluating-Japanese-Language-Proficiency--Vmlldzo2MzU3NzIy
As of 2024-05-22, augmxnt/shisa-gamma-7b is still the top-scoring JA-tuned open model, which is neat!
However, the Nejumi Leaderboard score is simple: it is just the average (normalized to 1) of:
- llm-jp-eval (1.1.0)
- JA MT-Bench
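Concretely, that works out to something like this minimal sketch (assuming llm-jp-eval reports in [0, 1] and JA MT-Bench is first rescaled from its 10-point scale; the function is my illustration, not Nejumi's actual code):

```python
def nejumi_score(llm_jp_eval: float, ja_mt_bench: float) -> float:
    """Average the two benchmarks into a single [0, 1] leaderboard score."""
    # Assumption: llm-jp-eval is already in [0, 1]; JA MT-Bench is 1-10.
    return (llm_jp_eval + ja_mt_bench / 10) / 2

# Example: 0.55 on llm-jp-eval and 7.2 on JA MT-Bench
print(nejumi_score(0.55, 7.2))  # 0.635
```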
Both of these benchmarks have some issues, llm-jp-eval especially. I wrote an article about what I discovered about its performance here: https://huggingface.co/blog/leonardlin/llm-jp-eval-eval
While it's easy and cheap to run (only ~3 minutes to evaluate a 7B-class model on an H100), as detailed in the article, I do not believe it accurately reflects downstream performance (JA fluency, suitability for instruction- or chat-based tasks). I won't be using it even for intra-model relative performance testing, since I don't believe an improved llm-jp-eval score necessarily reflects a better model.
JA MT-Bench also has some weaknesses - the main one is that it doesn't adequately/correctly mark non-JA replies as incorrect (honestly, they should probably be zeros). For my JA MT-Bench tests I started using a simple regex-based heuristic to detect "% JA characters," which I've since been fine-tuning; it can easily tell you whether a model is replying primarily in Japanese or not.
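A minimal sketch of this kind of heuristic is below (the character ranges here are illustrative, not the exact ones from my scripts):

```python
import re

# Character classes counted as "Japanese": hiragana, katakana, CJK
# ideographs (kanji), and JA punctuation/fullwidth forms. Kanji overlaps
# with Chinese, so this is strictly a heuristic.
JA_CHARS = re.compile(
    r"[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\uFF00-\uFFEF]"
)

def ja_char_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are Japanese."""
    stripped = re.sub(r"\s", "", text)
    if not stripped:
        return 0.0
    return len(JA_CHARS.findall(stripped)) / len(stripped)

print(f"{ja_char_ratio('これは日本語の返答です。'):.0%}")   # 100%
print(f"{ja_char_ratio('This is an English reply.'):.0%}")  # 0%
```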
Lightblue recently released Shaberi: A Suite of Japanese Chat Benchmarks, a framework that does LLM-as-a-Judge evals of ELYZA Tasks 100, JA MT-Bench, Rakuda, and Lightblue's own Tengu Bench.
I've made my own Shisa.AI Shaberi fork, mainly to get my own judgements (and since I had to make a new JA MT-Bench set anyway), but I will probably be doing a fairly big rewrite/overhaul:
- I switched to using LiteLLM to allow for eventual swapping of judges (e.g., PoLL), but also for potentially better native vLLM/HF generation; a sketch of the judge call appears after this list. Right now Shaberi is problematic because it cannot be easily scripted via Slurm (it requires vLLM as a separate process, with non-deterministic loading time).
- Native generation and batching will speed up generation greatly but are incompatible with the way Shaberi is currently written; I will be replacing the execution engine with my own llm-judge code soon...
- The way that the configs, answers, and judgements are organized probably needs to be refactored for easier distribution
- The benchmark judgement criteria need a lot of work. They should probably also be integrated with the JA character heuristic on a question-by-question basis. For example, Llama 3 Instruct 70B outputs only 5% JA characters but scores 6.7; you'd expect a score closer to 0 for Japanese, and this should be enforced much more strictly.
- Scores should also be taken with a grain of salt, since they depend on GPT-4's instruction following, and I've noticed it fails maybe 1-2% of the time! It needs guidance, Outlines, or some other output enforcement, or a retry loop (sketched after this list).
- Probably needs to move to a DB to track state, runs, etc.
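For reference, here's a rough sketch of the direction for the judge call: routed through LiteLLM so the judge model is swappable, with a lenient regex parse and a retry when the judge doesn't follow the output format. The model name, prompt handling, and retry policy are all placeholders, not the current Shaberi code:

```python
import re
import litellm

# MT-Bench-style judges are asked to emit "Rating: [[N]]"; parse leniently.
SCORE_RE = re.compile(r"Rating:\s*\[{0,2}(\d+(?:\.\d+)?)\]{0,2}")

def judge_score(judge_prompt: str, model: str = "gpt-4-turbo", retries: int = 3) -> float:
    """Ask the judge model for a rating, retrying when no score parses."""
    for _ in range(retries):
        resp = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": judge_prompt}],
            temperature=0,
        )
        m = SCORE_RE.search(resp.choices[0].message.content)
        if m:
            return float(m.group(1))
    raise ValueError(f"judge gave no parseable score after {retries} tries")
```

Because the call goes through LiteLLM, swapping the judge (or polling several judges, PoLL-style) is just a change of the `model` string rather than a rewrite of the client code.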
License | Model Name | AVG Score | ELYZA100 | JA MT-Bench | Rakuda | Tengu-Bench | JA Char % |
---|---|---|---|---|---|---|---|
Proprietary | gpt-4-turbo-2024-04-09 | 8.75 | 8.78 | 8.74 | 9.18 | 8.31 | 90.74% |
Proprietary | gpt-4-turbo-preview | 8.67 | 8.94 | 8.61 | 9.28 | 7.84 | 89.63% |
CC-BY-NC 4.0 | CohereForAI/c4ai-command-r-plus | 7.69 | 7.50 | 7.43 | 9.05 | 6.79 | 92.43% |
Llama 3 | shisa-ai/shisa-v1-llama3-70b.8e6 | 7.30 | 7.34 | 7.67 | 8.15 | 6.04 | 91.71% |
Proprietary | gpt-3.5-turbo-0125 | 7.17 | 7.24 | 6.98 | 7.64 | 6.82 | 92.84% |
Llama 3 | shisa-ai/shisa-v1-llama3-70b.2e5 | 7.17 | 7.16 | 7.45 | 7.98 | 6.09 | 91.55% |
CC-BY-NC 4.0 | CohereForAI/c4ai-command-r-v01 | 7.08 | 6.08 | 6.94 | 8.63 | 6.68 | 92.76% |
Apache 2.0 | DataPilot/ArrowPro-7B-KUJIRA | 7.05 | 6.84 | 6.53 | 8.69 | 6.13 | 91.35% |
Llama 2 | karakuri-ai/karakuri-lm-70b-chat-v0.1 | 6.84 | 6.86 | 6.43 | 7.85 | 6.23 | 90.04% |
Qwen | lightblue/ao-karasu-72B | 6.81 | 7.19 | 6.54 | 7.25 | 6.27 | 90.53% |
Apache 2.0 | DataPilot/ArrowPro-7B-RobinHood | 6.72 | 6.62 | 6.02 | 8.29 | 5.96 | 91.08% |
Llama 3 | meta-llama/Meta-Llama-3-70B-Instruct | 6.69 | 8.22 | 8.54 | 3.40 | 6.61 | 5.75% |
Llama 3 | shisa-ai/shisa-v1-llama3-8b.8e6 | 6.59 | 6.67 | 6.95 | 7.05 | 5.68 | 91.30% |
Llama 2 | tokyotech-llm/Swallow-70b-instruct-v0.1 | 6.55 | 7.28 | 6.45 | 6.58 | 5.91 | 90.75% |
Llama 3 | shisa-ai/shisa-v1-llama3-8b.2e5 | 6.29 | 6.62 | 6.41 | 7.05 | 5.07 | 91.08% |
Apache 2.0 | shisa-ai/shisa-v1-swallow-13a47b | 6.17 | 6.48 | 6.07 | 7.11 | 5.03 | 85.95% |
Llama 3 | lightblue/suzume-llama-3-8B-japanese | 5.96 | 6.68 | 4.96 | 6.68 | 5.53 | 93.05% |
Llama 3 | meta-llama/Meta-Llama-3-8B-Instruct | 5.83 | 6.68 | 7.68 | 3.13 | 5.85 | 5.05% |
Apache 2.0 | augmxnt/shisa-gamma-7b-v1 | 5.82 | 5.96 | 5.02 | 6.85 | 5.47 | 91.49% |
Gemma | shisa-ai/shisa-v1-gemma-8b | 5.64 | 6.50 | 5.42 | 5.10 | 5.55 | 90.72% |
Apache 2.0 | Rakuten/RakutenAI-7B-chat | 5.58 | 5.92 | 4.60 | 6.58 | 5.24 | 89.79% |
Llama 2 | elyza/ELYZA-japanese-Llama-2-13b-instruct | 5.26 | 5.60 | 4.31 | 5.63 | 5.52 | 88.15% |
Qwen | lightblue/qarasu-14B-chat-plus-unleashed | 5.20 | 5.58 | 4.74 | 5.46 | 5.01 | 90.32% |
Apache 2.0 | cyberagent/calm2-7b-chat | 4.76 | 4.90 | 3.58 | 5.75 | 4.81 | 87.50% |
Apache 2.0 | mistralai/Mistral-7B-Instruct-v0.2 | 4.69 | 5.78 | 4.65 | 3.80 | 4.53 | 90.20% |
Apache 2.0 | shisa-ai/shisa-v1-yi1.5-9b | 4.63 | 5.98 | 4.28 | 3.26 | 5.00 | 90.86% |
Apache 2.0 | augmxnt/shisa-7b-v1 | 4.50 | 4.63 | 3.95 | 4.89 | 4.53 | 90.83% |
Unreleased | lightblue/starlingbeta_cult_dsir_megagon_final | 3.59 | 4.96 | 2.69 | 3.15 | 3.55 | 90.55% |