Evals

The Nejumi LLM Leaderboard is the current most comprehensive leaderboard for testing Japanese LLM proficiency: https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Leaderboard-Evaluating-Japanese-Language-Proficiency--Vmlldzo2MzU3NzIy

As of 2024-05-22, augmxnt/shisa-gamma-7b is still the top-scoring JA-tuned open model, which is neat!

However, the score on the Nejumi Leaderboard is simple: it is just the average (normalized to 1) of:

  • llm-jp-eval (1.1.0)
  • JA MT-Bench
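
To make that concrete, here's a minimal sketch of the scoring scheme, assuming llm-jp-eval averages are already on a 0-1 scale and JA MT-Bench is scored out of 10 (the function and variable names are mine, not the leaderboard's):

```python
def nejumi_score(llm_jp_eval: float, ja_mt_bench: float) -> float:
    """Average of the two benchmarks, each normalized to a 0-1 scale."""
    return (llm_jp_eval + ja_mt_bench / 10) / 2

# e.g., an llm-jp-eval average of 0.60 and a JA MT-Bench score of 7.2:
print(nejumi_score(0.60, 7.2))  # (0.60 + 0.72) / 2 = 0.66
```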

Both of these benchmarks have some issues, llm-jp-eval especially. I wrote an article about what I discovered about its performance here: https://huggingface.co/blog/leonardlin/llm-jp-eval-eval

While it's easy and cheap to run (only ~3 minutes to evaluate a 7B-class model on an H100), as detailed in the article, I don't believe it accurately reflects downstream performance (JA fluency, suitability for instruction- or chat-based tasks). I won't be using it even for intra-model relative performance testing, since I don't believe an improved llm-jp-eval score necessarily reflects a better model.

JA MT-Bench also has some weaknesses, the main one being that it doesn't adequately/correctly mark non-JA replies as incorrect (honestly, they should probably be zeros). For my JA MT-Bench tests I started using a simple regex-based heuristic to measure the "% JA characters" of a reply, which I've since been fine-tuning, but even the simple version can easily tell you whether a model is replying primarily in Japanese or not.
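
Here's a minimal sketch of that heuristic, matching characters against the main Japanese Unicode blocks (the exact ranges I use have been tweaked over time; this version is illustrative):

```python
import re

# Hiragana, katakana, CJK ideographs, JA punctuation, and fullwidth forms.
JA_CHAR = re.compile(
    r"[\u3040-\u309f"   # hiragana
    r"\u30a0-\u30ff"    # katakana
    r"\u4e00-\u9fff"    # CJK unified ideographs (shared with Chinese)
    r"\u3000-\u303f"    # CJK symbols and punctuation (、。「」 etc.)
    r"\uff00-\uffef]"   # halfwidth/fullwidth forms
)

def ja_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Japanese."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if JA_CHAR.match(c)) / len(chars)

print(ja_char_ratio("これはテストです。"))  # 1.0
print(ja_char_ratio("This is a test."))   # 0.0
```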

Lightblue recently released Shaberi: A Suite of Japanese Chat Benchmarks, a framework that does LLM-as-a-Judge evals of ELYZA Tasks 100, JA MT-Bench, Rakuda, and Lightblue's own Tengu Bench.

I've made my own Shisa.AI Shaberi fork, mainly to get my own judgements (and since I had to make a new JA MT-Bench set anyway), but I will probably be doing a fairly big rewrite/overhaul:

  • I switched to using LiteLLM to allow eventual swapping of judges (eg, PoLL), but also for potentially better native vLLM/HF generation. Right now Shaberi is problematic because it can't easily be scripted via Slurm (it requires vLLM running as a separate process, with non-deterministic loading time)
  • Native generation and batching will speed up generation greatly but are incompatible with the way Shaberi is currently written; I will be replacing the execution engine with my own llm-judge code soon...
  • The way that the configs, answers, and judgements are organized probably needs to be refactored for easier distribution
  • The benchmark judgement criteria need a lot of work. They should probably also be integrated with the JA character heuristic on a question-by-question basis. For example, Llama 3 Instruct 70B outputs only 5% JA characters but scores 6.7; you'd expect the score to be closer to 0 for Japanese, and this should be enforced much more strictly (see the sketch after this list)
  • Scores should also be taken with a grain of salt, as the score outputs depend on GPT-4's instruction following, which I've noticed fails maybe 1-2% of the time! It needs to use guidance, Outlines, or some other output enforcement, or retries (also sketched below)
  • Probably needs to move to a DB to track state, runs, etc.
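
As a sketch of those last points, here's roughly how a LiteLLM-based judge call could retry on unparseable ratings and gate the score by JA-character ratio (reusing ja_char_ratio from above; the [[N]] rating format follows MT-Bench conventions, but the function names and retry logic are mine, not Shaberi's):

```python
import re
from litellm import completion  # lets us swap judge models (eg, for PoLL)

JUDGE_MODEL = "gpt-4-turbo-2024-04-09"  # any LiteLLM-supported model id

def judge_score(question: str, answer: str, retries: int = 3) -> float:
    """Ask the judge for a 1-10 rating; retry if the reply can't be parsed."""
    judge_prompt = (
        "Rate the following answer from 1 to 10. "
        "Reply with only: Rating: [[N]]\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    for _ in range(retries):
        resp = completion(
            model=JUDGE_MODEL,
            messages=[{"role": "user", "content": judge_prompt}],
        )
        m = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", resp.choices[0].message.content)
        if m:
            return float(m.group(1))
    raise ValueError("judge never returned a parseable rating")

def gated_score(question: str, answer: str) -> float:
    """Scale the judge's rating by % JA characters, so a fluent but
    English-only reply (like Llama 3 70B's 5% JA) scores near zero."""
    return judge_score(question, answer) * ja_char_ratio(answer)
```

A hard cutoff (eg, zeroing any reply under some JA-character threshold) might be better than linear scaling; that's part of the judgement-criteria work mentioned above.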
| License | Model Name | AVG Score | ELYZA100 | JA MT-Bench | Rakuda | Tengu-Bench | JA Char % |
|---|---|---|---|---|---|---|---|
| Proprietary | gpt-4-turbo-2024-04-09 | 8.75 | 8.78 | 8.74 | 9.18 | 8.31 | 90.74% |
| Proprietary | gpt-4-turbo-preview | 8.67 | 8.94 | 8.61 | 9.28 | 7.84 | 89.63% |
| CC-BY-NC 4.0 | CohereForAI/c4ai-command-r-plus | 7.69 | 7.50 | 7.43 | 9.05 | 6.79 | 92.43% |
| Llama 3 | shisa-ai/shisa-v1-llama3-70b.8e6 | 7.30 | 7.34 | 7.67 | 8.15 | 6.04 | 91.71% |
| Proprietary | gpt-3.5-turbo-0125 | 7.17 | 7.24 | 6.98 | 7.64 | 6.82 | 92.84% |
| Llama 3 | shisa-ai/shisa-v1-llama3-70b.2e5 | 7.17 | 7.16 | 7.45 | 7.98 | 6.09 | 91.55% |
| CC-BY-NC 4.0 | CohereForAI/c4ai-command-r-v01 | 7.08 | 6.08 | 6.94 | 8.63 | 6.68 | 92.76% |
| Apache 2.0 | DataPilot/ArrowPro-7B-KUJIRA | 7.05 | 6.84 | 6.53 | 8.69 | 6.13 | 91.35% |
| Llama 2 | karakuri-ai/karakuri-lm-70b-chat-v0.1 | 6.84 | 6.86 | 6.43 | 7.85 | 6.23 | 90.04% |
| Qwen | lightblue/ao-karasu-72B | 6.81 | 7.19 | 6.54 | 7.25 | 6.27 | 90.53% |
| Apache 2.0 | DataPilot/ArrowPro-7B-RobinHood | 6.72 | 6.62 | 6.02 | 8.29 | 5.96 | 91.08% |
| Llama 3 | meta-llama/Meta-Llama-3-70B-Instruct | 6.69 | 8.22 | 8.54 | 3.40 | 6.61 | 5.75% |
| Llama 3 | shisa-v1-llama3-8b.8e6 | 6.59 | 6.67 | 6.95 | 7.05 | 5.68 | 91.30% |
| Llama 2 | tokyotech-llm/Swallow-70b-instruct-v0.1 | 6.55 | 7.28 | 6.45 | 6.58 | 5.91 | 90.75% |
| Llama 3 | shisa-ai/shisa-v1-llama3-8b.2e5 | 6.29 | 6.62 | 6.41 | 7.05 | 5.07 | 91.08% |
| Apache 2.0 | shisa-ai/shisa-v1-swallow-13a47b | 6.17 | 6.48 | 6.07 | 7.11 | 5.03 | 85.95% |
| Llama 3 | lightblue/suzume-llama-3-8B-japanese | 5.96 | 6.68 | 4.96 | 6.68 | 5.53 | 93.05% |
| Llama 3 | meta-llama/Meta-Llama-3-8B-Instruct | 5.83 | 6.68 | 7.68 | 3.13 | 5.85 | 5.05% |
| Apache 2.0 | augmxnt/shisa-gamma-7b-v1 | 5.82 | 5.96 | 5.02 | 6.85 | 5.47 | 91.49% |
| Apache 2.0 | Rakuten/RakutenAI-7B-chat | 5.58 | 5.92 | 4.60 | 6.58 | 5.24 | 89.79% |
| Gemma | shisa-ai/shisa-v1-gemma-8b | 5.64 | 6.50 | 5.42 | 5.10 | 5.55 | 90.72% |
| Llama 2 | elyza/ELYZA-japanese-Llama-2-13b-instruct | 5.26 | 5.60 | 4.31 | 5.63 | 5.52 | 88.15% |
| Qwen | lightblue/qarasu-14B-chat-plus-unleashed | 5.20 | 5.58 | 4.74 | 5.46 | 5.01 | 90.32% |
| Apache 2.0 | cyberagent/calm2-7b-chat | 4.76 | 4.90 | 3.58 | 5.75 | 4.81 | 87.50% |
| Apache 2.0 | mistralai/Mistral-7B-Instruct-v0.2 | 4.69 | 5.78 | 4.65 | 3.80 | 4.53 | 90.20% |
| Apache 2.0 | shisa-ai/shisa-v1-yi1.5-9b | 4.63 | 5.98 | 4.28 | 3.26 | 5.00 | 90.86% |
| Apache 2.0 | augmxnt/shisa-7b-v1 | 4.50 | 4.63 | 3.95 | 4.89 | 4.53 | 90.83% |
| Unreleased | lightblue/starlingbeta_cult_dsir_megagon_final | 3.59 | 4.96 | 2.69 | 3.15 | 3.55 | 90.55% |