Evals

The Nejumi LLM Leaderboard is the current most comprehensive leaderboard for testing Japanese LLM proficiency: https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Leaderboard-Evaluating-Japanese-Language-Proficiency--Vmlldzo2MzU3NzIy

As of 2024-05-22, augmxnt/shisa-gamma-7b is still the top-scoring JA-tuned open model, which is neat!

However, the score on the Nejumi Leaderboard is simple: it is just the average (normalized to 1) of:

  • llm-jp-eval (1.1.0)
  • JA MT-Bench
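
To make that concrete, here's a minimal sketch of the scoring scheme, assuming llm-jp-eval averages are already on a 0-1 scale and JA MT-Bench is scored out of 10 (the function and variable names are mine, not the leaderboard's):

```python
def nejumi_score(llm_jp_eval: float, ja_mt_bench: float) -> float:
    """Average of the two benchmarks, each normalized to a 0-1 scale."""
    return (llm_jp_eval + ja_mt_bench / 10) / 2

# e.g., an llm-jp-eval average of 0.60 and a JA MT-Bench score of 7.2:
print(nejumi_score(0.60, 7.2))  # (0.60 + 0.72) / 2 = 0.66
```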

Both of these benchmarks have some issues, llm-jp-eval especially. I wrote an article about what I discovered about its performance here: https://huggingface.co/blog/leonardlin/llm-jp-eval-eval

While it's easy and cheap to run (only ~3 minutes to evaluate a 7B-class model on an H100), as detailed in the article, I don't believe it accurately reflects downstream performance (JA fluency, suitability for instruction- or chat-based tasks). I won't be using it even for intra-model relative performance testing, since I don't believe an improved llm-jp-eval score necessarily reflects a better model.

JA MT-Bench also has some weaknesses, the main one being that it doesn't adequately/correctly mark non-JA replies as incorrect (honestly, they should probably be zeros). For my JA MT-Bench tests I started using a simple regex-based heuristic to measure the "% JA characters" of a reply, which I've since been fine-tuning, but even the simple version can easily tell you whether a model is replying primarily in Japanese or not.
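
Here's a minimal sketch of that heuristic, matching characters against the main Japanese Unicode blocks (the exact ranges I use have been tweaked over time; this version is illustrative):

```python
import re

# Hiragana, katakana, CJK ideographs, JA punctuation, and fullwidth forms.
JA_CHAR = re.compile(
    r"[\u3040-\u309f"   # hiragana
    r"\u30a0-\u30ff"    # katakana
    r"\u4e00-\u9fff"    # CJK unified ideographs (shared with Chinese)
    r"\u3000-\u303f"    # CJK symbols and punctuation (、。「」 etc.)
    r"\uff00-\uffef]"   # halfwidth/fullwidth forms
)

def ja_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Japanese."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if JA_CHAR.match(c)) / len(chars)

print(ja_char_ratio("これはテストです。"))  # 1.0
print(ja_char_ratio("This is a test."))   # 0.0
```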

Lightblue recently released Shaberi: A Suite of Japanese Chat Benchmarks, a framework that does LLM-as-a-Judge evals of ELYZA Tasks 100, JA MT-Bench, Rakuda, and Lightblue's own Tengu Bench.

I've made my own Shisa.AI Shaberi fork, mainly to get my own judgements (and since I had to make a new JA MT-Bench set anyway), but I will probably be doing a fairly big rewrite/overhaul:

  • I switched to using LiteLLM to allow eventual swapping of judges (eg, PoLL), but also for potentially better native vLLM/HF generation. Right now Shaberi is problematic because it can't easily be scripted via Slurm (it requires vLLM running as a separate process, with non-deterministic loading time)
  • Native generation and batching will speed up generation greatly but are incompatible with the way Shaberi is currently written; I will be replacing the execution engine with my own llm-judge code soon...
  • The way that the configs, answers, and judgements are organized probably needs to be refactored for easier distribution
  • The benchmark judgement criteria need a lot of work. They should probably also be integrated with the JA character heuristic on a question-by-question basis. For example, Llama 3 Instruct 70B outputs only 5% JA characters but scores 6.7; you'd expect the score to be closer to 0 for Japanese, and this should be enforced much more strictly (see the sketch after this list)
  • Scores should also be taken with a grain of salt, as the score outputs depend on GPT-4's instruction following, which I've noticed fails maybe 1-2% of the time! It needs to use guidance, Outlines, or some other output enforcement, or retries (also sketched below)
  • Probably needs to move to a DB to track state, runs, etc.
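
As a sketch of those last points, here's roughly how a LiteLLM-based judge call could retry on unparseable ratings and gate the score by JA-character ratio (reusing ja_char_ratio from above; the [[N]] rating format follows MT-Bench conventions, but the function names and retry logic are mine, not Shaberi's):

```python
import re
from litellm import completion  # lets us swap judge models (eg, for PoLL)

JUDGE_MODEL = "gpt-4-turbo-2024-04-09"  # any LiteLLM-supported model id

def judge_score(question: str, answer: str, retries: int = 3) -> float:
    """Ask the judge for a 1-10 rating; retry if the reply can't be parsed."""
    judge_prompt = (
        "Rate the following answer from 1 to 10. "
        "Reply with only: Rating: [[N]]\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    for _ in range(retries):
        resp = completion(
            model=JUDGE_MODEL,
            messages=[{"role": "user", "content": judge_prompt}],
        )
        m = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", resp.choices[0].message.content)
        if m:
            return float(m.group(1))
    raise ValueError("judge never returned a parseable rating")

def gated_score(question: str, answer: str) -> float:
    """Scale the judge's rating by % JA characters, so a fluent but
    English-only reply (like Llama 3 70B's 5% JA) scores near zero."""
    return judge_score(question, answer) * ja_char_ratio(answer)
```

A hard cutoff (eg, zeroing any reply under some JA-character threshold) might be better than linear scaling; that's part of the judgement-criteria work mentioned above.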
| License | Model Name | AVG Score | ELYZA100 | JA MT-Bench | Rakuda | Tengu-Bench | JA Char % |
|---|---|---|---|---|---|---|---|
| Proprietary | gpt-4-turbo-2024-04-09 | 8.75 | 8.78 | 8.74 | 9.18 | 8.31 | 90.74% |
| Proprietary | gpt-4-turbo-preview | 8.67 | 8.94 | 8.61 | 9.28 | 7.84 | 89.63% |
| CC-BY-NC 4.0 | CohereForAI/c4ai-command-r-plus | 7.69 | 7.50 | 7.43 | 9.05 | 6.79 | 92.43% |
| Llama 3 | shisa-ai/shisa-v1-llama3-70b.8e6 | 7.30 | 7.34 | 7.67 | 8.15 | 6.04 | 91.71% |
| Proprietary | gpt-3.5-turbo-0125 | 7.17 | 7.24 | 6.98 | 7.64 | 6.82 | 92.84% |
| Llama 3 | shisa-ai/shisa-v1-llama3-70b.2e5 | 7.17 | 7.16 | 7.45 | 7.98 | 6.09 | 91.55% |
| CC-BY-NC 4.0 | CohereForAI/c4ai-command-r-v01 | 7.08 | 6.08 | 6.94 | 8.63 | 6.68 | 92.76% |
| Apache 2.0 | DataPilot/ArrowPro-7B-KUJIRA | 7.05 | 6.84 | 6.53 | 8.69 | 6.13 | 91.35% |
| Llama 2 | karakuri-ai/karakuri-lm-70b-chat-v0.1 | 6.84 | 6.86 | 6.43 | 7.85 | 6.23 | 90.04% |
| Qwen | lightblue/ao-karasu-72B | 6.81 | 7.19 | 6.54 | 7.25 | 6.27 | 90.53% |
| Apache 2.0 | DataPilot/ArrowPro-7B-RobinHood | 6.72 | 6.62 | 6.02 | 8.29 | 5.96 | 91.08% |
| Llama 3 | meta-llama/Meta-Llama-3-70B-Instruct | 6.69 | 8.22 | 8.54 | 3.40 | 6.61 | 5.75% |
| Llama 3 | shisa-v1-llama3-8b.8e6 | 6.59 | 6.67 | 6.95 | 7.05 | 5.68 | 91.30% |
| Llama 2 | tokyotech-llm/Swallow-70b-instruct-v0.1 | 6.55 | 7.28 | 6.45 | 6.58 | 5.91 | 90.75% |
| Llama 3 | shisa-ai/shisa-v1-llama3-8b.2e5 | 6.29 | 6.62 | 6.41 | 7.05 | 5.07 | 91.08% |
| Apache 2.0 | shisa-ai/shisa-v1-swallow-13a47b | 6.17 | 6.48 | 6.07 | 7.11 | 5.03 | 85.95% |
| Llama 3 | lightblue/suzume-llama-3-8B-japanese | 5.96 | 6.68 | 4.96 | 6.68 | 5.53 | 93.05% |
| Llama 3 | meta-llama/Meta-Llama-3-8B-Instruct | 5.83 | 6.68 | 7.68 | 3.13 | 5.85 | 5.05% |
| Apache 2.0 | augmxnt/shisa-gamma-7b-v1 | 5.82 | 5.96 | 5.02 | 6.85 | 5.47 | 91.49% |
| Apache 2.0 | Rakuten/RakutenAI-7B-chat | 5.58 | 5.92 | 4.60 | 6.58 | 5.24 | 89.79% |
| Gemma | shisa-ai/shisa-v1-gemma-8b | 5.64 | 6.50 | 5.42 | 5.10 | 5.55 | 90.72% |
| Llama 2 | elyza/ELYZA-japanese-Llama-2-13b-instruct | 5.26 | 5.60 | 4.31 | 5.63 | 5.52 | 88.15% |
| Qwen | lightblue/qarasu-14B-chat-plus-unleashed | 5.20 | 5.58 | 4.74 | 5.46 | 5.01 | 90.32% |
| Apache 2.0 | cyberagent/calm2-7b-chat | 4.76 | 4.90 | 3.58 | 5.75 | 4.81 | 87.50% |
| Apache 2.0 | mistralai/Mistral-7B-Instruct-v0.2 | 4.69 | 5.78 | 4.65 | 3.80 | 4.53 | 90.20% |
| Apache 2.0 | shisa-ai/shisa-v1-yi1.5-9b | 4.63 | 5.98 | 4.28 | 3.26 | 5.00 | 90.86% |
| Apache 2.0 | augmxnt/shisa-7b-v1 | 4.50 | 4.63 | 3.95 | 4.89 | 4.53 | 90.83% |
| Unreleased | lightblue/starlingbeta_cult_dsir_megagon_final | 3.59 | 4.96 | 2.69 | 3.15 | 3.55 | 90.55% |