# Evals
## Repos
As we were working across a variety of cloud instances over time, we forked several of the eval repos to track our own fixes:
- llm-jp-eval - this was just to add `bos_token` and support Qwen testing (`trust_remote_code`); scripts and outputs are in the `eval` folder in the main repo
- Rakuda - we created a template for our model, but it kept barfing during inference, so I ripped the inferencing code out and replaced it with vLLM, which ended up being ~50X faster as well (a rough sketch of that setup follows this list)
- Japanese MT-Bench - the FastChat codebase is super gnarly and I couldn't figure out their prompt-picking algo, so I wedged in what we needed. It's also slow as heck and really needs to be replaced...
- llm-judge - the start of a custom inferencer for MT-Bench-style evals
- gpt4-autoeval - this adapts a GPT-4 autoeval of the ELYZA-tasks-100 benchmark
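For reference, here's a minimal sketch of the kind of vLLM-based batch inference the Rakuda fork switched to, with `trust_remote_code` enabled and the `bos_token` prepended explicitly (mirroring the llm-jp-eval fork change). The model name and prompts are placeholders, not the exact fork code:

```python
# Sketch only (not the exact fork code): batch generation with vLLM,
# with trust_remote_code enabled and the bos_token prepended explicitly.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "augmxnt/shisa-7b-v1"  # illustrative model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
llm = LLM(model=MODEL, trust_remote_code=True)

questions = ["日本の首都はどこですか?", "俳句を一つ作ってください。"]
# Prepend the BOS token ourselves; the prompt template here is a placeholder.
prompts = [f"{tokenizer.bos_token}{q}" for q in questions]

params = SamplingParams(temperature=0.0, max_tokens=512)
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```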
## Review of Evals
llm-jp-eval - as described in its DATASET.md, it's basically about a dozen reading-comprehension-style tests (more are being added at the moment)
Rakuda is interesting: basically an adaptation of the MT-Bench llm_judge code, but using pairwise comparison ranking instead, and with Japanese-oriented Q&A.
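For intuition, here's a minimal sketch of the pairwise-comparison-ranking idea using a simple Bradley-Terry fit over judge verdicts; this illustrates the approach, it is not Rakuda's actual implementation:

```python
# Illustration only (not Rakuda's code): rank models from pairwise
# judge verdicts with a simple Bradley-Terry fit (MM updates).
from collections import defaultdict

def bradley_terry(pairs, iters=200):
    """pairs: list of (winner, loser) model-name tuples from a judge."""
    models = {m for pair in pairs for m in pair}
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # games played per (unordered) matchup
    for w, l in pairs:
        wins[w] += 1
        games[frozenset((w, l))] += 1
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = sum(
                games[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models
                if o != m and games[frozenset((m, o))]
            )
            new[m] = wins[m] / denom if denom else strength[m]
        total = sum(new.values())
        strength = {m: s / total for m, s in new.items()}  # normalize
    return sorted(strength.items(), key=lambda kv: -kv[1])

# Hypothetical verdicts: A beat B twice, B beat C, A beat C.
print(bradley_terry([("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")]))
```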
## Evals to Make
- Combined Task comparison
- JA output fluency error rate (char rate)
- EN/JA leakage testing (char rate; see the sketch below this list)
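As a sketch of what we mean by "char rate" (our working definition, not a standard metric), something like the following would measure the fraction of Japanese vs. Latin characters in an output, as a cheap proxy for EN/JA leakage and JA fluency issues:

```python
# Sketch of the "char rate" idea: share of Japanese-script vs. Latin
# characters in a generation (whitespace ignored).
import re

JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uff66-\uff9f]")  # kana, kanji, half-width kana
EN_CHARS = re.compile(r"[A-Za-z]")

def char_rates(text: str) -> dict:
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return {"ja": 0.0, "en": 0.0}
    ja = sum(1 for c in chars if JA_CHARS.match(c))
    en = sum(1 for c in chars if EN_CHARS.match(c))
    return {"ja": ja / len(chars), "en": en / len(chars)}

# A JA prompt whose answer comes back mostly Latin is a leakage red flag.
print(char_rates("これはテストです。This is a test."))
```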
We have a prototype Streamlit Eval Server that's focused on letting you run and analyze ratings (a rough sketch of the rating flow follows this list):
- For generations, allow splitting into shorter sections for rating
- Tests: "100% correct Japanese", "Which do you like more?", various rubrics
- Users: metrics, ratings/stats, favorites, personal win/loss (for motivation and filtering)
- Same speed output
- Review: https://www.reddit.com/r/LocalLLaMA/comments/18nn78x/building_an_llm_rating_platform_and_need_criteria/
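As a rough sketch of the rating flow (not the actual eval server code), a pairwise preference screen in Streamlit might look something like this, with model identities hidden and ratings appended to a JSONL file:

```python
# Minimal sketch of a pairwise rating screen; data and file paths are
# placeholders, not the real eval server.
import json
import random
import streamlit as st

# Hypothetical pair of generations for the same prompt; the real server
# would load these from saved generation runs.
CANDIDATES = [
    ("model_a", "日本の首都は東京です。"),
    ("model_b", "The capital of Japan is Tokyo."),
]

# Shuffle once per session so model identities stay hidden and stable
# across Streamlit reruns.
if "pair" not in st.session_state:
    pair = CANDIDATES[:]
    random.shuffle(pair)
    st.session_state.pair = pair
pair = st.session_state.pair

st.title("Which response do you prefer?")
st.subheader("Response 1")
st.write(pair[0][1])
st.subheader("Response 2")
st.write(pair[1][1])

choice = st.radio("Your rating", ["Response 1", "Response 2", "Tie"])
correct_ja = st.checkbox("Output is 100% correct Japanese")

if st.button("Submit"):
    winner = {"Response 1": pair[0][0], "Response 2": pair[1][0]}.get(choice, "tie")
    with open("ratings.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps({"winner": winner, "correct_ja": correct_ja},
                           ensure_ascii=False) + "\n")
    st.success("Rating saved")
```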
## Results
### llm-jp-eval
I ran a lot of llm-jp-eval benchmarks as sanity checks and baselines. The original llm-jp-eval sheet started with LLM-jp's original leaderboard, with some validation tests on the models there to make sure our setup was running correctly, but it grew to include a lot more data.
### HuggingFace Leaderboard Output
All data is taken from the [HF Leaderboard](https://huggingface.co/open-llm-leaderboard) and can also be found as a spreadsheet.
Our lower MMLU score is probably due to this bug (HF's leaderboard uses a specific version of lm-evaluation-harness); once it's fixed, I may revisit and include our corrected benchmark scores if I have time.
Type | Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
---|---|---|---|---|---|---|---|---|
Base | mistralai/Mistral-7B-v0.1 | 60.97 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 37.83 |
FT | augmxnt/shisa-7b-v1 | 55.01 | 56.14 | 78.63 | 23.12 | 52.49 | 78.06 | 41.62 |
FT | mistralai/Mistral-7B-Instruct-v0.1 | 54.96 | 54.52 | 75.63 | 55.38 | 56.28 | 73.72 | 14.25 |
Base | augmxnt/shisa-base-7b-v1 | 51.64 | 52.30 | 77.63 | 23.12 | 42.40 | 78.53 | 35.86 |
FT | elyza/ELYZA-japanese-Llama-2-7b-instruct | 49.78 | 53.16 | 78.25 | 47.07 | 39.08 | 73.24 | 7.88 |
FT | elyza/ELYZA-japanese-Llama-2-7b-fast-instruct | 49.15 | 53.75 | 77.55 | 46.85 | 38.84 | 71.59 | 6.29 |
Base | elyza/ELYZA-japanese-Llama-2-7b | 48.70 | 52.22 | 76.42 | 44.60 | 37.92 | 72.69 | 8.34 |
FT | rinna/youri-7b-chat | 48.51 | 51.19 | 76.09 | 46.06 | 41.17 | 75.06 | 1.52 |
FT | elyza/ELYZA-japanese-Llama-2-7b-fast | 47.67 | 51.88 | 75.46 | 44.34 | 36.45 | 71.59 | 6.29 |
Base | rinna/youri-7b | 47.11 | 49.06 | 74.89 | 42.22 | 36.03 | 71.82 | 8.64 |
FT | llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 31.77 | 26.88 | 44.78 | 23.12 | 45.19 | 50.67 | 0.00 |
FT | llm-jp/llm-jp-13b-instruct-full-jaster-v1.0 | 31.63 | 27.22 | 44.70 | 23.12 | 44.69 | 50.04 | 0.00 |
Base | cyberagent/open-calm-7b | 28.21 | 20.48 | 30.65 | 25.22 | 44.15 | 48.54 | 0.23 |