LLM as a Judge
During this project, we explored using ChatGPT4/`gpt-4-0613` as a rater/critic, with mixed results:

- `gpt-4-0613` only caught about 50% of basic Japanese language errors in model output compared to manual native-speaker review (n=100+).
- Neither ChatGPT4 nor `gpt-4-0613` could reliably rate the quality of Japanese responses or keep rating categories separate, even on a single-question, multi-turn basis and with extensive prompting.
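
For illustration, this is a minimal sketch of the kind of single-call judging we experimented with (assuming the pre-1.0 `openai` Python client and `OPENAI_API_KEY` in the environment; the prompt wording, rating axes, and helper name are illustrative, not the exact prompts used in this project):

```python
# Sketch of a gpt-4-0613 judging call. The prompt, rating axes, and 1-10 scale
# are illustrative only, not the project's actual judging prompts.
import openai

JUDGE_PROMPT = """あなたは日本語の採点者です。以下の回答を「自然さ」「正確さ」「指示への追従」の
3つの観点でそれぞれ1〜10で採点し、最後に短い理由を述べてください。

質問: {question}
回答: {answer}"""


def judge(question: str, answer: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4-0613",
        temperature=0,  # keep the judge itself deterministic
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return resp["choices"][0]["message"]["content"]
```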
The Rakuda blog post cites the MT-Bench paper's 83% agreement with human rankings, but that figure was measured only on English ratings and, AFAIK, has never been validated for Japanese. Based on my experience above, I have some doubts.
There are at least two projects, Yuzu.ai's Rakuda Benchmark and JP Stability.ai's Japanese MT-Bench, that are basically ports of the FastChat MT-Bench code.
When going through the code, there are some tricky issues worth pointing out:
- Japanese MT-Bench defaults to a somewhat crazy few-shot English prompt, so you need to be very careful to go through the prompt templates and assign the right one to your model (see the template-checking sketch after this list).
- Rakuda uses each model's own default prompt, which is better, but that means, for example, llama-2-chat's insanely neurotic English prompt (and English prompts for all the other non-Japanese models). Keep this in mind when looking at their leaderboard: non-JA models are actually even stronger than they appear, since in our testing, swapping to a JA prompt gave a huge boost in Japanese language performance.
- For testing, it seems that ratings (or Rakuda's pairwise comparisons) are typically done at `temp 0.7` on a single generation per model. For my testing, I set `--num_choices=4` to try to counteract the sampling variance, although I believe each model may need its own sampling settings tweaked. Otherwise, using `do_sample=False` may be the most reliable way to compare models (see the greedy-decoding sketch below).
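
As a sanity check for the prompt-template issue, something along these lines can show which conversation template FastChat will actually pick for a model and what the rendered prompt looks like (a sketch; the model path and the Japanese test message are placeholders, and it assumes FastChat's `get_conversation_template` helper):

```python
# Sketch: inspect the FastChat conversation template a model will get before
# running the MT-Bench answer generation. The model path is a placeholder.
from fastchat.model import get_conversation_template

conv = get_conversation_template("/path/to/your-ja-model")
print(conv.name)  # an unexpected generic/English template here is the failure mode described above

conv.append_message(conv.roles[0], "日本語で自己紹介してください。")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())  # the exact string (system prompt included) that would be sent to the model
```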
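And for the sampling question, this is roughly what deterministic greedy decoding looks like with Hugging Face transformers if you want single-sample comparisons outside the benchmark harness (a sketch; the model name and prompt are placeholders):

```python
# Sketch: greedy (deterministic) generation with transformers, i.e. the
# do_sample=False setting mentioned above. The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-ja-model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("日本の首都はどこですか?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy: sampling params are ignored
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```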