LLM as a Judge
During this project, we explored using ChatGPT4/`gpt-4-0613` as a rater/critic, with mixed results:

- `gpt-4-0613` only caught about 50% of basic Japanese language errors in model output compared to manual native-speaker review (n=100+).
- Neither ChatGPT4 nor `gpt-4-0613` could reliably rate the quality of Japanese responses or keep rating categories separate, even on a single-question, multi-turn basis and with extensive prompting.
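
For illustration, this is a minimal sketch of the kind of single-call judging we experimented with (assuming the pre-1.0 `openai` Python client and `OPENAI_API_KEY` in the environment; the prompt wording, rating axes, and helper name are illustrative, not the exact prompts used in this project):

```python
# Sketch of a gpt-4-0613 judging call. The prompt, rating axes, and 1-10 scale
# are illustrative only, not the project's actual judging prompts.
import openai

JUDGE_PROMPT = """あなたは日本語の採点者です。以下の回答を「自然さ」「正確さ」「指示への追従」の
3つの観点でそれぞれ1〜10で採点し、最後に短い理由を述べてください。

質問: {question}
回答: {answer}"""


def judge(question: str, answer: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4-0613",
        temperature=0,  # keep the judge itself deterministic
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return resp["choices"][0]["message"]["content"]
```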
The Rakuda blog post cites the MT-Bench paper's 83% agreement with human rankings, but that figure was measured only on English ratings and, AFAIK, has never been validated for Japanese. Based on my experience above, I have some doubts.
There are at least two projects, Yuzu.ai's Rakuda Benchmark and JP Stability.ai's Japanese MT-Bench, that are basically ports of the FastChat MT-Bench code.
When going through the code, there are some tricky issues worth pointing out:
- Japanese MT-Bench defaults to a somewhat crazy few-shot English prompt, so you need to be very careful to go through the prompt templates and assign the right one to your model (see the template-checking sketch after this list).
- Rakuda uses each model's own default prompt, which is better, but that means, for example, llama-2-chat's insanely neurotic English prompt (and English prompts for all the other non-Japanese models). Keep this in mind when looking at their leaderboard: non-JA models are actually even stronger than they appear, since in our testing, swapping to a JA prompt gave a huge boost in Japanese language performance.
- For testing, it seems that ratings (or Rakuda's pairwise comparisons) are typically done at `temp 0.7` on a single generation per model. For my testing, I set `--num_choices=4` to try to counteract the sampling variance, although I believe each model may need its own sampling settings tweaked. Otherwise, using `do_sample=False` may be the most reliable way to compare models (see the greedy-decoding sketch below).
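
As a sanity check for the prompt-template issue, something along these lines can show which conversation template FastChat will actually pick for a model and what the rendered prompt looks like (a sketch; the model path and the Japanese test message are placeholders, and it assumes FastChat's `get_conversation_template` helper):

```python
# Sketch: inspect the FastChat conversation template a model will get before
# running the MT-Bench answer generation. The model path is a placeholder.
from fastchat.model import get_conversation_template

conv = get_conversation_template("/path/to/your-ja-model")
print(conv.name)  # an unexpected generic/English template here is the failure mode described above

conv.append_message(conv.roles[0], "日本語で自己紹介してください。")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())  # the exact string (system prompt included) that would be sent to the model
```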
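And for the sampling question, this is roughly what deterministic greedy decoding looks like with Hugging Face transformers if you want single-sample comparisons outside the benchmark harness (a sketch; the model name and prompt are placeholders):

```python
# Sketch: greedy (deterministic) generation with transformers, i.e. the
# do_sample=False setting mentioned above. The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-ja-model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("日本の首都はどこですか?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy: sampling params are ignored
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```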