LEXAM

The paper didn't "improve the LLM (GPT-4o) itself" in the sense of retraining or fine-tuning the base model. Instead, the authors carefully engineered and refined how GPT-4o was used as a judge so that its evaluations align closely with those of human legal experts.

Here's how they achieved this, as detailed in Section 4.1 (Evaluation Metric for Open Questions) and validated in Section 6:

Specialized Prompt Engineering by Legal Experts:

  • Two of the paper's authors, both with formal doctoral-level legal training, drafted a highly specialized "judging prompt." This prompt was designed specifically for evaluating Swiss law school exam answers, instructing the LLM on how to assess accuracy, completeness, and legal reasoning. (The prompt itself is shown in Appendix E.2).

  • The prompt guided the LLM on aspects such as the following (a hedged code sketch of such a judging call appears after this list):

    • Focusing on accuracy, completeness, and legal reasoning.
    • How to handle deviations or additional elements not in the reference answer (penalizing incorrect additions, while not penalizing legally sound points the reference answer omits).
    • Assuming the reference answer is complete.
    • How to proportionally score answers if reference answers have sub-points.
    • The precise format for the output (Explanation, Constructive Feedback, Correctness Score).
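
A minimal sketch of what such a judging call might look like, assuming the OpenAI Python SDK and a 0 to 1 correctness score (both assumptions); the prompt text below is an illustrative paraphrase of the criteria above, not the actual expert-written prompt from Appendix E.2:

```python
# Illustrative LLM-as-judge call (not the paper's actual prompt or pipeline).
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM_PROMPT = """You are grading a Swiss law school exam answer.
Assess the candidate answer against the reference answer for accuracy,
completeness, and legal reasoning. Penalize incorrect deviations, but do not
penalize legally sound points that the reference answer omits. Treat the
reference answer as complete. If the reference answer has sub-points, score
proportionally to how many are covered correctly.

Return exactly:
Explanation: <why the score was given>
Constructive Feedback: <how the answer could be improved>
Correctness Score: <a number between 0.0 and 1.0>"""

def judge_answer(question: str, reference_answer: str, candidate_answer: str) -> str:
    """Ask GPT-4o to grade one exam answer; returns the raw judge output."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic grading
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Question:\n{question}\n\n"
                f"Reference answer:\n{reference_answer}\n\n"
                f"Candidate answer:\n{candidate_answer}"
            )},
        ],
    )
    return response.choices[0].message.content
```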

Iterative Optimization through a Pilot Study:

  • The legal experts conducted a pilot study to iteratively optimize this specialized prompt. They used GPT-4o (the chosen LLM judge) with the draft prompt to evaluate a diverse sample of courses.
  • They refined the prompt based on GPT-4o's outputs until they (the human legal experts) were satisfied with its judging quality.
  • A key challenge during this phase was calibrating penalties, especially for cases where the LLM's answer introduced plausible but incorrect information not found in the reference answer. This iterative process let them refine the instructions within the prompt to handle such nuances (a sketch of such a pilot pass follows this list).
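
A hedged sketch of how such a pilot pass could be scripted, reusing the hypothetical `judge_answer` helper above; the function name `run_pilot`, the item keys, and the CSV review workflow are assumptions for illustration, not the paper's exact procedure:

```python
# Sketch of a pilot pass: run the draft judge over a small sample of
# question-answer pairs and dump the outputs for human expert review.
# `judge_answer` is the hypothetical helper sketched earlier.
import csv
import random

def run_pilot(exam_items: list[dict], sample_size: int = 20,
              out_path: str = "pilot_review.csv") -> None:
    """Judge a random sample and write (course, question, judge output) rows
    so the legal experts can inspect them and adjust the prompt."""
    sample = random.sample(exam_items, min(sample_size, len(exam_items)))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["course", "question", "judge_output"])
        for item in sample:
            verdict = judge_answer(item["question"],
                                   item["reference_answer"],
                                   item["model_answer"])
            writer.writerow([item["course"], item["question"], verdict])
```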

Rigorous Validation with Human Experts:

  • After developing the LLM judge (GPT-4o + optimized prompt), they conducted a rigorous validation process described in Section 6.
  • Three human legal experts independently annotated 50 question-answer pairs.
  • They then used the Alternative Annotator Test (AAT) to compare the LLM judge's evaluations against the human experts' evaluations.
  • The results (Table 4) showed that the GPT-4o judge matched or exceeded the agreement levels of two of the three human experts, even under the strictest setting (ε = 0, meaning no "bonus" for being close), achieving a winning rate of ω = 0.67 (a simplified sketch of this winning-rate idea follows).
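
For intuition, here is a simplified, hedged illustration of the winning-rate idea behind the Alternative Annotator Test: for each held-out human expert, the LLM judge "wins" if its scores agree with the remaining experts at least as well (within ε) as the held-out expert's own scores do, and ω is the fraction of wins. The actual AAT uses a more careful statistical procedure than this leave-one-out sketch, and the function and variable names here are hypothetical:

```python
# Simplified illustration of the winning-rate idea behind the Alternative
# Annotator Test (AAT); the real test involves a more rigorous statistical
# procedure than this leave-one-out comparison.
import numpy as np

def winning_rate(llm_scores: np.ndarray, human_scores: np.ndarray,
                 eps: float = 0.0) -> float:
    """llm_scores: shape (n_items,); human_scores: shape (n_humans, n_items).
    Returns the fraction of held-out experts against whom the LLM judge
    does at least as well (within eps)."""
    n_humans = human_scores.shape[0]
    wins = 0
    for h in range(n_humans):
        others = np.delete(human_scores, h, axis=0).mean(axis=0)  # consensus of remaining experts
        llm_err = np.mean(np.abs(llm_scores - others))            # LLM judge's disagreement
        human_err = np.mean(np.abs(human_scores[h] - others))     # held-out expert's disagreement
        if llm_err <= human_err + eps:                            # eps = 0 is the strictest setting
            wins += 1
    return wins / n_humans

# With three experts, winning against two of them gives a rate of 2/3 ≈ 0.67.
```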

In Essence, the "Improvement" Came From:

  • Leveraging deep legal domain expertise to craft a highly specific and nuanced set of instructions (the prompt).
  • Iteratively refining these instructions based on observed performance in a pilot study.

This careful prompt engineering and refinement process made the pre-existing GPT-4o model behave like a reliable legal expert for the specific task of evaluating exam answers within the LEXAM benchmark.

In other words, they made the application of the LLM as a judge highly effective rather than altering the underlying model itself.