# Legal review
Harvey and CoCounsel ranked at or near the top across multiple tasks in Vals AI's 2025 industry study with law-firm partners; worth a demo if you prefer an out-of-the-box workflow.
Yes, there are several benchmarks and leaderboards designed to evaluate AI models (particularly large language models or LLMs) on tasks related to reviewing legal agreements, such as clause extraction, risk identification, summarization, metadata extraction, and correction of contract language. These benchmarks focus on legal reasoning, contract understanding, and domain-specific performance, often using datasets like annotated contracts. While there isn't a single universal leaderboard (unlike general AI benchmarks like MMLU), results are scattered across academic papers, company reports, and dedicated platforms. Key ones include:
- LegalBench: A collaborative benchmark with 162 tasks covering legal reasoning, including interpretation, rule application, and issue-spotting, which are relevant to contract review (e.g., analyzing terms, identifying risks). It's hosted on GitHub and updated with model evaluations.
- CUAD (Contract Understanding Atticus Dataset): A specialized dataset with 510 commercial contracts annotated for 41 clause types (e.g., governing law, parties, expiration dates). It tests models on extracting and understanding risky or key clauses in legal agreements (a loading sketch follows this list).
- ContractLaw (from Vals AI): Focuses on contract-related tasks like extraction (pulling relevant phrases), matching (checking against standards), and correction (fixing non-standard language) across agreement types like NDAs and MSAs.
- SpotDraft LLM Benchmark: A practical evaluation of LLMs on legal tasks like contract review (risk comparison against playbooks), summarization, metadata extraction, and party identification, using real-world contracts.
- Other notable ones: ContractEval (an extension of CUAD for clause-level risk detection); Vals AI's CaseLaw (related to legal precedents but applicable to agreement analysis); and company-specific benchmarks like those from Thomson Reuters for long-context legal document handling.
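For a hands-on look at what these datasets contain, the sketch below pulls CUAD through the Hugging Face `datasets` library. The dataset identifier `"cuad"` and the SQuAD-style field names are assumptions based on the original release; check the Hub for the current listing (it may have moved to a `theatticusproject/*` mirror or a Parquet-only copy).

```python
# Minimal sketch: load CUAD and inspect one annotated example.
# Assumptions: dataset id "cuad" on the Hugging Face Hub, SQuAD-style fields
# ("question", "context", "answers"); adjust if the Hub listing has changed.
from datasets import load_dataset

cuad = load_dataset("cuad", split="test")

example = cuad[0]
print(example["question"][:120])       # clause category phrased as a question
print(example["context"][:300])        # the contract text
print(example["answers"]["text"][:2])  # gold clause spans (may be empty if the clause is absent)
```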
These benchmarks often report metrics like accuracy, F1 score, recall, precision, latency, and cost. Results can vary by task, and proprietary models (e.g., from OpenAI, Anthropic) generally outperform open-source ones, though open-source models like Llama are closing the gap. Benchmarks are updated periodically, with recent evaluations (as of mid-2025) incorporating newer models like GPT-5 and Llama 3.1.
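As a concrete illustration of the precision/recall/F1 numbers these leaderboards report, the toy function below scores predicted clause extractions against gold annotations by exact match after whitespace and case normalization. Real benchmarks use more involved matching rules (CUAD, for instance, reports AUPR), so this is only a sketch of the arithmetic.

```python
# Toy precision/recall/F1 for clause extraction, treating each extracted span
# as a set element after simple normalization. Not the scoring used by any
# specific leaderboard; just the basic metric arithmetic.

def prf1(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    norm = lambda s: " ".join(s.lower().split())
    pred, ref = {norm(p) for p in predicted}, {norm(g) for g in gold}
    tp = len(pred & ref)                                   # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = prf1(
    predicted=["Governing Law: State of Delaware", "Term: 24 months"],
    gold=["Governing Law: State of Delaware", "Expiration Date: 2026-01-01"],
)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")  # 0.50 / 0.50 / 0.50
```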
## Highest-Ranking Models
No model dominates every benchmark or task, as performance depends on specifics like context length, reasoning depth, or cost-efficiency. However, based on aggregated recent results (from 2024–2025 evaluations), the top performers for legal agreement review are typically high-parameter proprietary LLMs from OpenAI, Anthropic, Meta, and Google. Below is a summary of leading models and their scores/rankings across key benchmarks, focusing on contract review-relevant tasks. I've used tables for clarity where multiple sub-tasks are involved.
### LegalBench (Overall Legal Reasoning, Including Contract-Relevant Tasks like Interpretation and Issue-Spotting)
- Top models excel in nuanced legal analysis, which applies to agreement review.
| Model | Overall Accuracy | Notes/Ranking |
|---|---|---|
| o1 Preview (OpenAI) | 81.7% | #1; Strong in rule application and free-response tasks. |
| Llama 3.1 405B Instruct Turbo (Meta) | 79.0% | #2; Previous SOTA, cost-effective. |
| Claude 3.5 Sonnet (Anthropic) | ~78–80% (tied for #3) | Balanced performance; excels in interpretation. |
| GPT-4o (OpenAI) | ~78–80% (tied for #3) | High in issue-spotting. |
| GPT-5 (OpenAI) | Not quantified, but reported as new SOTA | Excels across tasks; recent leader. |
### CUAD (Clause Extraction in Contracts)
- Focuses on accuracy in pulling key elements like parties, dates, and governing law from agreements.
- Recent evaluations show proprietary models leading, but with task-specific variation (a prompt sketch follows this subsection).
| Task/Example | Top Model & Score | Other High Performers |
|---|---|---|
| Parties Involved | Llama 405B Instruct (Meta): 95.6% accuracy | GPT-4o (OpenAI): 95.4%; GPT-4o-mini: 95.2% |
| Document Name | GPT-4o-mini (OpenAI): 96.5% accuracy | GPT-4o: 94.9%; Claude 3.5 Sonnet: 93.7% |
| Governing Law | GPT-4o (OpenAI): 97.8% accuracy | Llama 405B: 95.0%; GPT-4o-mini: 85.8% |
| Effective Date | GPT-4o (OpenAI): 82.8% accuracy | Claude 3.5 Sonnet: 74.0%; GPT-4o-mini: 68.3% |
| Expiration Date | Llama 405B Instruct (Meta): 88.7% accuracy | Claude 3.5 Sonnet: 76.7%; GPT-4o: 69.7% |
In broader CUAD-based tests (e.g., ContractEval), proprietary models like GPT-4o and Claude 3.5 Sonnet achieve high correctness (80–90% on clause risks), outperforming open-source models like Llama by 10–20% on average.
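To make the CUAD-style extraction task above concrete, here is a minimal sketch of prompting a model for a single clause category with the OpenAI Python SDK (v1.x). The prompt wording, model name, and placeholder contract text are illustrative assumptions, not the setup used in the cited evaluations.

```python
# Illustrative clause-extraction call in the spirit of the CUAD tasks above.
# Requires OPENAI_API_KEY in the environment; the model name is just an example.
from openai import OpenAI

client = OpenAI()

contract_text = "...full agreement text..."  # placeholder; supply the real contract here

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                "Extract the Governing Law clause from the contract below. "
                "Quote the clause verbatim, or reply 'NOT FOUND' if it is absent.\n\n"
                + contract_text
            ),
        }
    ],
    temperature=0,  # deterministic output helps when comparing against gold annotations
)
print(response.choices[0].message.content)
```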
### ContractLaw (Vals AI; Extraction, Matching, Correction)
- Overall top: Llama 3.1 405B Instruct Turbo (75.2% accuracy; #1, excels in extraction and correction).
- Claude 3 Opus (Anthropic): 74.0% (#2; strong in matching and correction).
- o1 Mini (OpenAI): High among OpenAI models; best on nuanced tasks but weaker on matching.
- GPT-4o Mini (OpenAI): Top budget model (#1 on matching; cost-effective).
- Recent updates: GPT-4.1 (OpenAI) leads the related CaseLaw benchmark at 85.8%, but on ContractLaw, models like Grok 4 (66.0%) and Gemini 2.5 Pro Exp (64.7%) lag.
### SpotDraft Benchmark (Contract Review, Summarization, Extraction)
- GPT-4 (OpenAI): #1 for risk comparison in review (highest recall/accuracy).
- o1-mini (OpenAI): #1 for playbook generation in review (deep reasoning).
- Gemini 1.5 Flash (Google): #1 for metadata extraction in summarization (high speed/accuracy for long contexts).
- GPT-4 Turbo (OpenAI): #1 for description generation in summarization (fast, accurate).
- GPT-4o mini (OpenAI): #1 for party extraction (highest F1 score, precision/recall).
In summary, OpenAI's GPT-4o family and o1 series, Anthropic's Claude 3.5 Sonnet/Opus, Meta's Llama 3.1 405B, and Google's Gemini 1.5 consistently rank highest across these benchmarks for legal agreement review. For cutting-edge performance, check platforms like Vals AI or Hugging Face for the latest updates, as model releases evolve rapidly. If you need evaluations for a specific task or model, provide more details!