AI-Model Benchmark

Basic understanding

LLM benchmarks consist of different tests that evaluate a model. Sample data is provided together with parameters that define how the test is to be run, and the scoring can be determined by different measures such as accuracy (choosing the correct option from a given set), recall (how much of the relevant information is retrieved), perplexity (how well the model predicts the next token; lower is better), and more. Scores are usually normalised to a range of 0-100.
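
The following is a minimal sketch (not taken from any specific benchmark) of how an accuracy-based score can be computed and normalised to the 0-100 range described above; the sample answers are invented for illustration.

```python
# Minimal sketch: accuracy over multiple-choice items, normalised to 0-100.
# The predictions and ground truth below are made up for illustration.

predictions = ["B", "A", "D", "C", "A"]   # answers chosen by the model
ground_truth = ["B", "C", "D", "C", "B"]  # correct answers from the dataset

correct = sum(p == g for p, g in zip(predictions, ground_truth))
accuracy = correct / len(ground_truth)    # fraction in [0, 1]
score = 100 * accuracy                    # normalised to the 0-100 range

print(f"Accuracy: {accuracy:.2f} -> benchmark score: {score:.0f}/100")
```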


Multiple choice evaluations

  • A common parameter in this area is "shot". It refers to providing a number of example question-answer pairs as additional context before the actual question. Zero-shot (no examples) and few-shot with a stated number of examples (e.g. 5-shot) are commonly seen; a small construction sketch follows below.
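
As a rough illustration of the "shot" parameter, the sketch below builds a few-shot prompt from example question-answer pairs. The examples and the prompt formatting are hypothetical assumptions; real harnesses use their own templates.

```python
# Hypothetical few-shot prompt construction (formatting is an assumption,
# not the template used by any particular benchmark).

examples = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]
question = "What is the chemical symbol for gold?"

def build_prompt(examples, question):
    """0-shot if `examples` is empty, n-shot for n example pairs."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

print(build_prompt([], question))        # 0-shot
print(build_prompt(examples, question))  # 2-shot
```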

Multiple choice datasets

MMLU dataset

  • Tests multitask accuracy in the humanities, social sciences, STEM and other subjects.

  • Questions come from college admissions tests, exam practice questions, university course readers and other sources.

  • Questions fall into different difficulty levels: "Elementary", "High School", "College" and "Professional".

  • Human accuracy is around 33% for unspecialised test-takers, while estimated expert-level accuracy is 89.8%.

  • 15,908 questions across 57 subjects, with at least 100 questions per subject.

  • Evaluation is few-shot with 5 examples (5-shot).

  • The final score is the average over all subjects, as sketched below.
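
A minimal sketch of the final-score computation: per-subject accuracies are macro-averaged over all subjects. The subject names and accuracy values below are invented for illustration.

```python
# Macro-average over subjects, as described above.
# The per-subject accuracies are placeholders, not real results.

subject_accuracy = {
    "college_mathematics": 0.41,
    "high_school_biology": 0.62,
    "professional_law": 0.38,
}

final_score = 100 * sum(subject_accuracy.values()) / len(subject_accuracy)
print(f"MMLU-style final score: {final_score:.1f}")
```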

ARC dataset

  • 7,787 natural-science multiple-choice questions from standardised tests; non-diagram questions only
  • 2,590 hard questions (which retrieval and word co-occurrence methods fail to answer) and 5,197 easy questions
  • Grade levels range from 3rd to 9th grade, filtered from US standardised tests
  • Reasoning types: analogy, spatial, explanation, counterfactual, algebraic, comparison, logic, linguistic

Usage

Eleuther AI LM Evaluation Harness

  • Install the library and use it via the command line.
  • More information on the available arguments can be found in the project's documentation.
  • Example: lm_eval --model hf --model_args pretrained=<huggingface path or local path>,dtype="float16" --tasks winogrande,lambada_openai,arc_easy --batch_size 16 ...
  • Many options can be set, such as where to find the model, which tasks (i.e. evaluations) to run, how to configure the test run, and more.
  • Evaluations can also be run in parallel; a small Python wrapper sketch follows after this list.
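
The sketch below wraps the command-line call from the example above in a Python subprocess call. The model path is a placeholder and the flags simply mirror the example command, so check them against the harness documentation before relying on them.

```python
# Sketch: invoking the lm-eval CLI from Python.
# The model path is a placeholder; flags mirror the example command above.
import subprocess

cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=<huggingface path or local path>,dtype=float16",
    "--tasks", "winogrande,lambada_openai,arc_easy",
    "--batch_size", "16",
]
subprocess.run(cmd, check=True)
```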

Problem: Inconsistent testing prerequisites

  • Different implementations of how the questions from a dataset are presented to the model may yield different results and scores. For example: is a topic line included? Are keywords like "Question" and "Choices" used?
  • In few-shot settings, the order of the context examples might differ, resulting in different answer preferences.
  • The evaluation may also differ depending on whether the most probable answer out of A, B, C and D is taken, or whether the probability of the correct answer is compared against all answer options. [Fourrier]
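
The sketch below illustrates the two scoring strategies from the last bullet with invented log-probabilities: picking the single most probable answer letter versus comparing the probability of the correct answer against all options. The numbers and the softmax normalisation are assumptions for illustration only.

```python
import math

# Invented log-probabilities the model assigns to each answer option.
logprobs = {"A": -4.1, "B": -2.3, "C": -2.5, "D": -5.0}
correct = "C"

# Strategy 1: the item counts as correct only if the correct option is the
# single most probable one.
predicted = max(logprobs, key=logprobs.get)
strategy1_correct = predicted == correct

# Strategy 2: compare the probability of the correct answer against all
# options (here via a softmax over the option log-probabilities).
total = sum(math.exp(lp) for lp in logprobs.values())
prob_correct = math.exp(logprobs[correct]) / total

print(f"Strategy 1 (argmax): correct={strategy1_correct}")
print(f"Strategy 2 (relative probability of correct answer): {prob_correct:.2f}")
```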

BLEU and ROUGE

Both use n-gram co-occurrence statistics to evaluate how well a sample matches a reference. [Clement]

Bilingual Evaluation Understudy (BLEU)

  • Automatically evaluates machine translation quality by comparing how well the generated words match a human reference translation.

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

  • Measures how well a text is summarised by comparing it against reference summaries.

Relevance for us

  • We can use similar formulas to evaluate how close the generated test code comes to our own example.
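
As a rough sketch of this idea, the snippet below computes a simple n-gram overlap between a generated test and a hand-written reference test: n-gram precision is BLEU-like, n-gram recall is ROUGE-like. This is a toy re-implementation for illustration, not the official BLEU/ROUGE metrics (which add brevity penalties, multiple n-gram orders, stemming, etc.); the example strings are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_scores(candidate, reference, n=2):
    """Return (BLEU-like precision, ROUGE-like recall) of n-gram overlap."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    matched = sum((cand & ref).values())
    precision = matched / max(sum(cand.values()), 1)  # BLEU-like
    recall = matched / max(sum(ref.values()), 1)      # ROUGE-like
    return precision, recall

reference = "def test_add(): assert add(2, 3) == 5"
generated = "def test_add(): result = add(2, 3) ; assert result == 5"
print(overlap_scores(generated, reference, n=2))
```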

Other evaluations

SonarQube or SonarCloud

  • Collecting code metrics with platforms such as SonarQube or SonarCloud
  • Using LeetCode (or other coding-practice platforms): let the model generate test cases for their example problems and count how many pass (a sketch follows after this list)
  • Evaluating complexity in terms of the number of methods, cyclomatic complexity and cognitive complexity
  • How many bugs and vulnerabilities can be detected? [Tosi]
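
A minimal sketch of the "how many generated tests pass" idea: each LLM-generated test file is executed with pytest in a subprocess, and the pass rate is the fraction of files whose test run exits successfully. The file names are placeholders; a real setup would parse pytest's report to get per-test pass/fail counts.

```python
# Sketch: run each generated test file with pytest and count successful runs.
# The file list is a placeholder.
import subprocess

generated_test_files = [
    "tests/test_generated_example_1.py",
    "tests/test_generated_example_2.py",
]

passed = 0
for test_file in generated_test_files:
    result = subprocess.run(["pytest", "-q", test_file])
    if result.returncode == 0:  # exit code 0 means all tests in the file passed
        passed += 1

print(f"Pass rate: {passed}/{len(generated_test_files)}")
```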

Problems:

Bias

  • Query bias – the evaluation queries are not comprehensive or not appropriately distributed
  • Grading bias – systematic skew in how answers are scored
  • Generalisation bias – models overfit to the evaluation data

Automatic benchmarking (MMLU, MT-Bench)

  • Ground-truth benchmarking – impartial grading, but not enough nuance
  • Open-ended benchmarking using LLMs as graders – suffers from grading bias and preference bias
  • Both approaches are prone to contamination and generalisation problems

Large-scale user-facing benchmarks (Chatbot Arena)

  • Collect a vast array of real-world user queries
  • Mitigate the generalisation problem thanks to crowd wisdom
  • Large sample sizes mean less noise and therefore little grading bias
  • Expensive, slow and hard to reproduce
  • Not publicly accessible [Ni]

Useful references

  • Edirisinghe
  • White
  • Ni
  • Lucek
  • Fourrier
  • Clement
  • Tosi