AI-Model Benchmark

Basic understanding

LLM benchmarks consist of different tests that evaluate a model. Sample data is provided together with parameters that define how the test is to be run, and the scoring can be determined by different measures such as accuracy (choosing the correct option from a given set), recall (how much of the relevant information is retrieved), perplexity (how well the model predicts the next token; lower is better), and more. Scores are usually normalised to a range of 0-100.
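
The following is a minimal sketch (not taken from any specific benchmark) of how an accuracy-based score can be computed and normalised to the 0-100 range described above; the sample answers are invented for illustration.

```python
# Minimal sketch: accuracy over multiple-choice items, normalised to 0-100.
# The predictions and ground truth below are made up for illustration.

predictions = ["B", "A", "D", "C", "A"]   # answers chosen by the model
ground_truth = ["B", "C", "D", "C", "B"]  # correct answers from the dataset

correct = sum(p == g for p, g in zip(predictions, ground_truth))
accuracy = correct / len(ground_truth)    # fraction in [0, 1]
score = 100 * accuracy                    # normalised to the 0-100 range

print(f"Accuracy: {accuracy:.2f} -> benchmark score: {score:.0f}/100")
```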


Multiple choice evaluations

  • A common parameter in this area is "shot". It refers to providing a number of example question-answer pairs as additional context before the actual question. Zero-shot (no examples) and few-shot with a stated number of examples (e.g. 5-shot) are commonly seen; a small construction sketch follows below.
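
As a rough illustration of the "shot" parameter, the sketch below builds a few-shot prompt from example question-answer pairs. The examples and the prompt formatting are hypothetical assumptions; real harnesses use their own templates.

```python
# Hypothetical few-shot prompt construction (formatting is an assumption,
# not the template used by any particular benchmark).

examples = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]
question = "What is the chemical symbol for gold?"

def build_prompt(examples, question):
    """0-shot if `examples` is empty, n-shot for n example pairs."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

print(build_prompt([], question))        # 0-shot
print(build_prompt(examples, question))  # 2-shot
```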

Multiple choice datasets

MMLU dataset

  • Tests multitask accuracy in the humanities, social sciences, STEM and other subjects.

  • Questions come from college admissions tests, exam practice questions, university course readers and other sources.

  • Questions fall into different difficulty levels: "Elementary", "High School", "College" and "Professional".

  • Human accuracy is around 33% for unspecialised test-takers, while estimated expert-level accuracy is 89.8%.

  • 15,908 questions across 57 subjects, with at least 100 questions per subject.

  • Evaluation is few-shot with 5 examples (5-shot).

  • The final score is the average over all subjects, as sketched below.
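
A minimal sketch of the final-score computation: per-subject accuracies are macro-averaged over all subjects. The subject names and accuracy values below are invented for illustration.

```python
# Macro-average over subjects, as described above.
# The per-subject accuracies are placeholders, not real results.

subject_accuracy = {
    "college_mathematics": 0.41,
    "high_school_biology": 0.62,
    "professional_law": 0.38,
}

final_score = 100 * sum(subject_accuracy.values()) / len(subject_accuracy)
print(f"MMLU-style final score: {final_score:.1f}")
```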

ARC dataset

  • 7,787 natural-science multiple-choice questions from standardised tests; non-diagram questions only
  • 2,590 hard questions (which retrieval and word co-occurrence methods fail to answer) and 5,197 easy questions
  • Grade levels range from 3rd to 9th grade, filtered from US standardised tests
  • Reasoning types: analogy, spatial, explanation, counterfactual, algebraic, comparison, logic, linguistic

Usage

Eleuther AI LM Evaluation Harness

  • Install the library and use it via the command line.
  • More information on the available arguments can be found in the project's documentation.
  • Example: lm_eval --model hf --model_args pretrained=<huggingface path or local path>,dtype="float16" --tasks winogrande,lambada_openai,arc_easy --batch_size 16 ...
  • Many options can be set, such as where to find the model, which tasks (i.e. evaluations) to run, how to configure the test run, and more.
  • Evaluations can also be run in parallel; a small Python wrapper sketch follows after this list.
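
The sketch below wraps the command-line call from the example above in a Python subprocess call. The model path is a placeholder and the flags simply mirror the example command, so check them against the harness documentation before relying on them.

```python
# Sketch: invoking the lm-eval CLI from Python.
# The model path is a placeholder; flags mirror the example command above.
import subprocess

cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=<huggingface path or local path>,dtype=float16",
    "--tasks", "winogrande,lambada_openai,arc_easy",
    "--batch_size", "16",
]
subprocess.run(cmd, check=True)
```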

Problem: Inconsistent testing prerequisites

  • Different implementations of how the questions from a dataset are presented to the model may yield different results and scores. For example: is a topic line included? Are keywords like "Question" and "Choices" used?
  • In few-shot settings, the order of the context examples might differ, resulting in different answer preferences.
  • The evaluation may also differ depending on whether the most probable answer out of A, B, C and D is taken, or whether the probability of the correct answer is compared against all answer options. [Fourrier]
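
The sketch below illustrates the two scoring strategies from the last bullet with invented log-probabilities: picking the single most probable answer letter versus comparing the probability of the correct answer against all options. The numbers and the softmax normalisation are assumptions for illustration only.

```python
import math

# Invented log-probabilities the model assigns to each answer option.
logprobs = {"A": -4.1, "B": -2.3, "C": -2.5, "D": -5.0}
correct = "C"

# Strategy 1: the item counts as correct only if the correct option is the
# single most probable one.
predicted = max(logprobs, key=logprobs.get)
strategy1_correct = predicted == correct

# Strategy 2: compare the probability of the correct answer against all
# options (here via a softmax over the option log-probabilities).
total = sum(math.exp(lp) for lp in logprobs.values())
prob_correct = math.exp(logprobs[correct]) / total

print(f"Strategy 1 (argmax): correct={strategy1_correct}")
print(f"Strategy 2 (relative probability of correct answer): {prob_correct:.2f}")
```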

BLEU and ROUGE

Both use n-gram co-occurrence statistics to evaluate how well a sample matches a reference. [Clement]

Bilingual Evaluation Understudy (BLEU)

  • Automatically evaluates machine translation quality by comparing how well the generated words match a human reference translation.

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

  • Measures how well a text is summarised by comparing it against reference summaries.

Relevance for us

  • We can use similar formulas to evaluate how close the generated test code comes to our own example.
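
As a rough sketch of this idea, the snippet below computes a simple n-gram overlap between a generated test and a hand-written reference test: n-gram precision is BLEU-like, n-gram recall is ROUGE-like. This is a toy re-implementation for illustration, not the official BLEU/ROUGE metrics (which add brevity penalties, multiple n-gram orders, stemming, etc.); the example strings are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_scores(candidate, reference, n=2):
    """Return (BLEU-like precision, ROUGE-like recall) of n-gram overlap."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    matched = sum((cand & ref).values())
    precision = matched / max(sum(cand.values()), 1)  # BLEU-like
    recall = matched / max(sum(ref.values()), 1)      # ROUGE-like
    return precision, recall

reference = "def test_add(): assert add(2, 3) == 5"
generated = "def test_add(): result = add(2, 3) ; assert result == 5"
print(overlap_scores(generated, reference, n=2))
```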

Other evaluations

SonarQube or SonarCloud

  • Collecting code metrics with platforms such as SonarQube or SonarCloud
  • Using LeetCode (or other coding-practice platforms): let the model generate test cases for their example problems and count how many pass (a sketch follows after this list)
  • Evaluating complexity in terms of the number of methods, cyclomatic complexity and cognitive complexity
  • How many bugs and vulnerabilities can be detected? [Tosi]
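
A minimal sketch of the "how many generated tests pass" idea: each LLM-generated test file is executed with pytest in a subprocess, and the pass rate is the fraction of files whose test run exits successfully. The file names are placeholders; a real setup would parse pytest's report to get per-test pass/fail counts.

```python
# Sketch: run each generated test file with pytest and count successful runs.
# The file list is a placeholder.
import subprocess

generated_test_files = [
    "tests/test_generated_example_1.py",
    "tests/test_generated_example_2.py",
]

passed = 0
for test_file in generated_test_files:
    result = subprocess.run(["pytest", "-q", test_file])
    if result.returncode == 0:  # exit code 0 means all tests in the file passed
        passed += 1

print(f"Pass rate: {passed}/{len(generated_test_files)}")
```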

Problems:

Bias

  • Query bias – the evaluation queries are not comprehensive or not appropriately distributed
  • Grading bias – systematic skew in how answers are scored
  • Generalisation bias – models overfit to the evaluation data

Automatic benchmarking (MMLU, MT-Bench)

  • Ground-truth benchmarking – impartial grading, but not enough nuance
  • Open-ended benchmarking using LLMs as graders – suffers from grading bias and preference bias
  • Both approaches are prone to contamination and generalisation problems

Large-scale user-facing benchmarks (Chatbot Arena)

  • Collect a vast array of real-world user queries
  • Mitigate the generalisation problem thanks to crowd wisdom
  • Large sample sizes mean less noise and therefore little grading bias
  • Expensive, slow and hard to reproduce
  • Not publicly accessible [Ni]

Useful references

  • Edirisinghe
  • White
  • Ni
  • Lucek
  • Fourrier
  • Clement
  • Tosi