Pass@K

The Pass@K metric is commonly used to evaluate the performance of code generation models. It measures the probability that a correct solution appears within the top $K$ generated outputs for a given task. Here's a detailed breakdown:

What Pass@K Represents

  • When a model is tasked with generating solutions (e.g., writing code to solve a programming problem), it may generate multiple outputs.
  • Pass@K indicates the proportion of problems for which at least one of the $K$ generated outputs is correct.

For example:

  • Pass@1 measures whether the top solution is correct.
  • Pass@5 measures whether any of the top 5 solutions is correct.
  • Pass@10 measures whether any of the top 10 solutions is correct.

Formula

If a model generates $n$ solutions per problem and we test $m$ problems, Pass@K is calculated as:

$\text{Pass@K} = \frac{\text{Number of problems with a correct solution in top } K}{\text{Total number of problems tested}}$

Importance of Pass@K

  1. Reliability: Higher Pass@K means the model is more likely to produce a correct solution within $K$ attempts.
  2. Practicality: In real-world scenarios, users often try multiple solutions, and Pass@K evaluates the likelihood of finding a correct one within $K$ tries.
  3. Evaluation of Models: It is particularly useful for ranking models and understanding their performance across varying levels of output breadth.

Example

Suppose a model is tested on 100 programming tasks with $K = 5$, and for 85 of these tasks, at least one of the top 5 generated outputs is correct. The Pass@5 metric would be:

$\text{Pass@5} = \frac{85}{100} = 0.85$ (or 85%)

This means that 85% of the time, the model can generate a correct solution within its top 5 attempts.
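
A minimal sketch of this calculation in Python, assuming the per-problem pass/fail outcomes are already known (the function name and data layout are illustrative, not from any particular library):

```python
def empirical_pass_at_k(solved_per_problem):
    """Fraction of problems for which at least one of the top-K outputs passed.

    `solved_per_problem` holds one boolean per problem: True if any of that
    problem's top K generated solutions was judged correct.
    """
    if not solved_per_problem:
        return 0.0
    return sum(solved_per_problem) / len(solved_per_problem)

# Mirroring the example above: 85 of 100 problems solved within the top 5 outputs.
results = [True] * 85 + [False] * 15
print(empirical_pass_at_k(results))  # 0.85
```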

Contexts Where Pass@K is Used

  • Code generation (e.g., in models like OpenAI Codex or GPT for programming tasks).
  • Problem-solving models, such as those applied in competitive programming.
  • Evaluating multi-output generation tasks where correctness is binary (pass/fail).

In summary, Pass@K provides a practical measure of a model's ability to solve tasks effectively, considering multiple tries.

The formula below describes how to calculate the Pass@k metric probabilistically, using combinatorics and an expectation over problems.

Formula Breakdown

$\text{pass@k} := \mathbb{E}_{\text{Problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]$

Explanation of Terms:

  1. $\mathbb{E}_{\text{Problems}}$:
    • Represents the expectation (average) over all the problems being evaluated.
    • The metric is computed for multiple problems, and this part ensures the final result reflects the overall performance.
  2. $n$:
    • The total number of solutions generated by the model for a single problem.
  3. $c$:
    • The number of correct solutions (outputs) generated by the model for a single problem.
  4. $k$:
    • The number of top solutions considered (e.g., top 1, top 5, etc.).
  5. $\binom{n}{k}$:
    • The binomial coefficient, read as "n choose k," which represents the total number of ways to select $k$ solutions from the $n$ outputs.
    • $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
  6. $\binom{n-c}{k}$:
    • The binomial coefficient representing the number of ways to select $k$ solutions while excluding all correct ones, since there are $n - c$ incorrect solutions.
  7. $1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$:
    • This term computes the probability that at least one correct solution appears among the $k$ selected outputs.
    • $\frac{\binom{n-c}{k}}{\binom{n}{k}}$ gives the likelihood that all $k$ selected solutions are incorrect; subtracting this from 1 gives the likelihood that at least one of them is correct.
  8. Summing over Problems:
    • For multiple problems, the expectation $\mathbb{E}_{\text{Problems}}$ is computed by averaging this probability across all problems.

Intuition Behind the Formula:

  • The metric focuses on the likelihood that a user, considering only the top $k$ model-generated solutions, will find at least one correct answer.
  • The formula accounts for cases where:
    • $n$: The model generates multiple possible answers.
    • $c$: Some answers are correct, and others are not.
    • $k$: The user only looks at the top $k$ answers.
  • The use of binomial coefficients ensures the combinatorial nature of selecting solutions is properly handled.
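
In code, this per-problem probability can be computed without forming large factorials. Below is a short Python sketch along the lines of the numerically stable estimator popularized by the Codex (HumanEval) evaluation; the function name and example values are illustrative:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased per-problem estimate: 1 - C(n-c, k) / C(n, k).

    n: total samples generated, c: number that passed, k: evaluation budget.
    The ratio of binomial coefficients is expanded as a running product
    to stay numerically stable for large n.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset must contain a correct one.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# The expectation over problems is just the mean of the per-problem estimates.
problems = [(10, 2), (10, 0), (10, 5)]  # illustrative (n, c) pairs
print(np.mean([pass_at_k(n, c, k=3) for n, c in problems]))
```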

Example:

Suppose for one problem:

  • The model generates $n = 10$ solutions.
  • There are $c = 2$ correct solutions.
  • You are evaluating Pass@k with $k = 3$.

The probability of having at least one correct solution in the top 3 is calculated as:

$1 - \frac{\binom{10-2}{3}}{\binom{10}{3}}$
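
Plugging in the numbers, $\binom{8}{3} = 56$ and $\binom{10}{3} = 120$, so

$1 - \frac{56}{120} = 1 - \frac{7}{15} \approx 0.53$

meaning that sampling 3 of the 10 generated solutions uncovers a correct one roughly 53% of the time.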

For multiple problems, this is averaged across all problem instances.

Importance of the Formula:

This formalization is particularly useful for evaluating models that output multiple solutions in contexts like programming or question-answering tasks. It ensures a mathematically rigorous method of estimating performance while accounting for variability in the correctness of outputs.