Pass@K - chunhualiao/public-docs GitHub Wiki
The Pass@K metric is commonly used to evaluate the performance of code generation models. It measures the probability that a correct solution appears within the top ( K ) generated outputs for a given task. Here's a detailed breakdown:
What Pass@K Represents
- When a model is tasked with generating solutions (e.g., writing code to solve a programming problem), it may generate multiple outputs.
- Pass@K indicates the proportion of problems for which at least one of the ( K ) generated outputs is correct.
For example:
- Pass@1 measures whether the top solution is correct.
- Pass@5 measures whether any of the top 5 solutions is correct.
- Pass@10 measures whether any of the top 10 solutions is correct.
Formula
If a model generates ( n ) solutions per problem and we test ( m ) problems, Pass@K is calculated as:
$\text{Pass@K} = \frac{\text{Number of problems with a correct solution in top } K}{\text{Total number of problems tested}}$
Importance of Pass@K
- Reliability: Higher Pass@K means the model is more likely to produce a correct solution within ( K ) attempts.
- Practicality: In real-world scenarios, users often try multiple solutions, and Pass@K evaluates the likelihood of finding a correct one within ( K ) tries.
- Evaluation of Models: It is particularly useful for ranking models and understanding their performance across varying levels of output breadth.
Example
Suppose a model is tested on 100 programming tasks with ( K = 5 ), and for 85 of these tasks, at least one of the top 5 generated outputs is correct. The Pass@5 metric would be:
$\text{Pass@5} = \frac{85}{100} = 0.85 \text{ (or 85%)}$
This means that 85% of the time, the model can generate a correct solution within its top 5 attempts.
Contexts Where Pass@K is Used
- Code generation (e.g., in models like OpenAI Codex or GPT for programming tasks).
- Problem-solving models, such as those applied in competitive programming.
- Evaluating multi-output generation tasks where correctness is binary (pass/fail).
In summary, Pass@K provides a practical measure of a model's ability to solve tasks effectively, considering multiple tries.
The formula provided describes how to calculate the Pass@k metric in a probabilistic way, incorporating combinatorics and the expected value across problems.
Formula Breakdown
$\text{pass@k} := \mathbb{E}_{\text{Problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]$
Explanation of Terms:
-
(\mathbb{E}_{\text{Problems}}):
- Represents the expectation (average) over all the problems being evaluated.
- The metric is computed for multiple problems, and this part ensures the final result reflects the overall performance.
-
(n):
- The total number of solutions generated by the model for a single problem.
-
(c):
- The number of correct solutions (outputs) generated by the model for a single problem.
-
(k):
- The number of top solutions considered (e.g., top 1, top 5, etc.).
-
(\binom{n}{k}):
- The binomial coefficient, also read as "n choose k," represents the total number of ways to select (k) solutions from (n) outputs.
- $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
-
(\binom{n-c}{k}):
- The binomial coefficient representing the number of ways to select (k) solutions excluding the correct ones. This assumes there are (n - c) incorrect solutions.
-
(1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}):
- This term computes the probability of at least one correct solution being in the top (k) selections.
- (\frac{\binom{n-c}{k}}{\binom{n}{k}}) gives the likelihood that all (k) selected solutions are incorrect. Subtracting this from 1 gives the likelihood that at least one of the selected solutions is correct.
-
Summing over Problems:
- For multiple problems, the expectation (\mathbb{E}_{\text{Problems}}) is computed by averaging this probability across all problems.
Intuition Behind the Formula:
- The metric focuses on the likelihood that a user, considering only the top (k) model-generated solutions, will find at least one correct answer.
- The formula accounts for cases where:
- (n): The model generates multiple possible answers.
- (c): Some answers are correct, and others are not.
- (k): The user only looks at the top (k) answers.
- The use of binomial coefficients ensures the combinatorial nature of selecting solutions is properly handled.
Example:
Suppose for one problem:
- The model generates (n = 10) solutions.
- There are (c = 2) correct solutions.
- You are evaluating Pass@k with (k = 3).
The probability of having at least one correct solution in the top 3 is calculated as:
$1 - \frac{\binom{10-2}{3}}{\binom{10}{3}}$
For multiple problems, this is averaged across all problem instances.
Importance of the Formula:
This formalization is particularly useful for evaluating models that output multiple solutions in contexts like programming or question-answering tasks. It ensures a mathematically rigorous method of estimating performance while accounting for variability in the correctness of outputs.