# Rejection Sampling
## What is Rejection Sampling in Reinforcement Learning?
Rejection sampling is a classic technique from probability and statistics that can also be applied in reinforcement learning (RL)—especially in off-policy settings. In a nutshell, you want to sample (draw experiences) from some target distribution (often corresponding to the target policy), but you only have access to samples from a different distribution (often the behavior policy).
- **General idea.** You generate data using a proposal distribution (e.g., your behavior policy) and then decide whether to accept or reject each data point. The acceptance probability is chosen so that the accepted data follows the target distribution.
- **Why do this in RL?** In off-policy RL, you often gather experiences with a behavior policy $\mu(a \mid s)$ but want to evaluate or improve a different policy $\pi(a \mid s)$. A simple (though not always practical) way to make your samples match $\pi$ is to accept each $(s, a)$ pair from the replay buffer with probability

  $$P(\text{accept} \mid s, a) \;=\; \min\!\biggl(1,\; \frac{\pi(a \mid s)}{\mu(a \mid s)}\biggr)$$

  If accepted, the transition $(s, a, r, s')$ is treated as if it came from $\pi$; if rejected, it is discarded. A code sketch of this filter follows the list.
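As a concrete illustration, here is a minimal Python sketch of this filter. The buffer layout and the `pi_prob` / `mu_prob` callables (returning $\pi(a \mid s)$ and $\mu(a \mid s)$) are hypothetical, chosen only for the example:

```python
import random

def rejection_filter(replay_buffer, pi_prob, mu_prob):
    """Keep each transition (s, a, r, s_next) with probability min(1, pi(a|s)/mu(a|s)).

    pi_prob(s, a) and mu_prob(s, a) are assumed to return the target- and
    behavior-policy probabilities of taking action a in state s; mu_prob must
    be positive for every stored action. Note: the classical rejection sampler
    accepts with probability pi / (M * mu) for M >= max_a pi(a|s)/mu(a|s),
    which is exactly unbiased; the min(1, .) rule here is the common clipped variant.
    """
    accepted = []
    for (s, a, r, s_next) in replay_buffer:
        ratio = pi_prob(s, a) / mu_prob(s, a)
        if random.random() < min(1.0, ratio):
            accepted.append((s, a, r, s_next))
    return accepted

# Toy usage with tabular policies over two actions (the state is ignored here).
pi = {0: 0.9, 1: 0.1}   # target policy pi(a|s)
mu = {0: 0.5, 1: 0.5}   # behavior policy mu(a|s) that generated the data
buffer = [(0, random.choice([0, 1]), 0.0, 0) for _ in range(1000)]
filtered = rejection_filter(buffer, lambda s, a: pi[a], lambda s, a: mu[a])
# About 83% (0.5 / 0.6) of the kept actions are action 0 -- close to, but not
# exactly, pi's 90%, because of the min(1, .) clipping.
```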
## When (and Why) to Use It
- **Off-Policy Correction**
  - If you need unbiased on-policy data from an off-policy dataset, rejection sampling provides a direct way to "filter out" transitions that are unlikely under the target policy.
  - It helps mitigate the distribution mismatch between the policy that generated the data ($\mu$) and the policy of interest ($\pi$).
- **Theoretical Simplicity**
  - Rejection sampling is conceptually simple and has clear theoretical guarantees for producing unbiased samples from $\pi$, provided $\mu(a \mid s) > 0$ whenever $\pi(a \mid s) > 0$ and the ratio $\pi / \mu$ is bounded (see the short derivation after this list).
- **Prototyping and Research**
  - Though rarely used in large-scale production RL systems because of the efficiency concerns below, it is often used in research to illustrate core ideas in off-policy evaluation or policy optimization without immediately resorting to more complex methods.
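Here is the standard one-line derivation behind that unbiasedness guarantee. The classical rejection sampler accepts an action drawn from $\mu$ with probability $\pi(a \mid s) / (M\,\mu(a \mid s))$, where $M \ge \max_a \pi(a \mid s)/\mu(a \mid s)$; conditioning on acceptance then recovers $\pi$ exactly. The $\min(1, \pi/\mu)$ rule above is the clipped $M = 1$ variant, which is only approximately unbiased whenever $\pi(a \mid s) > \mu(a \mid s)$ for some actions:

$$
p(a \mid s, \text{accept})
\;=\;
\frac{\mu(a \mid s)\,\dfrac{\pi(a \mid s)}{M\,\mu(a \mid s)}}
     {\sum_{a'} \mu(a' \mid s)\,\dfrac{\pi(a' \mid s)}{M\,\mu(a' \mid s)}}
\;=\;
\frac{\pi(a \mid s)/M}{1/M}
\;=\;
\pi(a \mid s).
$$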
## Practical Considerations and Limitations
- **Low Acceptance Rate:** If $\pi$ differs significantly from $\mu$, many samples will be rejected, which is highly wasteful; you may end up discarding most of your replay buffer.
- **High Variance / Data Inefficiency:** If you keep only a small fraction of your data, learning can become very slow, or the resulting estimates can have very high variance.
- **Importance Sampling Alternatives:** In practice, importance sampling weights (and their variants) are used far more often than hard rejection. This allows all samples to be used, albeit with different weights, often stabilized by techniques such as weighted importance sampling, per-decision importance sampling, or truncated/clipped importance weights (see the sketch after this list).
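For contrast with hard rejection, here is a minimal sketch of the truncated importance-weight alternative, reusing the hypothetical `pi_prob` / `mu_prob` callables and buffer layout from the earlier example:

```python
def clipped_is_weights(replay_buffer, pi_prob, mu_prob, clip=10.0):
    """Return a truncated importance weight min(pi(a|s)/mu(a|s), clip) per transition.

    Unlike rejection sampling, no transition is discarded: every sample is kept
    and reweighted (e.g., when averaging returns or scaling loss terms), trading
    a small clipping bias for lower variance and full data efficiency.
    """
    return [min(pi_prob(s, a) / mu_prob(s, a), clip)
            for (s, a, _r, _s_next) in replay_buffer]
```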
In summary, rejection sampling is a straightforward but potentially inefficient way to filter off-policy experience so that it matches a target policy’s distribution. It is typically favored in theoretically oriented contexts or toy problems to ensure unbiasedness—but in real-world, large-scale RL problems, it is often replaced by more data-efficient methods (importance sampling, doubly robust estimators, etc.).