reinforcement learning: policy
A policy in RL is a strategy that an agent uses to decide which action to take in a given state. It can be deterministic or stochastic, and it can range from random to optimal. The goal of RL algorithms is to learn an optimal policy that maximizes the cumulative reward.
- [state]--{Policy}---> which action?
In reinforcement learning, a policy is a mapping from states to actions (possibly stochastic) that describes how an agent behaves at any given time.
When we talk about behavior and target policies, it’s often in the context of off-policy learning,
- where the data (experience) is generated by one policy (the behavior policy),
- but we want to evaluate or improve a different policy (the target policy).
In reinforcement learning (RL), policy and policy network are fundamental concepts. Here's what they mean:
✅ What is a Policy?
A policy is a strategy used by an agent to decide what action to take in a given state.
- Formally, a policy is a mapping from states to actions:
  $$ \pi(a | s) = \text{Probability of taking action } a \text{ in state } s $$
- It can be:
  - Deterministic: always picks the same action for a given state.
    $$ a = \pi(s) $$
  - Stochastic: chooses actions according to a probability distribution over actions (both variants are sketched in code after this list).
- The goal in RL is to learn an optimal policy that maximizes expected cumulative reward over time (called the "return").
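To make the deterministic/stochastic distinction concrete, here is a minimal Python sketch. The state names, probabilities, and function names are made-up illustrations, not part of any library or specific algorithm:

```python
# Illustrative sketch: a deterministic and a stochastic policy over a toy
# state/action space. All names and numbers are invented for clarity.
import numpy as np

ACTIONS = ["left", "right"]
rng = np.random.default_rng(0)

def deterministic_policy(state):
    # a = pi(s): the same action is always returned for a given state
    return "right" if state == "goal is right" else "left"

def stochastic_policy(state):
    # pi(a|s): a probability distribution over actions, sampled each time
    probs = [0.2, 0.8] if state == "goal is right" else [0.8, 0.2]
    return rng.choice(ACTIONS, p=probs)

print(deterministic_policy("goal is right"))  # always 'right'
print(stochastic_policy("goal is right"))     # 'right' about 80% of the time
```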
✅ What is a Policy Network?
A policy network is a neural network that represents or approximates the policy $\pi(a | s)$.
- It takes in a state as input.
- It outputs:
  - either a probability distribution over actions (stochastic policy), or
  - a specific action (deterministic policy).
Depending on the environment and algorithm, the output might be:
- A softmax layer for discrete action spaces
- Mean and variance for continuous actions (e.g., in Gaussian policies)
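Below is a minimal sketch of such a policy network for the discrete case, assuming PyTorch is available; the class name, layer sizes, and state dimension are arbitrary choices for illustration, not from any particular algorithm:

```python
# Illustrative policy network for a discrete action space (assumes PyTorch).
# For continuous actions, the output head would instead produce a mean and log-variance.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):                      # hypothetical name
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),        # logits, one per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.body(state), dim=-1)  # pi(a|s) via softmax

net = PolicyNet(state_dim=4, n_actions=3)
probs = net(torch.randn(4))                      # probability vector over 3 actions
action = torch.multinomial(probs, num_samples=1) # sample an action index from pi(a|s)
```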
🧠 Example
Imagine a self-driving car agent:
- The state could be sensor readings, position, speed, etc.
- The action could be: turn left, go straight, or turn right.
Policy:
- $\pi(\text{turn left} | \text{state}) = 0.1$
- $\pi(\text{go straight} | \text{state}) = 0.8$
- $\pi(\text{turn right} | \text{state}) = 0.1$
Policy Network:
- A neural net that takes the state as input and outputs these probabilities.
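As a quick sanity check on what such probabilities mean in practice, the sketch below samples repeatedly from the example distribution above. The probabilities come from the example; the code itself is an illustrative assumption:

```python
# Sample many actions from pi(. | state) = [0.1, 0.8, 0.1] and check the
# empirical frequencies (illustrative sketch).
import numpy as np

rng = np.random.default_rng(42)
actions = ["turn left", "go straight", "turn right"]
probs = [0.1, 0.8, 0.1]

samples = rng.choice(actions, size=10_000, p=probs)
for a in actions:
    print(f"{a}: {np.mean(samples == a):.3f}")   # roughly 0.1, 0.8, 0.1
```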
🔄 Relationship to Value-Based Methods
- In policy-based methods (like REINFORCE or PPO), the policy is learned directly.
- In value-based methods (like Q-learning), the policy is indirect: actions are chosen based on estimated value functions.
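The contrast fits in a couple of lines of illustrative code: a value-based agent derives its action from an estimated Q-function, while a policy-based agent samples directly from a learned distribution. The tabular `Q` and `pi` below are placeholders, not the output of any specific algorithm:

```python
# Illustrative contrast: action selection in value-based vs. policy-based methods.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

Q = rng.normal(size=(n_states, n_actions))           # placeholder Q(s, a) estimates
pi = np.full((n_states, n_actions), 1.0 / n_actions) # placeholder pi(a | s) table

s = 2
value_based_action = int(np.argmax(Q[s]))                  # greedy w.r.t. Q (policy is implicit)
policy_based_action = int(rng.choice(n_actions, p=pi[s]))  # sample directly from the learned policy
```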
Behavior Policy
- Definition: The behavior policy, usually denoted $\mu$, is the policy that is actually used to interact with the environment and collect data (state, action, reward, next state).
- Purpose: You might choose a behavior policy that explores the environment more broadly (e.g., $\epsilon$-greedy) so that you can learn about many different states and actions.
- Examples:
  - A random policy, which selects actions uniformly at random.
  - An $\epsilon$-greedy policy, where most of the time you select the best-known action, but with probability $\epsilon$ you choose an action at random to ensure continued exploration (a minimal sketch follows this list).
- Key Point: The experiences you observe (the transitions $(s, a, r, s')$) come from $\mu$. If $\mu \neq \pi$, it means your dataset may not directly reflect what would happen under the policy $\pi$ you actually want to evaluate or improve.
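A minimal sketch of such an $\epsilon$-greedy behavior policy, assuming a tabular `Q` array of shape `(n_states, n_actions)`; the function name and defaults are illustrative:

```python
# Illustrative epsilon-greedy behavior policy mu (Q is an assumed tabular estimate).
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon=0.1):
    """Behavior policy mu: mostly greedy w.r.t. Q, uniformly random with prob. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[state]))            # exploit: best-known action

Q = np.zeros((5, 3))                           # placeholder value estimates
a = epsilon_greedy(Q, state=2)                 # action actually executed in the environment
```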
Target Policy
- Definition: The target policy, usually denoted $\pi$, is the policy you ultimately want to evaluate or improve.
- Purpose: In many RL algorithms, especially off-policy algorithms, the agent's goal is to learn or evaluate $\pi$, but it does not necessarily generate data using $\pi$.
- Examples:
  - A greedy policy with respect to the agent's current value function or Q-function (sketched in code after this list).
  - A policy drawn from a more complex policy class that you gradually fit through gradient-based updates.
- Key Point: Even though the agent collects data using $\mu$, the learning objective focuses on $\pi$. We want accurate estimates of how $\pi$ would perform, or to find the best $\pi$ given the data.
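A sketch of a greedy target policy $\pi$ derived from the same kind of tabular `Q` as above (illustrative; the greedy policy is the target policy that Q-learning implicitly evaluates and improves):

```python
# Illustrative greedy target policy pi derived from a tabular Q estimate.
import numpy as np

def greedy_action(Q, state):
    """Deterministic target policy pi: pick the action with the highest estimated value."""
    return int(np.argmax(Q[state]))

def target_policy_probs(Q, state):
    """pi(a | s) as an explicit probability vector (useful for importance sampling)."""
    probs = np.zeros(Q.shape[1])
    probs[np.argmax(Q[state])] = 1.0
    return probs
```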
Relationship and Off-Policy Learning
- Off-Policy: "Off-policy" methods allow learning about (i.e., evaluating or improving) a target policy $\pi$ using data generated by a different behavior policy $\mu$.
- Why This Matters:
  - $\mu$ may explore the environment more broadly than $\pi$.
  - $\pi$ might be improved based on the variety of experiences generated by $\mu$.
  - Ensuring correctness typically requires techniques like importance sampling or rejection sampling (in simpler approaches) to account for the difference between $\mu$ and $\pi$ (a minimal importance-sampling sketch follows this list).
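The sketch below shows one simple way to do that correction with ordinary importance sampling over a whole episode. The function names and the trajectory format are assumptions for illustration, and it requires that $\mu$ assigns nonzero probability to every action $\pi$ might take:

```python
# Illustrative ordinary importance-sampling estimate of the return under pi,
# using a trajectory collected under the behavior policy mu.
def importance_weighted_return(trajectory, pi_probs, mu_probs, gamma=0.99):
    """trajectory: list of (state, action, reward) tuples generated by mu.
    pi_probs(s) / mu_probs(s): probability vectors pi(.|s) and mu(.|s)."""
    rho, G = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        rho *= pi_probs(s)[a] / mu_probs(s)[a]   # cumulative importance ratio
        G += (gamma ** t) * r                    # discounted return observed under mu
    return rho * G                               # reweighted so it estimates the return under pi
```

Note that if $\pi$ is the deterministic greedy policy above, the ratio drops to zero as soon as $\mu$ takes a non-greedy action, since $\pi$ assigns that action probability 0.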
In summary, the behavior policy is what you use to explore and gather experiences; the target policy is what you really care about evaluating or improving. Off-policy learning methods bridge this gap between how you gather data and which policy you want to learn about.