TinyZero:Action
Actions are the tokens predicted by the LLM. The action space is the model's vocabulary, so its size equals the vocabulary size. Actions are sampled from the model's output logits during generation.
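For a concrete sense of scale, the size of this discrete action space is simply the tokenizer's vocabulary size. A minimal sketch (the checkpoint name below is only an illustrative assumption; TinyZero typically builds on Qwen2.5 models):

```python
from transformers import AutoTokenizer

# Any causal-LM tokenizer works; the exact checkpoint here is an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
print(f"Action space size (vocab size): {len(tokenizer)}")
```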
Action Space Implementation in veRL
Based on the code examined, particularly the `NaiveRollout` implementation, here is how the action space is concretely implemented in veRL.
The action space in veRL is implemented through token sampling in the generation process, specifically:
1. Action Space Definition:
- The action space is implicitly defined as the model's vocabulary.
- In `NaiveRollout.generate_sequences()`, the distribution over actions is represented by the model's output logits, which (after slicing to the last position) have shape `(batch_size, vocab_size)`.
- Each logit is the model's unnormalized score for selecting the corresponding vocabulary token as the next action; a softmax turns these scores into probabilities.
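To make the shapes concrete, here is a toy sketch (not veRL code) showing how the last-position logits form a `(batch_size, vocab_size)` distribution over actions:

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 5, 32_000  # toy dimensions

# Stand-in for the model's output: one logit per vocabulary token at every position.
logits = torch.randn(batch_size, seq_len, vocab_size)

# Only the last position matters when choosing the next action (token).
next_token_logits = logits[:, -1, :]          # shape: (batch_size, vocab_size)
next_token_probs = F.softmax(next_token_logits, dim=-1)

print(next_token_logits.shape)                # torch.Size([2, 32000])
print(next_token_probs.sum(dim=-1))           # each row sums to ~1.0
```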
2. Action Selection Implementation:
```python
# In naive_rollout.py
logits = logits[:, -1, :] / self.config.temperature
probs = F.softmax(logits, dim=-1)
if self.config.do_sample:
    # Stochastic action selection through multinomial sampling
    idx_next = torch.multinomial(probs, num_samples=1)
else:
    # Deterministic action selection through argmax
    idx_next = torch.argmax(probs, dim=-1, keepdim=True)
```
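As a design note: during RL rollouts the stochastic branch (`do_sample=True`) is what provides exploration, letting the policy discover higher-reward token sequences; greedy `argmax` selection is typically reserved for evaluation, since it collapses the policy to a deterministic mapping.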
3. Action Space Control:
- Temperature scaling: controls the randomness of the action distribution.
- Top-k filtering: restricts the action space to the `k` most likely tokens:
```python
if self.config.top_k is not None:
    v, _ = torch.topk(logits, min(self.config.top_k, logits.size(-1)))
    logits[logits < v[:, [-1]]] = -float('Inf')
```
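Putting temperature and top-k together, a self-contained sketch of a single sampling step might look like the following. This mirrors the snippets above but is not a verbatim copy of `naive_rollout.py`:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, do_sample=True):
    """One generation step: pick the next action (token) from last-position logits.

    logits: tensor of shape (batch_size, vocab_size) for the last position.
    """
    logits = logits / temperature                     # sharpen or flatten the distribution
    if top_k is not None:
        # Keep only the top_k highest logits; mask everything else out.
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits = logits.masked_fill(logits < v[:, [-1]], float('-inf'))
    probs = F.softmax(logits, dim=-1)
    if do_sample:
        # Stochastic action selection (exploration).
        return torch.multinomial(probs, num_samples=1)
    # Deterministic action selection (greedy).
    return torch.argmax(probs, dim=-1, keepdim=True)

# Example usage with toy logits:
toy_logits = torch.randn(2, 32_000)
next_ids = sample_next_token(toy_logits, temperature=0.7, top_k=50)
print(next_ids.shape)  # torch.Size([2, 1])
```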
4. Action Tracking:
- The log probabilities of the selected actions (tokens) are stored in `log_probs` for policy optimization.
- They are computed with `logprobs_from_logits()` for each selected action:
```python
log_probs = logprobs_from_logits(logits=logits, labels=response)
```
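The exact helper lives in veRL's utilities; a common way to compute the same quantity, shown here only as a hedged sketch, is a `log_softmax` followed by a `gather` on the sampled token ids:

```python
import torch
import torch.nn.functional as F

def logprobs_from_logits_sketch(logits, labels):
    """Log-probability of each selected token (a rough sketch, not veRL's exact code).

    logits: (batch_size, seq_len, vocab_size) - policy outputs over the response.
    labels: (batch_size, seq_len)             - the token ids actually sampled.
    """
    logp = F.log_softmax(logits, dim=-1)      # normalize over the vocabulary
    return torch.gather(logp, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)

# Toy check: the result has shape (batch_size, seq_len).
logits = torch.randn(2, 4, 100)
labels = torch.randint(0, 100, (2, 4))
print(logprobs_from_logits_sketch(logits, labels).shape)  # torch.Size([2, 4])
```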
Summary
This implementation allows veRL to:
- Sample actions (tokens) from the policy network (LLM).
- Control the exploration-exploitation tradeoff through temperature and top-k.
- Track action probabilities for policy gradient updates.
- Support both stochastic and deterministic action selection.
The action space is therefore discrete: one vocabulary token is selected per step, while the policy defines a continuous probability distribution over that vocabulary. This makes generation directly usable by policy gradient methods such as PPO while preserving the language model's token-by-token generation process.
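These tracked per-token log-probabilities are exactly what a PPO-style update consumes. As a rough illustration (not veRL's actual trainer code), the per-token probability ratio and clipped surrogate objective can be sketched as:

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_ratio=0.2):
    """Clipped PPO surrogate over per-token actions (illustrative sketch only).

    All inputs have shape (batch_size, seq_len); advantages come from the
    reward/critic pipeline, which is outside the scope of this page.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy example with random tensors, just to show the shapes.
new_lp, old_lp = torch.randn(2, 4), torch.randn(2, 4)
adv = torch.randn(2, 4)
print(ppo_clip_loss(new_lp, old_lp, adv))
```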