# train_tiny_zero.sh
Explain the command below in detail, including what each option does:

The script `scripts/train_tiny_zero.sh` trains a model using the PPO (Proximal Policy Optimization) algorithm. It executes the `verl.trainer.main_ppo` module with the following command:
```bash
python3 -m verl.trainer.main_ppo \
    data.train_files=$DATA_DIR/train.parquet \
    data.val_files=$DATA_DIR/test.parquet \
    data.train_batch_size=256 \
    data.val_batch_size=1312 \
    data.max_prompt_length=256 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=$BASE_MODEL \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size=8 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=$BASE_MODEL \
    critic.ppo_micro_batch_size=8 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=['wandb'] \
    trainer.val_before_train=False \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=$N_GPUS \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=100 \
    trainer.project_name=TinyZero \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log
```
Here's a detailed breakdown of the command and its components:

## Base Command

```bash
python3 -m verl.trainer.main_ppo
```

- Purpose: Runs the Proximal Policy Optimization (PPO) training entry point from the `verl` library.
- `-m`: Executes the specified module (`verl.trainer.main_ppo`) as a script.
## Data Configuration

| Parameter | Value | Description |
|---|---|---|
| `data.train_files` | `$DATA_DIR/train.parquet` | Path to training data (Parquet format). |
| `data.val_files` | `$DATA_DIR/test.parquet` | Path to validation/evaluation data. |
| `data.train_batch_size` | `256` | Batch size for training. |
| `data.val_batch_size` | `1312` | Batch size for validation. |
| `data.max_prompt_length` | `256` | Truncate prompts to this many tokens. |
| `data.max_response_length` | `1024` | Truncate responses to this many tokens. |
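To make the data options concrete, here is a hypothetical sketch of reading a Parquet prompt file and truncating prompts to `data.max_prompt_length` tokens. verl's `RLHFDataset` handles this internally and differently, so treat the function name `load_prompt_ids` and the details below as illustrative only:

```python
# Hypothetical sketch (not verl's RLHFDataset): read $DATA_DIR/train.parquet
# and tokenize chat-format prompts, keeping at most MAX_PROMPT_LENGTH tokens.
import pandas as pd
from transformers import AutoTokenizer

MAX_PROMPT_LENGTH = 256    # data.max_prompt_length

def load_prompt_ids(parquet_path: str, model_name: str):
    df = pd.read_parquet(parquet_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    prompt_ids = []
    for messages in df["prompt"]:
        # Render the chat-format prompt to text, then tokenize and left-truncate
        # so the most recent tokens stay within the prompt budget.
        text = tokenizer.apply_chat_template(list(messages), tokenize=False,
                                             add_generation_prompt=True)
        ids = tokenizer(text)["input_ids"]
        prompt_ids.append(ids[-MAX_PROMPT_LENGTH:])
    return prompt_ids
```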
## Actor/Rollout Configuration

| Parameter | Value | Description |
|---|---|---|
| `actor_rollout_ref.model.path` | `$BASE_MODEL` | Path to the base model (e.g., a Hugging Face model) used for the actor. |
| `actor_rollout_ref.actor.optim.lr` | `1e-6` | Learning rate for the actor's optimizer. |
| `actor_rollout_ref.actor.ppo_mini_batch_size` | `128` | Mini-batch size for PPO updates. |
| `actor_rollout_ref.actor.ppo_micro_batch_size` | `8` | Micro-batch size for gradient accumulation. |
| `actor_rollout_ref.rollout.log_prob_micro_batch_size` | `8` | Batch size for computing log probabilities during rollouts. |
| `actor_rollout_ref.rollout.tensor_model_parallel_size` | `$ROLLOUT_TP_SIZE` | Tensor parallelism degree for distributed inference. |
| `actor_rollout_ref.rollout.gpu_memory_utilization` | `0.4` | Fraction of GPU memory reserved for rollouts (prevents OOM). |
| `actor_rollout_ref.ref.log_prob_micro_batch_size` | `4` | Batch size for reference model log probability calculations. |
## Critic Configuration

| Parameter | Value | Description |
|---|---|---|
| `critic.optim.lr` | `1e-5` | Learning rate for the critic's optimizer. |
| `critic.model.path` | `$BASE_MODEL` | Base model for the critic (often the same as the actor). |
| `critic.ppo_micro_batch_size` | `8` | Micro-batch size for critic updates. |
## Algorithm Configuration

| Parameter | Value | Description |
|---|---|---|
| `algorithm.kl_ctrl.kl_coef` | `0.001` | Coefficient for the KL divergence penalty (controls deviation from the reference policy). |
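As a rough illustration of what the KL coefficient does, the sketch below folds a per-token KL estimate, scaled by `kl_coef`, into token-level rewards. It is a simplified stand-in for verl's `apply_kl_penalty`, and the function name is illustrative:

```python
import torch

def kl_penalized_rewards(token_scores: torch.Tensor,
                         log_probs: torch.Tensor,
                         ref_log_probs: torch.Tensor,
                         kl_coef: float = 0.001) -> torch.Tensor:
    """Subtract a per-token KL estimate (scaled by kl_coef) from raw scores.

    token_scores, log_probs, ref_log_probs: shape (batch, response_len).
    Uses the simple log-ratio estimator log pi(a|s) - log pi_ref(a|s).
    """
    kl = log_probs - ref_log_probs          # per-token KL estimate
    return token_scores - kl_coef * kl      # penalized token-level rewards
```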
## Training Infrastructure & Logging

| Parameter | Value | Description |
|---|---|---|
| `trainer.logger` | `['wandb']` | Log metrics to Weights & Biases (W&B). |
| `trainer.val_before_train` | `False` | Skip validation before starting training. |
| `trainer.default_hdfs_dir` | `null` | Disable HDFS storage (use local/NERSC storage instead). |
| `trainer.n_gpus_per_node` | `$N_GPUS` | Number of GPUs per node (e.g., 4 on Perlmutter). |
| `trainer.nnodes` | `1` | Number of nodes (1 node = 4 A100 GPUs on Perlmutter). |
| `trainer.save_freq` | `100` | Save a checkpoint every 100 steps. |
| `trainer.test_freq` | `100` | Run validation every 100 steps. |
| `trainer.project_name` | `TinyZero` | W&B project name. |
| `trainer.experiment_name` | `$EXPERIMENT_NAME` | Unique identifier for this run. |
| `trainer.total_epochs` | `15` | Total number of training epochs. |
## Logging & Output

```bash
2>&1 | tee verl_demo.log
```

- `2>&1`: Redirects `stderr` to `stdout`.
- `tee verl_demo.log`: Saves all output to `verl_demo.log` while still printing it to the terminal.
## Key Concepts

- PPO Workflow:
  - Actor: Generates actions (responses) during rollouts.
  - Critic: Evaluates the value of states (prompt + response).
  - KL Control: Penalizes large deviations from the reference policy to stabilize training.
- Batch Sizes (see the sketch after this list):
  - Micro-batch: Smaller batches used for gradient accumulation (memory optimization).
  - Mini-batch: Subsets of the full batch used for each PPO update.
- Distributed Training:
  - `tensor_model_parallel_size`: Splits the model across GPUs (e.g., for large models).
- Checkpointing:
  - `save_freq=100`: Frequent saves allow resuming after interruptions.
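A simplified illustration of how the three batch sizes interact (this is not verl's actual update loop, and `compute_ppo_loss` is a hypothetical helper): the global batch of 256 samples is split into PPO mini-batches of 128, and each optimizer step accumulates gradients over micro-batches of 8 so only 8 samples sit on the GPU at a time.

```python
# Simplified illustration of train_batch_size / ppo_mini_batch_size /
# ppo_micro_batch_size. Not verl's code; compute_ppo_loss is hypothetical.
TRAIN_BATCH_SIZE = 256     # data.train_batch_size
PPO_MINI_BATCH = 128       # actor_rollout_ref.actor.ppo_mini_batch_size
PPO_MICRO_BATCH = 8        # actor_rollout_ref.actor.ppo_micro_batch_size

def ppo_update(policy, optimizer, global_batch):
    assert len(global_batch) == TRAIN_BATCH_SIZE
    # One optimizer step per mini-batch...
    for i in range(0, TRAIN_BATCH_SIZE, PPO_MINI_BATCH):
        mini_batch = global_batch[i:i + PPO_MINI_BATCH]
        optimizer.zero_grad()
        # ...with gradients accumulated over micro-batches to save memory.
        num_micro = PPO_MINI_BATCH // PPO_MICRO_BATCH
        for j in range(0, PPO_MINI_BATCH, PPO_MICRO_BATCH):
            micro_batch = mini_batch[j:j + PPO_MICRO_BATCH]
            loss = policy.compute_ppo_loss(micro_batch)   # hypothetical helper
            (loss / num_micro).backward()
        optimizer.step()
```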
## Example Workflow on Perlmutter

```bash
# Set environment variables
export DATA_DIR=/global/cscratch1/sd/yourname/data
export BASE_MODEL=EleutherAI/pythia-1.4b
export ROLLOUT_TP_SIZE=2
export N_GPUS=4
export EXPERIMENT_NAME=run1

# Submit job
sbatch job_script.sh
```

For more details, refer to `scripts/train_tiny_zero.sh`.
## Step-by-Step Execution Overview

The script `scripts/train_tiny_zero.sh` runs the `verl.trainer.main_ppo` module with a set of configuration parameters. These parameters define data paths, model paths, batch sizes, learning rates, and other training settings. Environment variables such as `$DATA_DIR` and `$BASE_MODEL` are used to parameterize the script.
### 1. Script Execution and Configuration

When executed, the script:

- Loads configuration settings (model, dataset, training hyperparameters).
- Starts the PPO training loop using `RayPPOTrainer.fit()`.
- Iterates over the training data, updating the policy and value networks.
### 2. Data Loading

The dataset is loaded using `RLHFDataset` (in `verl/utils/dataset/rl_dataset.py`).

Example raw data entry before processing by `__getitem__`:

```json
{
  "prompt": [{"role": "user", "content": "Explain the theory of relativity."}],
  "response": "...",
  "data_source": "some_dataset",
  "extra_info": {"index": 123, "some_other_info": "value"}
}
```
#### Processed Batch (After `collate_fn`)

After tokenization and processing, a batch (with batch size = 1 for simplicity) looks like:

```python
batch_dict = {
    'input_ids': torch.tensor([...token IDs for prompt...]),   # tokenized prompt
    'attention_mask': torch.tensor([1, 1, 1, ..., 1]),         # attention mask
    'position_ids': torch.tensor([0, 1, 2, ...]),              # position IDs
    'response': '...',
    'data_source': 'some_dataset',
    'raw_prompt': [{'role': 'user', 'content': 'Explain the theory of relativity.'}],
    'index': 123,
    'extra_info': {'index': 123, 'some_other_info': 'value'}
}
```
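For intuition, a hypothetical `collate_fn` along these lines would produce such a batch; verl's actual implementation differs in details (for example, how non-tensor fields are stored):

```python
import torch

def collate_fn(samples):
    """Hypothetical collate: stack tensor fields (assumed padded to equal length),
    keep non-tensor fields as Python lists."""
    batch = {}
    for key in samples[0]:
        values = [sample[key] for sample in samples]
        if isinstance(values[0], torch.Tensor):
            batch[key] = torch.stack(values)   # -> shape (batch_size, seq_len)
        else:
            batch[key] = values                # e.g. raw_prompt, data_source, extra_info
    return batch
```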
### 3. Training Loop (`RayPPOTrainer.fit()`)

The `fit()` method contains the main PPO training loop.
#### Step 1: Data Preparation

- Converts `batch_dict` into a `DataProto` object.
- Extracts generation-related inputs.

```python
batch: DataProto = DataProto.from_single_dict(batch_dict)
gen_batch = batch.pop(batch_keys=['input_ids', 'attention_mask', 'position_ids'])
```

Now `gen_batch` contains only the data needed for response generation:

```python
gen_batch = {
    'input_ids': torch.tensor([...token IDs for prompt...]),
    'attention_mask': torch.tensor([1, 1, 1, ..., 1]),
    'position_ids': torch.tensor([0, 1, 2, ...])
}
```
#### Step 2: Generation (Rollout)

- The actor network generates a response to the prompt.

```python
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
```

Assume the generated response is "General relativity explains gravity...". The output (`gen_batch_output`) contains:

```python
gen_batch_output = {
    'responses': torch.tensor([...token IDs for response...]),   # tokenized response
    'old_log_probs': torch.tensor([-2.5, -1.8, ..., -3.2]),      # log probabilities
    'attention_mask': torch.tensor([1, 1, 1, ..., 0, 0])         # attention mask
}
```

- The batch is updated with the new response data:

```python
batch.batch['uid'] = np.array([str(uuid.uuid4()) for _ in range(len(batch.batch))], dtype=object)
batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
batch = batch.union(gen_batch_output)
```
#### Step 3: Sequence Length Balancing

Balances batch sequence lengths across processes:

```python
self._balance_batch(batch, metrics=metrics)
```
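The idea behind this step can be illustrated with a toy greedy partitioner (this is not `_balance_batch` itself): samples are assigned to workers longest-first so every worker ends up with a similar total token count.

```python
# Toy illustration of sequence-length balancing (not verl's _balance_batch):
# assign samples to workers so total token counts per worker are similar.
def balance(seq_lens, n_workers):
    """Greedy longest-first assignment; returns one list of sample indices per worker."""
    buckets = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    for idx in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        w = loads.index(min(loads))      # least-loaded worker so far
        buckets[w].append(idx)
        loads[w] += seq_lens[idx]
    return buckets

# Example: balance([900, 850, 120, 100, 95, 90], n_workers=2)
# -> roughly equal total lengths per worker.
```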
#### Step 4: Reference Policy Log Probabilities (Optional)

- If a reference policy is used, its log probabilities for the generated response are computed:

```python
if self.use_reference_policy:
    ref_log_prob = self.ref_policy_wg.compute_ref_log_prob(batch)
    batch = batch.union(ref_log_prob)
```

Example log probabilities:

```python
ref_log_prob = {
    'ref_log_prob': torch.tensor([-2.7, -1.9, ..., -3.0])
}
```
#### Step 5: Value Estimation

- The critic network estimates state values over the prompt + response.

```python
if self.use_critic:
    values = self.critic_wg.compute_values(batch)
    batch = batch.union(values)
```

Example estimated values:

```python
values = {
    'values': torch.tensor([-0.1, 0.2, ..., 0.5])
}
```
#### Step 6: Reward Calculation and Advantage Estimation

- Computes rewards using the reward function:

```python
reward_tensor = self.reward_fn(batch)
batch.batch['token_level_scores'] = reward_tensor
```

Example rewards:

```python
reward_tensor = torch.tensor([0.1, 0.3, 0.5, ..., 0.8])
```

- Applies the KL penalty (if needed):

```python
if not self.config.actor_rollout_ref.actor.use_kl_loss:
    batch, kl_metrics = apply_kl_penalty(batch, kl_ctrl=self.kl_ctrl, kl_penalty=self.config.algorithm.kl_penalty)
    metrics.update(kl_metrics)
else:
    batch.batch['token_level_rewards'] = batch.batch['token_level_scores']
```

- Computes the advantage function (GAE or GRPO); a GAE sketch follows this step.

```python
batch = compute_advantage(batch, ...)
```

Example computed values:

```python
batch.batch['advantages'] = torch.tensor([0.05, 0.1, ..., 0.2])   # example advantages
batch.batch['returns'] = torch.tensor([0.15, 0.4, ..., 1.0])      # example returns
```
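For reference, here is a compact sketch of Generalized Advantage Estimation over per-token rewards and values. It is a simplified version of what `compute_gae_advantage_return` computes, with batching and masking omitted:

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 1.0):
    """rewards, values: shape (T,) for one response; returns (advantages, returns)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0   # bootstrap with 0 after the last token
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```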
#### Step 7: Critic Update

- The critic network is updated based on the returns.

```python
if self.use_critic:
    critic_output = self.critic_wg.update_critic(batch)
```

#### Step 8: Actor Update

- The policy network (actor) is updated (a sketch of the PPO objective follows this step).

```python
if self.config.trainer.critic_warmup <= self.global_steps:
    actor_output = self.actor_rollout_wg.update_actor(batch)
```
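The actor update minimizes a clipped PPO surrogate objective. A minimal sketch of that loss (not verl's `update_actor`, which additionally handles masking, optional KL loss, and micro-batching):

```python
import torch

def ppo_policy_loss(log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                    advantages: torch.Tensor, clip_ratio: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate; all tensors share the same (token-level) shape."""
    ratio = torch.exp(log_probs - old_log_probs)                          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    return -torch.min(unclipped, clipped).mean()                          # negative surrogate to minimize
```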
#### Step 9: Metrics and Logging

- Logs key metrics (policy loss, value loss, KL divergence).

#### Step 10: Checkpointing (Periodic)

- Saves model weights periodically (every `trainer.save_freq` steps).
## Summary

- Script Execution: loads configurations and starts the PPO training loop.
- Data Loading: converts input JSON into tokenized tensors.
- Training Loop:
  - Data Preparation: extracts the required tensors for training.
  - Response Generation: uses the actor network.
  - Reference Policy Check (Optional): computes reference log probabilities.
  - Value Estimation: uses the critic network.
  - Reward Calculation & Advantage Computation.
  - Critic and Actor Updates.
  - Logging and Checkpointing.
## Final Notes

- The entire process repeats over many batches and epochs.
- Over time, the model learns to generate better responses by maximizing rewards and improving the policy.

🚀 This structured approach ensures stability and performance in PPO training. 🎯
## Does the Code Use Group Relative Policy Optimization (GRPO) or Classic PPO?

### Summary

The `scripts/train_tiny_zero.sh` script is configured to use classic PPO (Proximal Policy Optimization) with Generalized Advantage Estimation (GAE), not Group Relative Policy Optimization (GRPO).

### Analysis of the Code

The `RayPPOTrainer` class in `verl/trainer/ppo/ray_trainer.py` determines whether to use GAE (classic PPO) or GRPO based on the configuration parameter `algorithm.adv_estimator`.
### Key Findings

1. Default Configuration (`verl/trainer/config/ppo_trainer.yaml`)
   - The parameter `algorithm.adv_estimator` is set to `"gae"` by default.
   - The script `train_tiny_zero.sh` does not override this setting.
   - Therefore, the default behavior is classic PPO with GAE.
2. Additional GRPO-Related Parameters
   - `actor_rollout_ref.rollout.n: 1` (comment in config: `# > 1 for grpo`); `train_tiny_zero.sh` does not override this.
   - `actor_rollout_ref.actor.use_kl_loss: False` (comment in config: `# True for GRPO`); `train_tiny_zero.sh` does not override this.
3. How the Decision Is Made in Code
   - The function `compute_advantage` is called in `RayPPOTrainer.fit()`.
   - `compute_advantage` calls either:
     - `compute_gae_advantage_return` (used in classic PPO with GAE), or
     - `compute_grpo_outcome_advantage` (used for GRPO).
   - The choice is determined by `algorithm.adv_estimator`.
   - Since the default setting is `"gae"`, the script uses classic PPO.
### Conclusion

Although the codebase supports GRPO, the `scripts/train_tiny_zero.sh` script does not enable it. To enable GRPO, the configuration would need to be modified along these lines:

```yaml
algorithm:
  adv_estimator: grpo    # change from 'gae' to 'grpo'
actor_rollout_ref:
  rollout:
    n: 4                 # example: set to any value greater than 1 (group size)
  actor:
    use_kl_loss: True    # enable KL loss
```
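For intuition, the group-relative (outcome) advantage behind GRPO can be sketched as follows: each prompt is sampled `n` times, and each sample's scalar reward is normalized by its group's mean and standard deviation. This is a simplified take, not verl's `compute_grpo_outcome_advantage`:

```python
import torch

def grpo_outcome_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6):
    """rewards: shape (num_prompts * group_size,), grouped contiguously per prompt."""
    grouped = rewards.view(-1, group_size)            # (num_prompts, n)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    advantages = (grouped - mean) / (std + eps)       # z-score within each group
    return advantages.view(-1)                        # back to flat shape

# Example: two prompts, n = 3 samples each.
# grpo_outcome_advantages(torch.tensor([1.0, 0.0, 0.0, 0.5, 0.5, 1.0]), group_size=3)
```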
🚀 As provided, TinyZero uses Classic PPO, not GRPO. 🎯