train_tiny_zero.sh

Explain the command below in detail, including what each option does:

The script scripts/train_tiny_zero.sh trains a model with the PPO (Proximal Policy Optimization) algorithm. It runs the verl.trainer.main_ppo module with the following configuration overrides:

python3 -m verl.trainer.main_ppo \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=256 \
data.val_batch_size=1312 \
data.max_prompt_length=256 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size=8 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
critic.optim.lr=1e-5 \
critic.model.path=$BASE_MODEL \
critic.ppo_micro_batch_size=8 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['wandb'] \
trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=100 \
trainer.test_freq=100 \
trainer.project_name=TinyZero \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log

Here's a detailed breakdown of the command and its components:


Base Command

python3 -m verl.trainer.main_ppo
  • Purpose: Runs the Proximal Policy Optimization (PPO) training script from the verl library.
  • -m: Executes the specified module (verl.trainer.main_ppo) as a script.

Data Configuration

| Parameter | Value | Description |
|---|---|---|
| data.train_files | $DATA_DIR/train.parquet | Path to the training data (Parquet format). |
| data.val_files | $DATA_DIR/test.parquet | Path to the validation/evaluation data. |
| data.train_batch_size | 256 | Batch size for training. |
| data.val_batch_size | 1312 | Batch size for validation. |
| data.max_prompt_length | 256 | Prompts are truncated to this many tokens. |
| data.max_response_length | 1024 | Responses are truncated to this many tokens. |
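
With these settings, a full training sequence (prompt plus generated response) can be at most 256 + 1024 = 1280 tokens long.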

Actor/Rollout Configuration

| Parameter | Value | Description |
|---|---|---|
| actor_rollout_ref.model.path | $BASE_MODEL | Path to the base model (e.g., a HuggingFace model) used for the actor. |
| actor_rollout_ref.actor.optim.lr | 1e-6 | Learning rate for the actor's optimizer. |
| actor_rollout_ref.actor.ppo_mini_batch_size | 128 | Mini-batch size for PPO updates. |
| actor_rollout_ref.actor.ppo_micro_batch_size | 8 | Micro-batch size for gradient accumulation. |
| actor_rollout_ref.rollout.log_prob_micro_batch_size | 8 | Batch size for computing log probabilities during rollouts. |
| actor_rollout_ref.rollout.tensor_model_parallel_size | $ROLLOUT_TP_SIZE | Tensor-parallelism degree for distributed inference. |
| actor_rollout_ref.rollout.gpu_memory_utilization | 0.4 | Fraction of GPU memory the rollout engine may use (helps prevent OOM). |
| actor_rollout_ref.ref.log_prob_micro_batch_size | 4 | Batch size for reference-model log-probability computation. |

Critic Configuration

| Parameter | Value | Description |
|---|---|---|
| critic.optim.lr | 1e-5 | Learning rate for the critic's optimizer. |
| critic.model.path | $BASE_MODEL | Base model for the critic (often the same as the actor). |
| critic.ppo_micro_batch_size | 8 | Micro-batch size for critic updates. |

Algorithm Configuration

| Parameter | Value | Description |
|---|---|---|
| algorithm.kl_ctrl.kl_coef | 0.001 | Coefficient of the KL-divergence penalty (controls deviation from the reference policy). |
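
To make the effect of kl_coef concrete, below is a minimal sketch of a per-token KL penalty folded into the reward, in the spirit of the apply_kl_penalty step described in the walkthrough further down. The function name and the simple log-prob-difference KL estimate are illustrative assumptions, not verl's exact implementation.

import torch

def apply_kl_penalty_sketch(token_level_scores: torch.Tensor,
                            old_log_probs: torch.Tensor,
                            ref_log_prob: torch.Tensor,
                            kl_coef: float = 0.001) -> torch.Tensor:
    """All tensors have shape (batch, response_len).

    Uses the per-token log-prob difference as a simple KL estimate and
    subtracts it, scaled by kl_coef, from the token-level scores.
    """
    kl = old_log_probs - ref_log_prob        # per-token KL estimate vs. reference policy
    return token_level_scores - kl_coef * kl  # penalized token-level rewards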

Training Infrastructure & Logging

| Parameter | Value | Description |
|---|---|---|
| trainer.logger | ['wandb'] | Log metrics to Weights & Biases (W&B). |
| trainer.val_before_train | False | Skip validation before training starts. |
| trainer.default_hdfs_dir | null | Disable HDFS storage (use local storage, e.g., on NERSC). |
| trainer.n_gpus_per_node | $N_GPUS | Number of GPUs per node (e.g., 4 on Perlmutter). |
| trainer.nnodes | 1 | Number of nodes (one Perlmutter node has 4 A100 GPUs). |
| trainer.save_freq | 100 | Save a checkpoint every 100 steps. |
| trainer.test_freq | 100 | Run validation every 100 steps. |
| trainer.project_name | TinyZero | W&B project name. |
| trainer.experiment_name | $EXPERIMENT_NAME | Unique identifier for this run. |
| trainer.total_epochs | 15 | Total number of training epochs. |

Logging & Output

2>&1 | tee verl_demo.log
  • 2>&1: Redirects stderr to stdout.
  • tee verl_demo.log: Saves all output to verl_demo.log while printing to the terminal.

Key Concepts

  1. PPO Workflow:

    • Actor: Generates actions (responses) during rollouts.
    • Critic: Evaluates the value of states (responses).
    • KL Control: Penalizes large deviations from the reference policy to stabilize training.
  2. Batch Sizes:

    • Micro-batch: Smaller chunks processed per forward/backward pass; gradients are accumulated across them to save memory.
    • Mini-batch: Subsets of the full training batch, one per PPO optimizer step (see the sketch after this list).
  3. Distributed Training:

    • tensor_model_parallel_size: Splits the model across GPUs (e.g., for large models).
  4. Checkpointing:

    • save_freq=100: Frequent saves allow resuming from interruptions.
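
As a rough sketch of how the three batch sizes relate, using the values from the command above; the loop structure below is purely illustrative (the real trainer also shards work across GPUs):

train_batch_size = 256         # prompts collected per PPO iteration (data.train_batch_size)
ppo_mini_batch_size = 128      # one optimizer step per mini-batch
ppo_micro_batch_size = 8       # chunk processed per forward/backward pass

num_mini_batches = train_batch_size // ppo_mini_batch_size        # 2
grad_accum_steps = ppo_mini_batch_size // ppo_micro_batch_size    # 16

for _ in range(num_mini_batches):
    for _ in range(grad_accum_steps):
        pass  # forward + backward on 8 samples; gradients accumulate
    # optimizer.step(); optimizer.zero_grad()  -> one PPO update per mini-batch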

Example Workflow on Perlmutter

# Set environment variables
export DATA_DIR=/global/cscratch1/sd/yourname/data
export BASE_MODEL=EleutherAI/pythia-1.4b
export ROLLOUT_TP_SIZE=2
export N_GPUS=4
export EXPERIMENT_NAME=run1

# Submit job
sbatch job_script.sh

For more details, see the step-by-step walkthrough below.

Step-by-Step Execution of scripts/train_tiny_zero.sh

Overview

The script scripts/train_tiny_zero.sh runs the verl.trainer.main_ppo module with a set of configuration parameters. These parameters define data paths, model paths, batch sizes, learning rates, and other training settings. Environment variables like $DATA_DIR, $BASE_MODEL, etc., are used to parameterize the script.


1. Script Execution and Configuration

When executed, the script:

  1. Loads configuration settings (model, dataset, training hyperparameters).
  2. Starts the PPO training loop using RayPPOTrainer.fit().
  3. Iterates over the training data, updating the policy and value networks.

2. Data Loading

The dataset is loaded using RLHFDataset (in verl/utils/dataset/rl_dataset.py).
Example raw data entry before processing by __getitem__:

{
  "prompt": [{"role": "user", "content": "Explain the theory of relativity."}],
  "response": "...",
  "data_source": "some_dataset",
  "extra_info": {"index": 123, "some_other_info": "value"}
}

Processed Batch (After collate_fn)

After tokenization and processing, a batch (with batch size = 1 for simplicity) looks like:

batch_dict = {
  'input_ids': torch.tensor([...]),                   # token IDs of the tokenized prompt
  'attention_mask': torch.tensor([1, 1, 1, ..., 1]),  # attention mask
  'position_ids': torch.tensor([0, 1, 2, ...]),       # position IDs
  'response': '...',
  'data_source': 'some_dataset',
  'raw_prompt': [{'role': 'user', 'content': 'Explain the theory of relativity.'}],
  'index': 123,
  'extra_info': {'index': 123, 'some_other_info': 'value'}
}
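
For intuition, here is a minimal sketch of the tokenization that produces these tensors, assuming a standard Hugging Face tokenizer and reusing the example base model from the workflow above. The real RLHFDataset also handles chat templating, padding, and truncation to max_prompt_length.

import torch
from transformers import AutoTokenizer

# illustrative base model, reusing the example from the Perlmutter workflow above
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")

raw = {"prompt": [{"role": "user", "content": "Explain the theory of relativity."}]}
prompt_text = raw["prompt"][0]["content"]

enc = tokenizer(prompt_text, max_length=256, truncation=True, return_tensors="pt")
input_ids = enc["input_ids"]                                   # (1, prompt_len)
attention_mask = enc["attention_mask"]                         # (1, prompt_len)
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)   # (1, prompt_len)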

3. Training Loop (RayPPOTrainer.fit())

The fit() method contains the main PPO training loop.

Step 1: Data Preparation

  • Converts batch_dict into a DataProto object.
  • Extracts generation-related inputs.
batch: DataProto = DataProto.from_single_dict(batch_dict)
gen_batch = batch.pop(batch_keys=['input_ids', 'attention_mask', 'position_ids'])

Now, gen_batch contains only the data needed for response generation:

gen_batch = {
 'input_ids': torch.tensor([...]),                   # token IDs of the prompt
 'attention_mask': torch.tensor([1, 1, 1, ..., 1]),
 'position_ids': torch.tensor([0, 1, 2, ...])
}

Step 2: Generation (Rollout)

  • The actor network generates a response to the prompt.
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)

Assume the generated response is "General relativity explains gravity...".
The output (gen_batch_output) contains:

gen_batch_output = {
    'responses': torch.tensor([...]),                        # token IDs of the generated response
    'old_log_probs': torch.tensor([-2.5, -1.8, ..., -3.2]),  # log probabilities
    'attention_mask': torch.tensor([1, 1, 1, ..., 0, 0])     # attention mask (trailing padding)
}
  • The batch is updated with new response data:
batch.batch['uid'] = np.array([str(uuid.uuid4()) for _ in range(len(batch.batch))], dtype=object)
batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
batch = batch.union(gen_batch_output)
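
(With this script's default actor_rollout_ref.rollout.n = 1, the repeat call is effectively a no-op; it only matters when multiple responses are sampled per prompt, e.g., for GRPO.)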

Step 3: Sequence Length Balancing

Balances batch sequence lengths across processes:

self._balance_batch(batch, metrics=metrics)

Step 4: Reference Policy Log Probabilities (Optional)

  • If a reference policy is used, its log probabilities for the generated response are computed:
if self.use_reference_policy:
    ref_log_prob = self.ref_policy_wg.compute_ref_log_prob(batch)
    batch = batch.union(ref_log_prob)

Example log probabilities:

ref_log_prob = {
    'ref_log_prob': torch.tensor([-2.7, -1.9, ..., -3.0])
}

Step 5: Value Estimation

  • The critic network estimates state values (prompt + response).
if self.use_critic:
    values = self.critic_wg.compute_values(batch)
    batch = batch.union(values)

Example estimated values:

values = {
    'values': torch.tensor([-0.1, 0.2, ..., 0.5])
}

Step 6: Reward Calculation and Advantage Estimation

  • Computes reward using the reward function.
reward_tensor = self.reward_fn(batch)
batch.batch['token_level_scores'] = reward_tensor

Example rewards:

reward_tensor = torch.tensor([0.1, 0.3, 0.5, ..., 0.8])
  • Applies KL penalty (if needed):
if not self.config.actor_rollout_ref.actor.use_kl_loss:
    batch, kl_metrics = apply_kl_penalty(batch, kl_ctrl=self.kl_ctrl, kl_penalty=self.config.algorithm.kl_penalty)
    metrics.update(kl_metrics)
else:
    batch.batch['token_level_rewards'] = batch.batch['token_level_scores']
  • Computes advantage function (GAE or GRPO):
batch = compute_advantage(batch, ...)

Example computed values:

batch.batch['advantages'] = torch.tensor([0.05, 0.1, ..., 0.2])  # example advantages
batch.batch['returns'] = torch.tensor([0.15, 0.4, ..., 1.0])     # example returns
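
For the GAE path this script uses, here is a minimal sketch of how advantages and returns are derived from token-level rewards and critic values. It is only the textbook GAE recurrence, not verl's compute_gae_advantage_return, which additionally handles response masks and other details.

import torch

def gae_sketch(rewards: torch.Tensor, values: torch.Tensor,
               gamma: float = 1.0, lam: float = 1.0):
    """rewards, values: per-token tensors of shape (batch, T). Returns (advantages, returns)."""
    T = rewards.shape[1]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros(rewards.shape[0])
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros_like(values[:, t])
        delta = rewards[:, t] + gamma * next_value - values[:, t]   # TD residual
        last_gae = delta + gamma * lam * last_gae                   # GAE recurrence
        advantages[:, t] = last_gae
    returns = advantages + values
    return advantages, returns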

Step 7: Critic Update

  • The critic network updates based on returns.
if self.use_critic:
    critic_output = self.critic_wg.update_critic(batch)
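
A minimal sketch of a value-function loss behind update_critic, assuming a plain mean-squared error against the returns from Step 6 (verl's actual critic loss may also apply clipping and response masking):

import torch
import torch.nn.functional as F

def critic_loss_sketch(values_pred: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """values_pred, returns: shape (batch, T). Regress predicted values onto returns."""
    return F.mse_loss(values_pred, returns)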

Step 8: Actor Update

  • The policy network (actor) is updated.
if self.config.trainer.critic_warmup <= self.global_steps:
    actor_output = self.actor_rollout_wg.update_actor(batch)
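
A minimal sketch of the PPO clipped surrogate objective behind update_actor, in terms of the tensors seen above: old_log_probs from the rollout, log_probs recomputed by the current policy, and advantages from Step 6. The 0.2 clip range is a typical default used here for illustration, not necessarily this script's setting.

import torch

def ppo_policy_loss_sketch(log_probs: torch.Tensor,
                           old_log_probs: torch.Tensor,
                           advantages: torch.Tensor,
                           clip_ratio: float = 0.2) -> torch.Tensor:
    """All tensors have shape (batch, T); returns a scalar loss to minimize."""
    ratio = torch.exp(log_probs - old_log_probs)                     # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # PPO maximizes the clipped surrogate, so we minimize its negation
    return -torch.min(ratio * advantages, clipped * advantages).mean()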

Step 9: Metrics and Logging

  • Logs key metrics (policy loss, value loss, KL divergence).

Step 10: Checkpointing (Periodically)

  • Saves model weights periodically.

Summary

  1. Script Execution loads configurations and starts the PPO training loop.
  2. Data Loading converts input JSON into tokenized tensors.
  3. Training Loop:
    • Data Preparation: Extracts required tensors for training.
    • Response Generation: Uses actor network.
    • Reference Policy Check (Optional): Computes log probabilities.
    • Value Estimation: Uses critic network.
    • Reward Calculation & Advantage Computation.
    • Critic and Actor Updates.
    • Logging and Checkpointing.

Final Notes

  • The entire process repeats for multiple batches and epochs.
  • Over time, the model learns to generate better responses by maximizing rewards and improving the policy.

🚀 This structured approach ensures stability and performance in PPO training. 🎯

Does the Code Use Group Relative Policy Optimization (GRPO) or Classic PPO?

Summary

The scripts/train_tiny_zero.sh script is configured to use classic PPO (Proximal Policy Optimization) with Generalized Advantage Estimation (GAE), not Group Relative Policy Optimization (GRPO).

Analysis of the Code

The RayPPOTrainer class in verl/trainer/ppo/ray_trainer.py determines whether to use GAE (classic PPO) or GRPO based on the configuration parameter algorithm.adv_estimator.

Key Findings:

  1. Default Configuration (verl/trainer/config/ppo_trainer.yaml)

    • The parameter algorithm.adv_estimator is set to "gae" by default.
    • The script train_tiny_zero.sh does not override this setting.
    • Therefore, the default behavior is classic PPO with GAE.
  2. Additional GRPO-Related Parameters

    • actor_rollout_ref.rollout.n: 1
      • Comment in config: # > 1 for grpo
      • train_tiny_zero.sh does not override this.
    • actor_rollout_ref.actor.use_kl_loss: False
      • Comment in config: # True for GRPO
      • train_tiny_zero.sh does not override this.
  3. How the Decision is Made in Code

    • The function compute_advantage is called in RayPPOTrainer.fit().
    • compute_advantage calls either:
      • compute_gae_advantage_return (used in classic PPO with GAE).
      • compute_grpo_outcome_advantage (used for GRPO).
    • The choice is determined by algorithm.adv_estimator.
    • Since the default setting is "gae", the script uses classic PPO (see the sketch below).
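
Schematically, the dispatch reads like the sketch below; gae_fn and grpo_fn are stand-ins for verl's compute_gae_advantage_return and compute_grpo_outcome_advantage.

def compute_advantage_sketch(batch, adv_estimator, gae_fn, grpo_fn):
    """Dispatch on algorithm.adv_estimator, mirroring the logic described above."""
    if adv_estimator == "gae":
        return gae_fn(batch)    # classic PPO: GAE over rewards and critic values
    if adv_estimator == "grpo":
        return grpo_fn(batch)   # GRPO: group-relative outcome advantages (no critic)
    raise NotImplementedError(f"unknown adv_estimator: {adv_estimator}")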

Conclusion

Although the codebase supports GRPO, the scripts/train_tiny_zero.sh script does not enable it.
To enable GRPO, the script (or the underlying config) would need to be changed along these lines:

algorithm:
  adv_estimator: grpo   # change from 'gae' to 'grpo'
actor_rollout_ref:
  rollout:
    n: 4                # any value greater than 1 (number of responses sampled per prompt)
  actor:
    use_kl_loss: True   # enable the KL loss

Equivalently, these settings could be passed as command-line overrides in the same style as the script's existing options (e.g., algorithm.adv_estimator=grpo).

🚀 As provided, TinyZero uses Classic PPO, not GRPO. 🎯