# train_tiny_zero.sh
Explain the command below in detail, including what each option does:

The script `scripts/train_tiny_zero.sh` trains a model using the PPO (Proximal Policy Optimization) algorithm. It executes the `verl.trainer.main_ppo` module with the following command:
```bash
python3 -m verl.trainer.main_ppo \
    data.train_files=$DATA_DIR/train.parquet \
    data.val_files=$DATA_DIR/test.parquet \
    data.train_batch_size=256 \
    data.val_batch_size=1312 \
    data.max_prompt_length=256 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=$BASE_MODEL \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size=8 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=$BASE_MODEL \
    critic.ppo_micro_batch_size=8 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=['wandb'] \
    trainer.val_before_train=False \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=$N_GPUS \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=100 \
    trainer.project_name=TinyZero \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log
```
Here's a detailed breakdown of the command and its components:

## Base Command

```bash
python3 -m verl.trainer.main_ppo
```

- Purpose: Runs the Proximal Policy Optimization (PPO) training entry point from the `verl` library.
- `-m`: Executes the specified module (`verl.trainer.main_ppo`) as a script.
## Data Configuration

| Parameter | Value | Description |
|---|---|---|
| `data.train_files` | `$DATA_DIR/train.parquet` | Path to training data (Parquet format). |
| `data.val_files` | `$DATA_DIR/test.parquet` | Path to validation/evaluation data. |
| `data.train_batch_size` | `256` | Batch size for training. |
| `data.val_batch_size` | `1312` | Batch size for validation. |
| `data.max_prompt_length` | `256` | Truncate prompts to this many tokens. |
| `data.max_response_length` | `1024` | Truncate responses to this many tokens. |
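To make the data options concrete, here is a hypothetical sketch of reading a Parquet prompt file and truncating prompts to `data.max_prompt_length` tokens. verl's `RLHFDataset` handles this internally and differently, so treat the function name `load_prompt_ids` and the details below as illustrative only:

```python
# Hypothetical sketch (not verl's RLHFDataset): read $DATA_DIR/train.parquet
# and tokenize chat-format prompts, keeping at most MAX_PROMPT_LENGTH tokens.
import pandas as pd
from transformers import AutoTokenizer

MAX_PROMPT_LENGTH = 256    # data.max_prompt_length

def load_prompt_ids(parquet_path: str, model_name: str):
    df = pd.read_parquet(parquet_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    prompt_ids = []
    for messages in df["prompt"]:
        # Render the chat-format prompt to text, then tokenize and left-truncate
        # so the most recent tokens stay within the prompt budget.
        text = tokenizer.apply_chat_template(list(messages), tokenize=False,
                                             add_generation_prompt=True)
        ids = tokenizer(text)["input_ids"]
        prompt_ids.append(ids[-MAX_PROMPT_LENGTH:])
    return prompt_ids
```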
## Actor/Rollout Configuration

| Parameter | Value | Description |
|---|---|---|
| `actor_rollout_ref.model.path` | `$BASE_MODEL` | Path to the base model (e.g., a Hugging Face model) used for the actor. |
| `actor_rollout_ref.actor.optim.lr` | `1e-6` | Learning rate for the actor's optimizer. |
| `actor_rollout_ref.actor.ppo_mini_batch_size` | `128` | Mini-batch size for PPO updates. |
| `actor_rollout_ref.actor.ppo_micro_batch_size` | `8` | Micro-batch size for gradient accumulation. |
| `actor_rollout_ref.rollout.log_prob_micro_batch_size` | `8` | Batch size for computing log probabilities during rollouts. |
| `actor_rollout_ref.rollout.tensor_model_parallel_size` | `$ROLLOUT_TP_SIZE` | Tensor parallelism degree for distributed inference. |
| `actor_rollout_ref.rollout.gpu_memory_utilization` | `0.4` | Fraction of GPU memory reserved for rollouts (prevents OOM). |
| `actor_rollout_ref.ref.log_prob_micro_batch_size` | `4` | Batch size for reference model log probability calculations. |
## Critic Configuration

| Parameter | Value | Description |
|---|---|---|
| `critic.optim.lr` | `1e-5` | Learning rate for the critic's optimizer. |
| `critic.model.path` | `$BASE_MODEL` | Base model for the critic (often the same as the actor). |
| `critic.ppo_micro_batch_size` | `8` | Micro-batch size for critic updates. |
## Algorithm Configuration

| Parameter | Value | Description |
|---|---|---|
| `algorithm.kl_ctrl.kl_coef` | `0.001` | Coefficient for the KL divergence penalty (controls deviation from the reference policy). |
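As a rough illustration of what the KL coefficient does, the sketch below folds a per-token KL estimate, scaled by `kl_coef`, into token-level rewards. It is a simplified stand-in for verl's `apply_kl_penalty`, and the function name is illustrative:

```python
import torch

def kl_penalized_rewards(token_scores: torch.Tensor,
                         log_probs: torch.Tensor,
                         ref_log_probs: torch.Tensor,
                         kl_coef: float = 0.001) -> torch.Tensor:
    """Subtract a per-token KL estimate (scaled by kl_coef) from raw scores.

    token_scores, log_probs, ref_log_probs: shape (batch, response_len).
    Uses the simple log-ratio estimator log pi(a|s) - log pi_ref(a|s).
    """
    kl = log_probs - ref_log_probs          # per-token KL estimate
    return token_scores - kl_coef * kl      # penalized token-level rewards
```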
## Training Infrastructure & Logging

| Parameter | Value | Description |
|---|---|---|
| `trainer.logger` | `['wandb']` | Log metrics to Weights & Biases (W&B). |
| `trainer.val_before_train` | `False` | Skip validation before starting training. |
| `trainer.default_hdfs_dir` | `null` | Disable HDFS storage (use local/NERSC storage instead). |
| `trainer.n_gpus_per_node` | `$N_GPUS` | Number of GPUs per node (e.g., 4 on Perlmutter). |
| `trainer.nnodes` | `1` | Number of nodes (1 node = 4 A100 GPUs on Perlmutter). |
| `trainer.save_freq` | `100` | Save a checkpoint every 100 steps. |
| `trainer.test_freq` | `100` | Run validation every 100 steps. |
| `trainer.project_name` | `TinyZero` | W&B project name. |
| `trainer.experiment_name` | `$EXPERIMENT_NAME` | Unique identifier for this run. |
| `trainer.total_epochs` | `15` | Total number of training epochs. |
## Logging & Output

```bash
2>&1 | tee verl_demo.log
```

- `2>&1`: Redirects `stderr` to `stdout`.
- `tee verl_demo.log`: Saves all output to `verl_demo.log` while still printing it to the terminal.
## Key Concepts

- PPO Workflow:
  - Actor: Generates actions (responses) during rollouts.
  - Critic: Evaluates the value of states (prompt + response).
  - KL Control: Penalizes large deviations from the reference policy to stabilize training.
- Batch Sizes (see the sketch after this list):
  - Micro-batch: Smaller batches used for gradient accumulation (memory optimization).
  - Mini-batch: Subsets of the full batch used for each PPO update.
- Distributed Training:
  - `tensor_model_parallel_size`: Splits the model across GPUs (e.g., for large models).
- Checkpointing:
  - `save_freq=100`: Frequent saves allow resuming after interruptions.
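A simplified illustration of how the three batch sizes interact (this is not verl's actual update loop, and `compute_ppo_loss` is a hypothetical helper): the global batch of 256 samples is split into PPO mini-batches of 128, and each optimizer step accumulates gradients over micro-batches of 8 so only 8 samples sit on the GPU at a time.

```python
# Simplified illustration of train_batch_size / ppo_mini_batch_size /
# ppo_micro_batch_size. Not verl's code; compute_ppo_loss is hypothetical.
TRAIN_BATCH_SIZE = 256     # data.train_batch_size
PPO_MINI_BATCH = 128       # actor_rollout_ref.actor.ppo_mini_batch_size
PPO_MICRO_BATCH = 8        # actor_rollout_ref.actor.ppo_micro_batch_size

def ppo_update(policy, optimizer, global_batch):
    assert len(global_batch) == TRAIN_BATCH_SIZE
    # One optimizer step per mini-batch...
    for i in range(0, TRAIN_BATCH_SIZE, PPO_MINI_BATCH):
        mini_batch = global_batch[i:i + PPO_MINI_BATCH]
        optimizer.zero_grad()
        # ...with gradients accumulated over micro-batches to save memory.
        num_micro = PPO_MINI_BATCH // PPO_MICRO_BATCH
        for j in range(0, PPO_MINI_BATCH, PPO_MICRO_BATCH):
            micro_batch = mini_batch[j:j + PPO_MICRO_BATCH]
            loss = policy.compute_ppo_loss(micro_batch)   # hypothetical helper
            (loss / num_micro).backward()
        optimizer.step()
```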
## Example Workflow on Perlmutter

```bash
# Set environment variables
export DATA_DIR=/global/cscratch1/sd/yourname/data
export BASE_MODEL=EleutherAI/pythia-1.4b
export ROLLOUT_TP_SIZE=2
export N_GPUS=4
export EXPERIMENT_NAME=run1

# Submit job
sbatch job_script.sh
```

For more details, refer to `scripts/train_tiny_zero.sh`.
## Step-by-Step Execution Overview

The script `scripts/train_tiny_zero.sh` runs the `verl.trainer.main_ppo` module with a set of configuration parameters. These parameters define data paths, model paths, batch sizes, learning rates, and other training settings. Environment variables such as `$DATA_DIR` and `$BASE_MODEL` are used to parameterize the script.
### 1. Script Execution and Configuration

When executed, the script:

- Loads configuration settings (model, dataset, training hyperparameters).
- Starts the PPO training loop using `RayPPOTrainer.fit()`.
- Iterates over the training data, updating the policy and value networks.
### 2. Data Loading

The dataset is loaded using `RLHFDataset` (in `verl/utils/dataset/rl_dataset.py`).

Example raw data entry before processing by `__getitem__`:

```json
{
  "prompt": [{"role": "user", "content": "Explain the theory of relativity."}],
  "response": "...",
  "data_source": "some_dataset",
  "extra_info": {"index": 123, "some_other_info": "value"}
}
```
#### Processed Batch (After `collate_fn`)

After tokenization and processing, a batch (with batch size = 1 for simplicity) looks like:

```python
batch_dict = {
    'input_ids': torch.tensor([...token IDs for prompt...]),   # tokenized prompt
    'attention_mask': torch.tensor([1, 1, 1, ..., 1]),         # attention mask
    'position_ids': torch.tensor([0, 1, 2, ...]),              # position IDs
    'response': '...',
    'data_source': 'some_dataset',
    'raw_prompt': [{'role': 'user', 'content': 'Explain the theory of relativity.'}],
    'index': 123,
    'extra_info': {'index': 123, 'some_other_info': 'value'}
}
```
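For intuition, a hypothetical `collate_fn` along these lines would produce such a batch; verl's actual implementation differs in details (for example, how non-tensor fields are stored):

```python
import torch

def collate_fn(samples):
    """Hypothetical collate: stack tensor fields (assumed padded to equal length),
    keep non-tensor fields as Python lists."""
    batch = {}
    for key in samples[0]:
        values = [sample[key] for sample in samples]
        if isinstance(values[0], torch.Tensor):
            batch[key] = torch.stack(values)   # -> shape (batch_size, seq_len)
        else:
            batch[key] = values                # e.g. raw_prompt, data_source, extra_info
    return batch
```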
### 3. Training Loop (`RayPPOTrainer.fit()`)

The `fit()` method contains the main PPO training loop.
#### Step 1: Data Preparation

- Converts `batch_dict` into a `DataProto` object.
- Extracts generation-related inputs.

```python
batch: DataProto = DataProto.from_single_dict(batch_dict)
gen_batch = batch.pop(batch_keys=['input_ids', 'attention_mask', 'position_ids'])
```

Now `gen_batch` contains only the data needed for response generation:

```python
gen_batch = {
    'input_ids': torch.tensor([...token IDs for prompt...]),
    'attention_mask': torch.tensor([1, 1, 1, ..., 1]),
    'position_ids': torch.tensor([0, 1, 2, ...])
}
```
#### Step 2: Generation (Rollout)

- The actor network generates a response to the prompt.

```python
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
```

Assume the generated response is "General relativity explains gravity...". The output (`gen_batch_output`) contains:

```python
gen_batch_output = {
    'responses': torch.tensor([...token IDs for response...]),   # tokenized response
    'old_log_probs': torch.tensor([-2.5, -1.8, ..., -3.2]),      # log probabilities
    'attention_mask': torch.tensor([1, 1, 1, ..., 0, 0])         # attention mask
}
```

- The batch is updated with the new response data:

```python
batch.batch['uid'] = np.array([str(uuid.uuid4()) for _ in range(len(batch.batch))], dtype=object)
batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
batch = batch.union(gen_batch_output)
```
#### Step 3: Sequence Length Balancing

Balances batch sequence lengths across processes:

```python
self._balance_batch(batch, metrics=metrics)
```
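The idea behind this step can be illustrated with a toy greedy partitioner (this is not `_balance_batch` itself): samples are assigned to workers longest-first so every worker ends up with a similar total token count.

```python
# Toy illustration of sequence-length balancing (not verl's _balance_batch):
# assign samples to workers so total token counts per worker are similar.
def balance(seq_lens, n_workers):
    """Greedy longest-first assignment; returns one list of sample indices per worker."""
    buckets = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    for idx in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        w = loads.index(min(loads))      # least-loaded worker so far
        buckets[w].append(idx)
        loads[w] += seq_lens[idx]
    return buckets

# Example: balance([900, 850, 120, 100, 95, 90], n_workers=2)
# -> roughly equal total lengths per worker.
```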
#### Step 4: Reference Policy Log Probabilities (Optional)

- If a reference policy is used, its log probabilities for the generated response are computed:

```python
if self.use_reference_policy:
    ref_log_prob = self.ref_policy_wg.compute_ref_log_prob(batch)
    batch = batch.union(ref_log_prob)
```

Example log probabilities:

```python
ref_log_prob = {
    'ref_log_prob': torch.tensor([-2.7, -1.9, ..., -3.0])
}
```
#### Step 5: Value Estimation

- The critic network estimates state values over the prompt + response.

```python
if self.use_critic:
    values = self.critic_wg.compute_values(batch)
    batch = batch.union(values)
```

Example estimated values:

```python
values = {
    'values': torch.tensor([-0.1, 0.2, ..., 0.5])
}
```
#### Step 6: Reward Calculation and Advantage Estimation

- Computes rewards using the reward function:

```python
reward_tensor = self.reward_fn(batch)
batch.batch['token_level_scores'] = reward_tensor
```

Example rewards:

```python
reward_tensor = torch.tensor([0.1, 0.3, 0.5, ..., 0.8])
```

- Applies the KL penalty (if needed):

```python
if not self.config.actor_rollout_ref.actor.use_kl_loss:
    batch, kl_metrics = apply_kl_penalty(batch, kl_ctrl=self.kl_ctrl, kl_penalty=self.config.algorithm.kl_penalty)
    metrics.update(kl_metrics)
else:
    batch.batch['token_level_rewards'] = batch.batch['token_level_scores']
```

- Computes the advantage function (GAE or GRPO); a GAE sketch follows this step.

```python
batch = compute_advantage(batch, ...)
```

Example computed values:

```python
batch.batch['advantages'] = torch.tensor([0.05, 0.1, ..., 0.2])   # example advantages
batch.batch['returns'] = torch.tensor([0.15, 0.4, ..., 1.0])      # example returns
```
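For reference, here is a compact sketch of Generalized Advantage Estimation over per-token rewards and values. It is a simplified version of what `compute_gae_advantage_return` computes, with batching and masking omitted:

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 1.0):
    """rewards, values: shape (T,) for one response; returns (advantages, returns)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0   # bootstrap with 0 after the last token
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```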
#### Step 7: Critic Update

- The critic network is updated based on the returns.

```python
if self.use_critic:
    critic_output = self.critic_wg.update_critic(batch)
```

#### Step 8: Actor Update

- The policy network (actor) is updated (a sketch of the PPO objective follows this step).

```python
if self.config.trainer.critic_warmup <= self.global_steps:
    actor_output = self.actor_rollout_wg.update_actor(batch)
```
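The actor update minimizes a clipped PPO surrogate objective. A minimal sketch of that loss (not verl's `update_actor`, which additionally handles masking, optional KL loss, and micro-batching):

```python
import torch

def ppo_policy_loss(log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                    advantages: torch.Tensor, clip_ratio: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate; all tensors share the same (token-level) shape."""
    ratio = torch.exp(log_probs - old_log_probs)                          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    return -torch.min(unclipped, clipped).mean()                          # negative surrogate to minimize
```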
#### Step 9: Metrics and Logging

- Logs key metrics (policy loss, value loss, KL divergence).

#### Step 10: Checkpointing (Periodic)

- Saves model weights periodically (every `trainer.save_freq` steps).
## Summary

- Script Execution: loads configurations and starts the PPO training loop.
- Data Loading: converts input JSON into tokenized tensors.
- Training Loop:
  - Data Preparation: extracts the required tensors for training.
  - Response Generation: uses the actor network.
  - Reference Policy Check (Optional): computes reference log probabilities.
  - Value Estimation: uses the critic network.
  - Reward Calculation & Advantage Computation.
  - Critic and Actor Updates.
  - Logging and Checkpointing.
## Final Notes

- The entire process repeats over many batches and epochs.
- Over time, the model learns to generate better responses by maximizing rewards and improving the policy.

🚀 This structured approach ensures stability and performance in PPO training. 🎯
## Does the Code Use Group Relative Policy Optimization (GRPO) or Classic PPO?

### Summary

The `scripts/train_tiny_zero.sh` script is configured to use classic PPO (Proximal Policy Optimization) with Generalized Advantage Estimation (GAE), not Group Relative Policy Optimization (GRPO).

### Analysis of the Code

The `RayPPOTrainer` class in `verl/trainer/ppo/ray_trainer.py` determines whether to use GAE (classic PPO) or GRPO based on the configuration parameter `algorithm.adv_estimator`.
### Key Findings

1. Default Configuration (`verl/trainer/config/ppo_trainer.yaml`)
   - The parameter `algorithm.adv_estimator` is set to `"gae"` by default.
   - The script `train_tiny_zero.sh` does not override this setting.
   - Therefore, the default behavior is classic PPO with GAE.
2. Additional GRPO-Related Parameters
   - `actor_rollout_ref.rollout.n: 1` (comment in config: `# > 1 for grpo`); `train_tiny_zero.sh` does not override this.
   - `actor_rollout_ref.actor.use_kl_loss: False` (comment in config: `# True for GRPO`); `train_tiny_zero.sh` does not override this.
3. How the Decision Is Made in Code
   - The function `compute_advantage` is called in `RayPPOTrainer.fit()`.
   - `compute_advantage` calls either:
     - `compute_gae_advantage_return` (used in classic PPO with GAE), or
     - `compute_grpo_outcome_advantage` (used for GRPO).
   - The choice is determined by `algorithm.adv_estimator`.
   - Since the default setting is `"gae"`, the script uses classic PPO.
### Conclusion

Although the codebase supports GRPO, the `scripts/train_tiny_zero.sh` script does not enable it. To enable GRPO, the configuration would need to be modified along these lines:

```yaml
algorithm:
  adv_estimator: grpo    # change from 'gae' to 'grpo'
actor_rollout_ref:
  rollout:
    n: 4                 # example: set to any value greater than 1 (group size)
  actor:
    use_kl_loss: True    # enable KL loss
```
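For intuition, the group-relative (outcome) advantage behind GRPO can be sketched as follows: each prompt is sampled `n` times, and each sample's scalar reward is normalized by its group's mean and standard deviation. This is a simplified take, not verl's `compute_grpo_outcome_advantage`:

```python
import torch

def grpo_outcome_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6):
    """rewards: shape (num_prompts * group_size,), grouped contiguously per prompt."""
    grouped = rewards.view(-1, group_size)            # (num_prompts, n)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    advantages = (grouped - mean) / (std + eps)       # z-score within each group
    return advantages.view(-1)                        # back to flat shape

# Example: two prompts, n = 3 samples each.
# grpo_outcome_advantages(torch.tensor([1.0, 0.0, 0.0, 0.5, 0.5, 1.0]), group_size=3)
```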
🚀 As provided, TinyZero uses Classic PPO, not GRPO. 🎯