TinyZero

Goals

  • How does it work? https://github.com/Jiayi-Pan/TinyZero
    • veRL: Volcano Engine Reinforcement Learning for LLMs
    • TODO: submit a 10-hour Perlmutter job (currently limited to 4 hours, 4 GPUs; output in log.txt)
    • TODO: if it finishes, where are the outputs stored?
    • TODO: checkpointing and resuming
    • TODO: how is the reinforcement learning problem set up (agent, environment, reward, actions, etc.)?
  • How to adapt it to my own problems: code translation, code optimization, documentation generation, etc.?
  • Do not use conda: the environment disappeared after re-login on Perlmutter (NERSC).
    • Use venv instead!

Basics

reinforcement learning, Ray, veRL

  • veRL:training — veRL implements PPO for training LLMs
  • TinyZero:dataset

veRL implements PPO (Proximal Policy Optimization) for training Large Language Models (LLMs) with the following key RL components:

1. Action Space

  • Actions are token predictions from the LLM.
  • The action space is the vocabulary size of the model.
  • Actions are sampled using the model's output logits during generation.

TinyZero:Action
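
To make the action space above concrete, here is a minimal, illustrative sketch (not veRL code) of sampling a single token "action" from a causal LM's logits, assuming the Hugging Face transformers API and the local Qwen2.5 path used later in this page:

# Illustrative only: one RL "action" = sampling one token id from the policy's output logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/Qwen2.5-3B-Instruct"  # assumed local path, matching $BASE_MODEL below
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

state = tokenizer("Using the numbers 3, 5, 7, make 12:", return_tensors="pt")
logits = model(**state).logits[:, -1, :]          # distribution over the whole vocabulary
probs = torch.softmax(logits / 1.0, dim=-1)       # temperature = 1.0
action = torch.multinomial(probs, num_samples=1)  # sampled token id = the RL action
print(tokenizer.decode(action[0]))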

2. Environment

  • The environment is implicit in the conversation/task setup.
  • Input prompts serve as the initial state.
  • The model generates responses (actions) token by token.
  • The environment transitions are deterministic based on token generation.
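
Viewed as an MDP, the "environment" is just the growing token sequence; the following is a toy illustration of that view (not veRL code):

# Illustrative MDP view of LLM generation: state = prompt + tokens generated so far.
class TokenEnv:
    def __init__(self, prompt_ids, eos_id, max_len=1024):
        self.state = list(prompt_ids)   # the input prompt is the initial state
        self.eos_id = eos_id
        self.max_len = max_len

    def step(self, action_token_id):
        """Deterministic transition: append the chosen token to the sequence."""
        self.state.append(action_token_id)
        done = action_token_id == self.eos_id or len(self.state) >= self.max_len
        return self.state, done          # the reward is assigned later, on the full sequence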

3. Reward Function

  • Implemented through the RewardManager class.
  • Supports multiple reward types:
    • Rule-based rewards (e.g., GSM8K math problem scoring).
    • Model-based rewards (using a separate reward model).
    • KL penalties to limit deviation from the reference policy.
  • Rewards can be computed at token-level or sequence-level.

TinyZero:reward function
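
As an illustration of the rule-based case, here is a minimal sketch of a reward function in the spirit of the RewardManager (hypothetical helper, not veRL's actual class) that scores a GSM8K-style answer by exact match on the final number:

import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy GSM8K-style scorer: 1.0 if the last number in the response
    matches the ground-truth answer, else 0.0 (a sequence-level reward)."""
    numbers = re.findall(r"-?\d+\.?\d*", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1].rstrip(".") == ground_truth.strip() else 0.0

# Example: prints 1.0
print(rule_based_reward("... so the answer is 42", "42"))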

4. Policy

  • The LLM itself serves as the policy network.
  • Uses actor-critic architecture:
    • Actor: The LLM generating responses.
    • Critic: Value network estimating expected returns.
  • Reference policy for KL control.
  • Supports both FSDP and Megatron parallel training strategies.
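
A minimal sketch of the actor-critic split, assuming the transformers API: the actor is the causal LM itself, and the critic reuses the same backbone with a scalar value head (illustrative layout, not veRL's actual modules):

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class CriticWithValueHead(nn.Module):
    """Critic: shares the LLM backbone and adds a per-token scalar value head."""
    def __init__(self, model_path="models/Qwen2.5-3B-Instruct"):  # assumed local path
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.bfloat16, output_hidden_states=True
        )
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids, attention_mask=attention_mask)
        last_hidden = out.hidden_states[-1]                     # (batch, seq, hidden)
        values = self.value_head(last_hidden.to(torch.float32))  # cast for the fp32 head
        return values.squeeze(-1)                                # per-token value estimates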

5. Training Loop

  • Implements the PPO algorithm with:
    • Experience collection through batch generation.
    • Advantage estimation (GAE or GRPO).
    • Policy updates with KL divergence control.
    • Value function updates.
    • Sequence length balancing for efficient parallel training.

TinyZero:training
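
Here is a self-contained sketch of GAE (Generalized Advantage Estimation) over per-token rewards and value estimates; it follows the standard formulation rather than veRL's exact implementation:

import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard GAE: A_t = sum_k (gamma*lam)^k * delta_{t+k},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = rewards.shape[-1]
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[..., t + 1] if t + 1 < T else 0.0
        delta = rewards[..., t] + gamma * next_value - values[..., t]
        gae = delta + gamma * lam * gae
        advantages[..., t] = gae
    returns = advantages + values
    return advantages, returns

# Example: a sequence-level reward placed on the last token.
rewards = torch.tensor([[0.0, 0.0, 0.0, 1.0]])
values = torch.tensor([[0.1, 0.2, 0.3, 0.4]])
adv, ret = compute_gae(rewards, values)
print(adv)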

6. Distributed Architecture

  • Ray-based distributed training.
  • Resource pool management for different roles (Actor, Critic, Reference Policy).
  • Support for model parallelism and data parallelism.
  • Efficient batch processing with sequence length balancing.
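
A minimal sketch of the Ray pattern this builds on (illustrative roles, not veRL's actual worker classes): each role runs as a remote actor reserving its own GPU from the resource pool, and the driver coordinates them. Running it as-is needs three visible GPUs.

import ray

ray.init()  # on a cluster, connect to the existing Ray head instead

@ray.remote(num_gpus=1)
class RolloutWorker:
    """Stands in for an actor / critic / reference-policy worker holding model shards."""
    def __init__(self, role: str):
        self.role = role

    def ping(self) -> str:
        return f"{self.role} ready"

# One worker per role, each reserving one GPU from Ray's resource pool.
workers = [RolloutWorker.remote(role) for role in ("actor", "critic", "reference")]
print(ray.get([w.ping.remote() for w in workers]))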

Modularity and Extensibility

The system is designed to be modular and extensible, allowing for:

  • Different reward computation methods.
  • Multiple parallel training strategies.
  • Various advantage estimation techniques.
  • Flexible model architectures and tokenizers.
  • Custom environment configurations through prompts and reward functions.

This architecture enables effective RL fine-tuning of LLMs while handling the complexities of large model training and distributed computation.

Preparation

Working directory on a Perlmutter compute node: ~/workspace-scratch/TinyZero

https://gist.github.com/chunhualiao/e901003664f01c69ba840670cf85a609

pip install -r requirements.txt

export WANDB_API_KEY=your_api_key  # replace with your real Weights & Biases API key


export N_GPUS=4
export BASE_MODEL=$PWD/models/Qwen2.5-3B-Instruct
export DATA_DIR=$PWD/my_dataset
export ROLLOUT_TP_SIZE=4
export EXPERIMENT_NAME=countdown-qwen2.5-3b-instruct-4gpu
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh

  • Where is the model saved? In the checkpoints directory.
  • Test the model before and after RL-based training (see the checkpoint-loading sketch under Expected Output below).
  • Test submission through the batch job system for full-scale training.

Training

train_tiny_zero.sh

The script:

  • Initializes model-parallel training with 4-way tensor slicing.
  • Loads 327,680 training samples from my_dataset/train.parquet.
  • Uses XFormers-optimized attention kernels.
  • Saves checkpoints to checkpoints/countdown-qwen2.5-3b-instruct-4gpu.

Implements RL training with:

  • PPO algorithm
  • Reward modeling
  • Experience rollout
  • Policy updates

Key components (conceptual sketch; the script itself launches verl.trainer.main_ppo with the configuration overrides described in the Details section, not this Python code):

# Conceptual sketch only; illustrative class names, not veRL's actual API:
trainer = PPOTrainer(
    model=AutoModelForCausalLM.from_pretrained(BASE_MODEL),
    reward_model=RuleBasedReward(),
    tokenizer=AutoTokenizer.from_pretrained(BASE_MODEL),
    rollout_parallel_size=ROLLOUT_TP_SIZE,
    experiment_name=EXPERIMENT_NAME
)
trainer.train(data_dir=DATA_DIR)

Expected Output:

  • Training logs with metrics (loss, reward scores)
  • Regular model checkpoints
  • WandB integration for experiment tracking
  • Final model in checkpoints/[EXPERIMENT_NAME]/final_model
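
To compare behavior before and after training, the sketch below loads either the base model or a saved checkpoint and generates from a countdown-style prompt. It assumes the checkpoint directory is in Hugging Face format; the actual layout and path depend on the veRL checkpointing configuration, so adjust accordingly.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Point this either at the base model or at a saved checkpoint directory
# (hypothetical path; check what the run actually wrote under checkpoints/).
path = "checkpoints/countdown-qwen2.5-3b-instruct-4gpu/final_model"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Using the numbers 2, 3, 7, create an equation that equals 17."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))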

Details

Script Breakdown: train_tiny_zero.sh

The script train_tiny_zero.sh is a shell script that launches PPO (Proximal Policy Optimization) training within the veRL framework. Here's a detailed breakdown of its components and functionality:

1. Command Execution

  • The script runs a Python command using python3 -m verl.trainer.main_ppo, which initializes the PPO training process.

2. Data Configuration

  • data.train_files: Specifies the path to the training data file, expected to be a Parquet file located in the directory referenced by $DATA_DIR/train.parquet.
  • data.val_files: Specifies the path to the validation data file, similarly located at $DATA_DIR/test.parquet.
  • data.train_batch_size: Sets the batch size for training to 256 samples.
  • data.val_batch_size: Sets the batch size for validation to 1312 samples.
  • data.max_prompt_length and data.max_response_length: Define the maximum sequence lengths for prompts (256 tokens) and responses (1024 tokens), respectively.
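
A quick way to sanity-check these files is to open them with pandas; column names vary by dataset, so print the schema rather than assuming it:

import pandas as pd

train = pd.read_parquet("my_dataset/train.parquet")  # $DATA_DIR/train.parquet
val = pd.read_parquet("my_dataset/test.parquet")     # $DATA_DIR/test.parquet

print(train.shape, val.shape)    # row counts
print(train.columns.tolist())    # inspect the schema instead of guessing column names
print(train.iloc[0])             # look at one example end to end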

3. Model and Training Parameters

  • actor_rollout_ref.model.path: The path to the base model used for training, referenced by the variable $BASE_MODEL.
  • actor_rollout_ref.actor.optim.lr: Sets the learning rate for the actor optimizer to 1e-6.
  • actor_rollout_ref.actor.ppo_mini_batch_size: Configures the mini-batch size for PPO training to 128.
  • actor_rollout_ref.actor.ppo_micro_batch_size: Sets the micro-batch size to 8.
  • actor_rollout_ref.rollout.log_prob_micro_batch_size: Configures the micro-batch size for log probability calculations during rollout to 8.
  • actor_rollout_ref.rollout.tensor_model_parallel_size: Determines the tensor model parallel size, set via the variable $ROLLOUT_TP_SIZE.
  • actor_rollout_ref.rollout.gpu_memory_utilization: Sets GPU memory utilization to 0.4 (40%).
  • actor_rollout_ref.ref.log_prob_micro_batch_size: Configures the reference model's micro-batch size for log probabilities to 4.
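
The batch-size settings nest as follows; this is a worked example under the usual PPO batching semantics, where micro-batches are gradient-accumulation slices of a mini-batch:

train_batch_size = 256      # prompts collected per PPO iteration
ppo_mini_batch_size = 128   # samples per PPO optimizer step
ppo_micro_batch_size = 8    # samples per forward/backward pass

mini_batches_per_iteration = train_batch_size // ppo_mini_batch_size        # 2
micro_batches_per_mini_batch = ppo_mini_batch_size // ppo_micro_batch_size  # 16
print(mini_batches_per_iteration, micro_batches_per_mini_batch)             # 2 16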

4. Critic Configuration

  • critic.optim.lr: Sets the learning rate for the critic optimizer to 1e-5.
  • critic.model.path: Specifies the model path for the critic, using the same base model as the actor.
  • critic.ppo_micro_batch_size: Sets the micro-batch size for the critic's PPO training to 8.

5. Algorithm and Training Settings

  • algorithm.kl_ctrl.kl_coef: Sets the KL coefficient for KL control to 0.001.
  • trainer.logger: Enables logging using Weights & Biases (wandb).
  • trainer.val_before_train: Disables validation before training by setting it to False.
  • trainer.default_hdfs_dir: Configures the default HDFS directory to null, indicating no default HDFS storage.
  • trainer.n_gpus_per_node: Specifies the number of GPUs per node, set via the variable $N_GPUS.
  • trainer.nnodes: Sets the number of nodes to 1, indicating a single-machine training setup.
  • trainer.save_freq and trainer.test_freq: Configure the model to save every 100 iterations and test every 100 iterations, respectively.
  • trainer.project_name and trainer.experiment_name: Set the project name to "TinyZero" and the experiment name to the value of $EXPERIMENT_NAME.
  • trainer.total_epochs: Specifies the total number of training epochs as 15.
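
The kl_coef=0.001 enters the reward roughly as follows; this is a standard per-token KL penalty sketch, not veRL's exact estimator:

import torch

def kl_penalized_reward(task_reward, logprobs, ref_logprobs, kl_coef=0.001):
    """Subtract kl_coef * (log pi(a|s) - log pi_ref(a|s)) from each token's reward,
    discouraging drift away from the reference policy."""
    kl = logprobs - ref_logprobs          # per-token KL estimate
    return task_reward - kl_coef * kl

# Example with a sequence-level task reward placed on the last token:
task_reward = torch.tensor([0.0, 0.0, 1.0])
logprobs = torch.tensor([-1.2, -0.8, -0.5])
ref_logprobs = torch.tensor([-1.0, -1.0, -0.9])
print(kl_penalized_reward(task_reward, logprobs, ref_logprobs))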

6. Logging and Output

  • The script redirects both standard output and standard error to a log file named verl_demo.log using 2>&1 | tee verl_demo.log, ensuring that all training logs are captured for future reference.

Flexibility and Customization

This script is designed to be flexible, allowing users to customize key parameters such as data paths, model configurations, and training settings by modifying the environment variables ($DATA_DIR, $BASE_MODEL, $ROLLOUT_TP_SIZE, $N_GPUS, $EXPERIMENT_NAME). The use of variables makes the script adaptable to different training environments and experimental setups without needing to edit the script itself. The integration with Weights & Biases facilitates comprehensive logging and monitoring of the training process, aiding in hyperparameter tuning and performance analysis.

Troubleshooting

(main_task pid=490163) /global/homes/l/liaoch/.local/perlmutter/pytorch1.13.1/lib/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:

(main_task pid=490163) No module named 'vllm._version'

This warning indicates a broken or mismatched vLLM installation; reinstalling vLLM 0.6.3 from source fixes it:

# First uninstall existing vllm
pip uninstall -y vllm

# Install vllm 0.6.3 from GitHub with build isolation disabled
pip install git+https://github.com/vllm-project/vllm@v0.6.3 --no-build-isolation

# Verify installation
python -c "from vllm.version import __version__; print(__version__)"