# veRL: trainer main_generation.py
## Configuration file

`verl/trainer/config/generation.yaml`:
```yaml
trainer:
  nnodes: 1
  n_gpus_per_node: 1

actor:
  ulysses_sequence_parallel_size: 1

data:
  path: my_dataset/test.parquet
  prompt_key: prompt
  n_samples: 1
  output_path: my_dataset/test_answers.parquet
  batch_size: 1

model:
  path: models/Qwen2.5-3B-Instruct
  external_lib: null

rollout:
  name: vllm
  temperature: 1.0
  n: 1
  top_k: 50 # 0 for hf rollout, -1 for vllm rollout
  top_p: 0.7
  prompt_length: 1536
  response_length: 512
  # for vllm rollout
  dtype: bfloat16 # should align with FSDP
  gpu_memory_utilization: 0.5
  ignore_eos: False
  micro_batch_size: 256
  enforce_eager: True
  free_cache_engine: True
  load_format: dummy_dtensor
  tensor_model_parallel_size: 1
  max_num_batched_tokens: 8192
  max_num_seqs: 1024
  log_prob_micro_batch_size: 8
  # for hf rollout
  do_sample: True
```
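The script consumes this file through Hydra. A minimal sketch of how such an entry point loads the config and accepts command-line overrides; the `config_path`/`config_name` arguments here are assumptions inferred from the file's location, not copied from verl's source:

```python
# Hypothetical Hydra entry point for generation.yaml (a sketch, not verl's code).
import hydra
from omegaconf import OmegaConf

@hydra.main(config_path='verl/trainer/config', config_name='generation', version_base=None)
def main(config):
    # Every field above can be overridden on the command line, e.g.:
    #   python this_script.py data.batch_size=8 rollout.temperature=0.7
    print(OmegaConf.to_yaml(config))

if __name__ == '__main__':
    main()
```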
## High-Level Explanation

`main_generation.py` generates responses for a dataset of prompts with a pretrained language model. It uses the Ray framework for distributed execution and the Hugging Face Transformers library for tokenization. The script reads prompts from a Parquet file, batches and tokenizes them, generates responses on a pool of rollout workers (vLLM or Hugging Face, per `rollout.name`), and writes the results to a new Parquet file.
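Before the step-by-step walkthrough, here is a condensed single-process sketch of the same pipeline without the Ray/vLLM machinery, using the paths and sampling parameters from the config above. It illustrates the data flow only; it is not verl's actual implementation:

```python
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = 'models/Qwen2.5-3B-Instruct'          # model.path from the config
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side='left')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='auto')

dataset = pd.read_parquet('my_dataset/test.parquet')   # data.path
chat_lst = [list(chat) for chat in dataset['prompt']]  # each row is a list of chat turns

inputs = tokenizer.apply_chat_template(
    chat_lst, add_generation_prompt=True, padding=True,
    return_tensors='pt', return_dict=True, tokenize=True).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512,   # response_length
                         do_sample=True, temperature=1.0, top_p=0.7, top_k=50)

# Keep only the newly generated tokens, then decode.
responses = tokenizer.batch_decode(out[:, inputs['input_ids'].shape[1]:],
                                   skip_special_tokens=True)
dataset['responses'] = [[r] for r in responses]  # n_samples=1: one response per prompt
dataset.to_parquet('my_dataset/test_answers.parquet')   # data.output_path
```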
## Low-Level Explanation

1. **Imports and Environment Setup**
   - The script imports `ray`, `numpy`, `hydra`, `os`, `pandas`, `transformers`, and custom modules from the `verl` package.
   - It sets environment variables for NCCL logging and tokenizer parallelism.
2. **Model and Tokenizer Initialization**
   - The script initializes a tokenizer using the Hugging Face `AutoTokenizer` class.
   - It sets the padding side to `left` and defines a pad token if one does not exist.
3. **Dataset Reading**
   - The script reads a dataset from a Parquet file using `pandas`.
   - It extracts the prompts from the dataset and converts them to a list.
4. **Ray Worker Setup**
   - The script sets up a Ray worker group using `RayClassWithInitArgs`, `RayResourcePool`, and `RayWorkerGroup`.
   - It initializes the model on the worker group.
5. **Batch Processing**
   - The script processes the dataset in batches to handle large datasets efficiently.
   - For each batch, it tokenizes the prompts, computes position IDs, and prepares the input data.
6. **Response Generation**
   - The script generates responses for each prompt in the batch using the model.
   - It appends dummy rows so that the batch size is divisible by the data-parallel size (see the sketch in the Detailed Breakdown below).
7. **Post-Processing**
   - The script removes the dummy rows and padding from the generated responses.
   - It collects the responses and transposes the list so there is one row per prompt.
8. **Saving Results**
   - The script adds the generated responses to the original dataset and saves the updated dataset to a new Parquet file.
## Detailed Breakdown
### 1. Environment Setup

```python
os.environ['NCCL_DEBUG'] = 'WARN'
os.environ['TOKENIZERS_PARALLELISM'] = 'true'
```

Limits NCCL logging to warnings and enables parallelism in the `tokenizers` library.
### 2. Model and Tokenizer Initialization

```python
tokenizer = hf_tokenizer(local_path)
```

Initializes a tokenizer from a local path using verl's `hf_tokenizer` helper.
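A plausible sketch of what such a helper does, based on the behavior described in the overview (left padding, pad-token fallback); the function body is an assumption, not verl's implementation:

```python
from transformers import AutoTokenizer

def hf_tokenizer_sketch(local_path: str):
    """Assumed behavior: load the tokenizer, pad on the left, ensure a pad token."""
    tokenizer = AutoTokenizer.from_pretrained(local_path)
    tokenizer.padding_side = 'left'                # decoder-only models pad on the left
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # fall back to the EOS token
    return tokenizer
```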
### 3. Dataset Reading

```python
dataset = pd.read_parquet(config.data.path)
chat_lst = dataset[config.data.prompt_key].tolist()
```

Reads the dataset from a Parquet file and extracts the prompt column as a list.
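Since the prompts are later passed through a chat template, each entry in the prompt column is expected to be a chat-formatted conversation. A hypothetical example of building such an input file:

```python
import pandas as pd

# Each row of the prompt column is a list of role/content messages.
pd.DataFrame({
    'prompt': [
        [{'role': 'user', 'content': 'What is 2 + 2?'}],
        [{'role': 'user', 'content': 'Name a prime number greater than 10.'}],
    ]
}).to_parquet('my_dataset/test.parquet')
```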
### 4. Ray Worker Setup

```python
ray_cls_with_init = RayClassWithInitArgs(cls=ray.remote(ActorRolloutRefWorker), config=config, role='rollout')
resource_pool = RayResourcePool(process_on_nodes=[config.trainer.n_gpus_per_node] * config.trainer.nnodes)
wg = RayWorkerGroup(resource_pool=resource_pool, ray_cls_with_init=ray_cls_with_init)
wg.init_model()
```

Wraps `ActorRolloutRefWorker` as a Ray remote class, reserves `n_gpus_per_node` processes on each of the `nnodes` nodes, builds the worker group, and loads the model weights on every worker.
### 5. Batch Processing

```python
for batch_idx in range(num_batch):
    batch_chat_lst = chat_lst[batch_idx * config_batch_size:(batch_idx + 1) * config_batch_size]
    inputs = tokenizer.apply_chat_template(batch_chat_lst, ...)
```

Slices the prompt list into fixed-size batches, tokenizes each batch with the model's chat template, and prepares the input data.
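The snippet above elides the template arguments. A plausible expansion, assuming the prompts are padded to `rollout.prompt_length` and that position IDs are derived from the attention mask (`tokenizer`, `batch_chat_lst`, and `config` refer to the surrounding snippets; the exact keyword arguments are assumptions):

```python
import torch

inputs = tokenizer.apply_chat_template(
    batch_chat_lst,
    add_generation_prompt=True,    # append the assistant turn header
    padding='max_length',          # pad every prompt to the same length
    truncation=True,
    max_length=config.rollout.prompt_length,
    return_tensors='pt',
    return_dict=True,
    tokenize=True,
)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
# Position IDs count only real (non-padding) tokens; this cumulative-sum
# expression is the standard way to derive them from the mask.
position_ids = torch.clip(torch.cumsum(attention_mask, dim=-1) - 1, min=0)
```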
### 6. Response Generation

```python
output = wg.generate_sequences(data)
output_text = tokenizer.batch_decode(output.batch['input_ids'][:, -config.rollout.response_length:], ...)
```

Dispatches generation to the worker group and decodes the last `response_length` tokens of each returned sequence back to text.
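The "dummy data" mentioned in the overview refers to padding the batch so it divides evenly across the data-parallel workers. A simplified sketch of that logic, assuming `data` is a verl `DataProto` (sliceable and concatenable) and `dp_size` is the data-parallel world size; the variable names are assumptions:

```python
real_batch_size = data.batch['input_ids'].shape[0]
if real_batch_size % dp_size != 0:
    # Repeat the first rows as dummies so every worker gets an equal shard.
    num_dummy = dp_size - real_batch_size % dp_size
    data = DataProto.concat([data, data[:num_dummy]])

output = wg.generate_sequences(data)
output = output[:real_batch_size]  # discard the dummy rows afterwards
```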
### 7. Post-Processing

```python
output_text_unpad = [text.replace(pad_token, '') for text in output_text]
output_lst[i].extend(output_text_unpad)
```

Strips pad tokens from the decoded responses and collects them into per-sample lists.
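The responses are collected sample-major (`output_lst[sample][prompt]`), while the output file needs one row per prompt. A toy illustration of the transpose; the layout is inferred from the overview, not from the source:

```python
# 2 samples x 2 prompts, collected sample-major.
output_lst = [['p0-s0', 'p1-s0'],
              ['p0-s1', 'p1-s1']]
# Transpose to prompt-major: one row per prompt with n_samples entries each.
per_prompt = [list(samples) for samples in zip(*output_lst)]
assert per_prompt == [['p0-s0', 'p0-s1'], ['p1-s0', 'p1-s1']]
```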
### 8. Saving Results

```python
dataset['responses'] = output_lst
dataset.to_parquet(config.data.output_path)
```

Adds the generated responses as a new `responses` column and writes the updated dataset to the output Parquet file.
In short, the script combines Ray-based distributed rollout with Hugging Face tokenization to generate responses for large prompt datasets efficiently.