veRL:trainer main_generation.py - chunhualiao/public-docs GitHub Wiki

TinyZero

Configuration file

verl/trainer/config/generation.yaml

trainer:
  nnodes: 1
  n_gpus_per_node: 1

actor:
  ulysses_sequence_parallel_size: 1

data:
  path: my_dataset/test.parquet
  prompt_key: prompt
  n_samples: 1
  output_path: my_dataset/test_answers.parquet
  batch_size: 1

model:
  path: models/Qwen2.5-3B-Instruct
  external_lib: null

rollout:
  name: vllm
  temperature: 1.0
  n: 1
  top_k: 50 # 0 for hf rollout, -1 for vllm rollout
  top_p: 0.7
  prompt_length: 1536
  response_length: 512
  # for vllm rollout
  dtype: bfloat16 # should align with FSDP
  gpu_memory_utilization: 0.5
  ignore_eos: False
  micro_batch_size: 256
  enforce_eager: True
  free_cache_engine: True
  load_format: dummy_dtensor
  tensor_model_parallel_size: 1
  max_num_batched_tokens: 8192
  max_num_seqs: 1024
  log_prob_micro_batch_size: 8
  # for hf rollout
  do_sample: True

High-Level Explanation

main_generation.py is a Python script that generates responses for a dataset of prompts with a pre-trained language model. It uses the Ray framework to distribute the work across GPUs and the Hugging Face Transformers tokenizer to encode prompts and decode outputs; with the configuration above, generation itself runs through the vLLM rollout (rollout.name: vllm). The script reads the prompts from a Parquet file, generates data.n_samples responses per prompt, and saves the original data together with the responses to a new Parquet file.

Low-Level Explanation

1. Imports and Environment Setup

  • The script imports necessary libraries such as ray, numpy, hydra, os, pandas, transformers, and custom modules from the verl package.
  • It sets two environment variables before any work starts: NCCL_DEBUG (to WARN) and TOKENIZERS_PARALLELISM (to true).

2. Model and Tokenizer Initialization

  • The script loads a Hugging Face tokenizer through the hf_tokenizer helper from the verl package.
  • It sets the padding side to 'left' and defines a pad token if it doesn't exist.
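
Left padding matters for generation because a decoder-only model appends new tokens on the right, so the real prompt tokens should sit directly next to the position where the response starts. The sketch below shows the idea on plain Python lists; left_pad is a hypothetical helper, and pad_id=0 is an arbitrary choice for the example, not the tokenizer's actual pad id.

```python
def left_pad(batch: list[list[int]], pad_id: int) -> tuple[list[list[int]], list[list[int]]]:
    """Pad every sequence on the left to the longest length in the batch."""
    width = max(len(ids) for ids in batch)
    input_ids = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return input_ids, attention_mask


input_ids, attention_mask = left_pad([[11, 12, 13], [21]], pad_id=0)
print(input_ids)       # [[11, 12, 13], [0, 0, 21]]
print(attention_mask)  # [[1, 1, 1], [0, 0, 1]]
```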

3. Dataset Reading

  • The script reads a dataset from a Parquet file using pandas.
  • It extracts the prompts from the dataset and converts them to a list format.

4. Ray Worker Setup

  • The script sets up a Ray worker group using RayClassWithInitArgs, RayResourcePool, and RayWorkerGroup.
  • It initializes the model on the worker group.

5. Batch Processing

  • The script processes the dataset in batches to handle large datasets efficiently.
  • For each batch, it tokenizes the prompts, computes position IDs, and prepares the input data.
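
With left-padded inputs, position IDs cannot simply count from the start of the row, because the leading pad tokens would shift every real token. One common convention, shown here as a sketch, derives positions from the attention mask with a running sum. Whether the verl helper computes them in exactly this way is an assumption; position_ids_from_mask is a hypothetical name.

```python
def position_ids_from_mask(attention_mask: list[int]) -> list[int]:
    """Return 0-based positions for real tokens; padding positions stay at 0."""
    positions = []
    seen = 0  # number of real tokens encountered so far
    for is_real in attention_mask:
        seen += is_real
        positions.append(max(seen - 1, 0))
    return positions


print(position_ids_from_mask([0, 0, 1, 1, 1]))  # [0, 0, 0, 1, 2]
```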

6. Response Generation

  • The script generates responses for each prompt in the batch using the model.
  • It appends dummy entries when needed so that the batch size is divisible by the data-parallel size.
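
The divisibility requirement exists because the batch is split evenly across the data-parallel workers. A minimal sketch of the pad-then-trim idea on plain lists follows; pad_to_multiple and drop_dummy are hypothetical helpers, and filling with copies of existing items is one possible choice, not necessarily what the script does.

```python
def pad_to_multiple(batch: list, dp_size: int) -> tuple[list, int]:
    """Append dummy items until len(batch) is a multiple of dp_size."""
    if not batch:
        raise ValueError("batch must not be empty")
    remainder = len(batch) % dp_size
    n_dummy = 0 if remainder == 0 else dp_size - remainder
    # Reuse existing items (cycling if needed) as filler.
    dummy = [batch[i % len(batch)] for i in range(n_dummy)]
    return batch + dummy, n_dummy


def drop_dummy(outputs: list, n_dummy: int) -> list:
    """Remove the outputs that correspond to the dummy items."""
    return outputs[: len(outputs) - n_dummy]


padded, n_dummy = pad_to_multiple(["p0", "p1", "p2", "p3", "p4"], dp_size=4)
outputs = [f"answer to {p}" for p in padded]  # stand-in for generation
real_outputs = drop_dummy(outputs, n_dummy)
print(len(padded), n_dummy, len(real_outputs))  # 8 3 5
```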

7. Post-Processing

  • The script removes dummy data and padding from the generated responses.
  • It collects the responses and transposes the collected list, so that each dataset row ends up with all the samples generated for its prompt.
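
The transpose step can be pictured on a small example. Responses are collected one sampling pass at a time, giving a list shaped (n_samples, n_prompts), while the output needs one entry per prompt, shaped (n_prompts, n_samples). The sketch below uses zip on plain lists; the strings are made-up placeholders.

```python
# Collected per sampling pass: 2 samples for each of 3 prompts.
output_lst = [
    ["a0", "b0", "c0"],  # first pass over prompts a, b, c
    ["a1", "b1", "c1"],  # second pass
]

# Transpose so each inner list holds every sample for one prompt.
per_prompt = [list(samples) for samples in zip(*output_lst)]
print(per_prompt)  # [['a0', 'a1'], ['b0', 'b1'], ['c0', 'c1']]
```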

8. Saving Results

  • The script adds the generated responses to the original dataset and saves the updated dataset to a new Parquet file.

Detailed Breakdown

1. Environment Setup

os.environ['NCCL_DEBUG'] = 'WARN'
os.environ['TOKENIZERS_PARALLELISM'] = 'true'
  • Limits NCCL logging to warnings and enables parallelism in the Hugging Face tokenizers library.

2. Model and Tokenizer Initialization

tokenizer = hf_tokenizer(local_path)
  • Initializes a tokenizer from a local path using a custom function hf_tokenizer.

3. Dataset Reading

dataset = pd.read_parquet(config.data.path)
chat_lst = dataset[config.data.prompt_key].tolist()
  • Reads the dataset from a Parquet file and extracts the prompts.

4. Ray Worker Setup

ray_cls_with_init = RayClassWithInitArgs(cls=ray.remote(ActorRolloutRefWorker), config=config, role='rollout')
resource_pool = RayResourcePool(process_on_nodes=[config.trainer.n_gpus_per_node] * config.trainer.nnodes)
wg = RayWorkerGroup(resource_pool=resource_pool, ray_cls_with_init=ray_cls_with_init)
wg.init_model()
  • Sets up a Ray worker group with the specified configuration and initializes the model.
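
The process_on_nodes argument is a list with one entry per node, each giving the number of worker processes on that node. The sketch below works through that arithmetic without Ray. The data-parallel size computed at the end (world size divided by rollout.tensor_model_parallel_size) is an assumption about how the layout is derived, and worker_layout is a hypothetical helper.

```python
def worker_layout(nnodes: int, n_gpus_per_node: int, tensor_model_parallel_size: int) -> dict:
    """Compute the per-node process list and the derived group sizes."""
    process_on_nodes = [n_gpus_per_node] * nnodes
    world_size = sum(process_on_nodes)
    if world_size % tensor_model_parallel_size != 0:
        raise ValueError("world size must be divisible by the tensor-parallel size")
    return {
        "process_on_nodes": process_on_nodes,
        "world_size": world_size,
        # Assumed relationship: each tensor-parallel group is one data-parallel rank.
        "dp_size": world_size // tensor_model_parallel_size,
    }


print(worker_layout(nnodes=2, n_gpus_per_node=4, tensor_model_parallel_size=2))
# {'process_on_nodes': [4, 4], 'world_size': 8, 'dp_size': 4}
```

With the configuration shown earlier (one node, one GPU, tensor parallel size 1), all three values reduce to a single worker.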

5. Batch Processing

for batch_idx in range(num_batch):
    batch_chat_lst = chat_lst[batch_idx * config_batch_size:(batch_idx + 1) * config_batch_size]
    inputs = tokenizer.apply_chat_template(batch_chat_lst, ...)
  • Processes the dataset in batches, tokenizes the prompts, and prepares the input data.
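
The slicing pattern in the loop can be tried in isolation. In this sketch the number of batches is computed with a ceiling division, which is an assumption since the excerpt does not show how num_batch is defined; iter_batches is a hypothetical helper. The last batch may be shorter than the others.

```python
import math
from typing import Iterator


def iter_batches(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    num_batch = math.ceil(len(items) / batch_size)
    for batch_idx in range(num_batch):
        yield items[batch_idx * batch_size:(batch_idx + 1) * batch_size]


batches = list(iter_batches(list(range(7)), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```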

6. Response Generation

output = wg.generate_sequences(data)
output_text = tokenizer.batch_decode(output.batch['input_ids'][:, -config.rollout.response_length:], ...)
  • Generates responses for each batch, then decodes the last response_length tokens of each output sequence to text.
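
The negative slice works because each output row is the prompt followed by the response, so the final response_length columns are the response part. The sketch below shows the same slice on plain lists, assuming every response is padded on the right to exactly response_length tokens; the token ids and the pad id 0 are made up.

```python
response_length = 4

# Each row: 4 prompt tokens (left-padded) followed by 4 response tokens (right-padded).
sequences = [
    [0, 0, 101, 102, 7, 8, 9, 0],
    [103, 104, 105, 106, 5, 6, 0, 0],
]

# Equivalent of tensor[:, -response_length:] for a list of lists.
responses = [row[-response_length:] for row in sequences]
print(responses)  # [[7, 8, 9, 0], [5, 6, 0, 0]]
```

The trailing pad ids that remain in each slice are the reason the decoded text still contains pad tokens, which the next step strips.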

7. Post-Processing

output_text_unpad = [text.replace(pad_token, '') for text in output_text]
output_lst[i].extend(output_text_unpad)
  • Removes padding from the generated responses and collects them.

8. Saving Results

dataset[f'responses'] = output_lst
dataset.to_parquet(config.data.output_path)
  • Adds the generated responses to the dataset and saves it to a new Parquet file.

This script is designed to efficiently generate responses to a large dataset of prompts using distributed computing and natural language processing techniques.