# veRL: trainer main_generation.py
## Configuration file

`verl/trainer/config/generation.yaml`:
```yaml
trainer:
  nnodes: 1
  n_gpus_per_node: 1

actor:
  ulysses_sequence_parallel_size: 1

data:
  path: my_dataset/test.parquet
  prompt_key: prompt
  n_samples: 1
  output_path: my_dataset/test_answers.parquet
  batch_size: 1

model:
  path: models/Qwen2.5-3B-Instruct
  external_lib: null

rollout:
  name: vllm
  temperature: 1.0
  n: 1
  top_k: 50 # 0 for hf rollout, -1 for vllm rollout
  top_p: 0.7
  prompt_length: 1536
  response_length: 512
  # for vllm rollout
  dtype: bfloat16 # should align with FSDP
  gpu_memory_utilization: 0.5
  ignore_eos: False
  micro_batch_size: 256
  enforce_eager: True
  free_cache_engine: True
  load_format: dummy_dtensor
  tensor_model_parallel_size: 1
  max_num_batched_tokens: 8192
  max_num_seqs: 1024
  log_prob_micro_batch_size: 8
  # for hf rollout
  do_sample: True
```
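The script consumes this file through Hydra. A minimal sketch of how such an entry point loads the config and accepts command-line overrides; the `config_path`/`config_name` arguments here are assumptions inferred from the file's location, not copied from verl's source:

```python
# Hypothetical Hydra entry point for generation.yaml (a sketch, not verl's code).
import hydra
from omegaconf import OmegaConf

@hydra.main(config_path='verl/trainer/config', config_name='generation', version_base=None)
def main(config):
    # Every field above can be overridden on the command line, e.g.:
    #   python this_script.py data.batch_size=8 rollout.temperature=0.7
    print(OmegaConf.to_yaml(config))

if __name__ == '__main__':
    main()
```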
## High-Level Explanation

`main_generation.py` generates responses for a dataset of prompts with a pretrained language model. It uses the Ray framework for distributed execution and the Hugging Face Transformers library for tokenization. The script reads prompts from a Parquet file, batches and tokenizes them, generates responses on a pool of rollout workers (vLLM or Hugging Face, per `rollout.name`), and writes the results to a new Parquet file.
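Before the step-by-step walkthrough, here is a condensed single-process sketch of the same pipeline without the Ray/vLLM machinery, using the paths and sampling parameters from the config above. It illustrates the data flow only; it is not verl's actual implementation:

```python
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = 'models/Qwen2.5-3B-Instruct'          # model.path from the config
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side='left')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='auto')

dataset = pd.read_parquet('my_dataset/test.parquet')   # data.path
chat_lst = [list(chat) for chat in dataset['prompt']]  # each row is a list of chat turns

inputs = tokenizer.apply_chat_template(
    chat_lst, add_generation_prompt=True, padding=True,
    return_tensors='pt', return_dict=True, tokenize=True).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512,   # response_length
                         do_sample=True, temperature=1.0, top_p=0.7, top_k=50)

# Keep only the newly generated tokens, then decode.
responses = tokenizer.batch_decode(out[:, inputs['input_ids'].shape[1]:],
                                   skip_special_tokens=True)
dataset['responses'] = [[r] for r in responses]  # n_samples=1: one response per prompt
dataset.to_parquet('my_dataset/test_answers.parquet')   # data.output_path
```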
## Low-Level Explanation

1. **Imports and Environment Setup**
   - The script imports `ray`, `numpy`, `hydra`, `os`, `pandas`, `transformers`, and custom modules from the `verl` package.
   - It sets environment variables for NCCL logging and tokenizer parallelism.
2. **Model and Tokenizer Initialization**
   - The script initializes a tokenizer using the Hugging Face `AutoTokenizer` class.
   - It sets the padding side to `left` and defines a pad token if one does not exist.
3. **Dataset Reading**
   - The script reads a dataset from a Parquet file using `pandas`.
   - It extracts the prompts from the dataset and converts them to a list.
4. **Ray Worker Setup**
   - The script sets up a Ray worker group using `RayClassWithInitArgs`, `RayResourcePool`, and `RayWorkerGroup`.
   - It initializes the model on the worker group.
5. **Batch Processing**
   - The script processes the dataset in batches to handle large datasets efficiently.
   - For each batch, it tokenizes the prompts, computes position IDs, and prepares the input data.
6. **Response Generation**
   - The script generates responses for each prompt in the batch using the model.
   - It appends dummy rows so that the batch size is divisible by the data-parallel size (see the sketch in the Detailed Breakdown below).
7. **Post-Processing**
   - The script removes the dummy rows and padding from the generated responses.
   - It collects the responses and transposes the list so there is one row per prompt.
8. **Saving Results**
   - The script adds the generated responses to the original dataset and saves the updated dataset to a new Parquet file.
## Detailed Breakdown
### 1. Environment Setup

```python
os.environ['NCCL_DEBUG'] = 'WARN'
os.environ['TOKENIZERS_PARALLELISM'] = 'true'
```

Limits NCCL logging to warnings and enables parallelism in the `tokenizers` library.
### 2. Model and Tokenizer Initialization

```python
tokenizer = hf_tokenizer(local_path)
```

Initializes a tokenizer from a local path using verl's `hf_tokenizer` helper.
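A plausible sketch of what such a helper does, based on the behavior described in the overview (left padding, pad-token fallback); the function body is an assumption, not verl's implementation:

```python
from transformers import AutoTokenizer

def hf_tokenizer_sketch(local_path: str):
    """Assumed behavior: load the tokenizer, pad on the left, ensure a pad token."""
    tokenizer = AutoTokenizer.from_pretrained(local_path)
    tokenizer.padding_side = 'left'                # decoder-only models pad on the left
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # fall back to the EOS token
    return tokenizer
```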
### 3. Dataset Reading

```python
dataset = pd.read_parquet(config.data.path)
chat_lst = dataset[config.data.prompt_key].tolist()
```

Reads the dataset from a Parquet file and extracts the prompt column as a list.
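Since the prompts are later passed through a chat template, each entry in the prompt column is expected to be a chat-formatted conversation. A hypothetical example of building such an input file:

```python
import pandas as pd

# Each row of the prompt column is a list of role/content messages.
pd.DataFrame({
    'prompt': [
        [{'role': 'user', 'content': 'What is 2 + 2?'}],
        [{'role': 'user', 'content': 'Name a prime number greater than 10.'}],
    ]
}).to_parquet('my_dataset/test.parquet')
```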
### 4. Ray Worker Setup

```python
ray_cls_with_init = RayClassWithInitArgs(cls=ray.remote(ActorRolloutRefWorker), config=config, role='rollout')
resource_pool = RayResourcePool(process_on_nodes=[config.trainer.n_gpus_per_node] * config.trainer.nnodes)
wg = RayWorkerGroup(resource_pool=resource_pool, ray_cls_with_init=ray_cls_with_init)
wg.init_model()
```

Wraps `ActorRolloutRefWorker` as a Ray remote class, reserves `n_gpus_per_node` processes on each of the `nnodes` nodes, builds the worker group, and loads the model weights on every worker.
### 5. Batch Processing

```python
for batch_idx in range(num_batch):
    batch_chat_lst = chat_lst[batch_idx * config_batch_size:(batch_idx + 1) * config_batch_size]
    inputs = tokenizer.apply_chat_template(batch_chat_lst, ...)
```

Slices the prompt list into fixed-size batches, tokenizes each batch with the model's chat template, and prepares the input data.
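The snippet above elides the template arguments. A plausible expansion, assuming the prompts are padded to `rollout.prompt_length` and that position IDs are derived from the attention mask (`tokenizer`, `batch_chat_lst`, and `config` refer to the surrounding snippets; the exact keyword arguments are assumptions):

```python
import torch

inputs = tokenizer.apply_chat_template(
    batch_chat_lst,
    add_generation_prompt=True,    # append the assistant turn header
    padding='max_length',          # pad every prompt to the same length
    truncation=True,
    max_length=config.rollout.prompt_length,
    return_tensors='pt',
    return_dict=True,
    tokenize=True,
)
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
# Position IDs count only real (non-padding) tokens; this cumulative-sum
# expression is the standard way to derive them from the mask.
position_ids = torch.clip(torch.cumsum(attention_mask, dim=-1) - 1, min=0)
```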
### 6. Response Generation

```python
output = wg.generate_sequences(data)
output_text = tokenizer.batch_decode(output.batch['input_ids'][:, -config.rollout.response_length:], ...)
```

Dispatches generation to the worker group and decodes the last `response_length` tokens of each returned sequence back to text.
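The "dummy data" mentioned in the overview refers to padding the batch so it divides evenly across the data-parallel workers. A simplified sketch of that logic, assuming `data` is a verl `DataProto` (sliceable and concatenable) and `dp_size` is the data-parallel world size; the variable names are assumptions:

```python
real_batch_size = data.batch['input_ids'].shape[0]
if real_batch_size % dp_size != 0:
    # Repeat the first rows as dummies so every worker gets an equal shard.
    num_dummy = dp_size - real_batch_size % dp_size
    data = DataProto.concat([data, data[:num_dummy]])

output = wg.generate_sequences(data)
output = output[:real_batch_size]  # discard the dummy rows afterwards
```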
### 7. Post-Processing

```python
output_text_unpad = [text.replace(pad_token, '') for text in output_text]
output_lst[i].extend(output_text_unpad)
```

Strips pad tokens from the decoded responses and collects them into per-sample lists.
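The responses are collected sample-major (`output_lst[sample][prompt]`), while the output file needs one row per prompt. A toy illustration of the transpose; the layout is inferred from the overview, not from the source:

```python
# 2 samples x 2 prompts, collected sample-major.
output_lst = [['p0-s0', 'p1-s0'],
              ['p0-s1', 'p1-s1']]
# Transpose to prompt-major: one row per prompt with n_samples entries each.
per_prompt = [list(samples) for samples in zip(*output_lst)]
assert per_prompt == [['p0-s0', 'p0-s1'], ['p1-s0', 'p1-s1']]
```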
### 8. Saving Results

```python
dataset['responses'] = output_lst
dataset.to_parquet(config.data.output_path)
```

Adds the generated responses as a new `responses` column and writes the updated dataset to the output Parquet file.
In short, the script combines Ray-based distributed rollout with Hugging Face tokenization to generate responses for large prompt datasets efficiently.