TinyZero:dataset - chunhualiao/public-docs GitHub Wiki
{
"target": 36,
"nums": [
79,
17,
60
],
"data_source": "countdown",
"prompt": [
{
"content": "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [79, 17, 60], create an equation that equals 36. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.\nAssistant: Let me solve this step by step.\n<think>",
"role": "user"
}
],
"ability": "math",
"reward_model": {
"ground_truth": {
"numbers": [
79,
17,
60
],
"target": 36
},
"style": "rule"
},
"extra_info": {
"index": 0,
"split": "test"
}
}
[liaoch@login17:~/workspace-scratch/TinyZero]python my_dataset/read_test_data.py
Prompt examples:
Prompt 0: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [79, 17, 60], create an equation that equals 36. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in tags. And return the final answer in tags, for example (1 + 2) / 3 .\nAssistant: Let me solve this step by step.\n', 'role': 'user'}]
Prompt 1: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [11, 34, 82, 80], create an equation that equals 56. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in tags. And return the final answer in tags, for example (1 + 2) / 3 .\nAssistant: Let me solve this step by step.\n', 'role': 'user'}]
Prompt 2: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [51, 4, 60, 35], create an equation that equals 49. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in tags. And return the final answer in tags, for example (1 + 2) / 3 .\nAssistant: Let me solve this step by step.\n', 'role': 'user'}]
Prompt 3: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [34, 98, 1, 96], create an equation that equals 33. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in tags. And return the final answer in tags, for example (1 + 2) / 3 .\nAssistant: Let me solve this step by step.\n', 'role': 'user'}]
Prompt 4: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [46, 9, 49, 56], create an equation that equals 29. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in tags. And return the final answer in tags, for example (1 + 2) / 3 .\nAssistant: Let me solve this step by step.\n', 'role': 'user'}]
The code uses pre-existing data from a dataset rather than data generated solely through an agent interacting with a dynamic environment. The training process is dataset-driven, with prompts and responses coming from pre-loaded .parquet
files.
- The script specifies training and validation data files using:
--data.train_files=$DATA_DIR/train.parquet --data.val_files=$DATA_DIR/test.parquet
- These files contain a pre-existing dataset of prompts and responses.
- The
RLHFDataset
class (located inverl/utils/dataset/rl_dataset.py
) is responsible for reading these.parquet
files. - It does not interact with an external environment; it only loads data from the dataset.
- The
_create_dataloader()
method inverl/trainer/ppo/ray_trainer.py
:self.train_loader = self._create_dataloader(self.config.data.train_files)
- This function creates instances of
RLHFDataset
andDataLoader
, ensuring that training data comes from pre-existing files.
- The
fit()
method loads the dataset through theDataLoader
and passes it into the training loop:gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
-
Key detail:
- The agent (actor) generates responses during training.
- However, the prompts for these generations come from pre-existing data, not an external interactive environment.
- Traditional RL environments (e.g., a game simulator) are not used.
- The "environment" is implicitly defined by the dataset.
- The agent learns by responding to prompts in the dataset and receiving rewards from the reward function.
✅ The training process is dataset-driven:
-
Prompts come from
.parquet
files (pre-existing dataset). - The agent generates responses but does not create new prompts dynamically.
- No interaction with an external RL environment like a game or simulator.
🚀 To verify this behavior in the code, check:
-
Dataset paths in
train_tiny_zero.sh
(--data.train_files
,--data.val_files
). -
RLHFDataset
implementation (loads.parquet
data). -
RayPPOTrainer._create_dataloader()
(confirms dataset loading). -
RayPPOTrainer.fit()
(agent generates responses based on dataset prompts).
📌 Bottom line: The agent learns from a static dataset rather than an evolving environment. 🎯