TinyZero:dataset

TinyZero

a sample record from the small test dataset

{
  "target": 36,
  "nums": [
    79,
    17,
    60
  ],
  "data_source": "countdown",
  "prompt": [
    {
      "content": "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [79, 17, 60], create an equation that equals 36. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.\nAssistant: Let me solve this step by step.\n<think>",
      "role": "user"
    }
  ],
  "ability": "math",
  "reward_model": {
    "ground_truth": {
      "numbers": [
        79,
        17,
        60
      ],
      "target": 36
    },
    "style": "rule"
  },
  "extra_info": {
    "index": 0,
    "split": "test"
  }
}
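
To inspect a record like this yourself, a few lines of pandas are enough. This is a minimal sketch; the path my_dataset/test.parquet is an assumption, so point it at wherever $DATA_DIR/test.parquet lives in your setup.

# Minimal sketch: load the Countdown test split and dump one record.
# The path below is an assumption; adjust it to your $DATA_DIR.
import json
import pandas as pd

df = pd.read_parquet("my_dataset/test.parquet")
print(df.columns.tolist())   # target, nums, data_source, prompt, ability, reward_model, extra_info
record = df.iloc[0].to_dict()
print(json.dumps(record, indent=2, default=str))  # default=str handles numpy types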

another view: the same records printed by my_dataset/read_test_data.py

[liaoch@login17:~/workspace-scratch/TinyZero]python my_dataset/read_test_data.py

Prompt examples:

Prompt 0: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [79, 17, 60], create an equation that equals 36. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.\nAssistant: Let me solve this step by step.\n<think>', 'role': 'user'}]

Prompt 1: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [11, 34, 82, 80], create an equation that equals 56. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.\nAssistant: Let me solve this step by step.\n<think>', 'role': 'user'}]

Prompt 2: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [51, 4, 60, 35], create an equation that equals 49. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.\nAssistant: Let me solve this step by step.\n<think>', 'role': 'user'}]

Prompt 3: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [34, 98, 1, 96], create an equation that equals 33. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.\nAssistant: Let me solve this step by step.\n<think>', 'role': 'user'}]

Prompt 4: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: Using the numbers [46, 9, 49, 56], create an equation that equals 29. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.\nAssistant: Let me solve this step by step.\n<think>', 'role': 'user'}]
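
The read_test_data.py script itself is not reproduced on this page; below is a minimal sketch of what such a reader can look like. The file name test.parquet and the choice of printing five prompts are assumptions.

# Hypothetical sketch of my_dataset/read_test_data.py (the real script is
# not reproduced here). Assumes the test split is my_dataset/test.parquet.
import pandas as pd

df = pd.read_parquet("my_dataset/test.parquet")

print("Prompt examples:\n")
for i in range(min(5, len(df))):
    # Each row stores the chat-style prompt as a list of {'content', 'role'} dicts.
    print(f"Prompt {i}: {list(df.iloc[i]['prompt'])}\n")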

Does the Code Use Pre-Existing Data or Data Generated by the Agent Interacting with an Environment?

Summary

The code uses a pre-existing dataset rather than data generated by an agent interacting with a dynamic environment. The training process is dataset-driven: prompts come from pre-loaded .parquet files, and only the responses are generated by the model during training.

How to Check This in the Code and Configuration

1. Dataset Specification in train_tiny_zero.sh

  • The script specifies training and validation data files using:
    --data.train_files=$DATA_DIR/train.parquet
    --data.val_files=$DATA_DIR/test.parquet
  • These files contain a pre-existing dataset of prompts together with the reward metadata (ground truth numbers and target), as in the record shown above.

2. Data Loading: RLHFDataset Class

  • The RLHFDataset class (located in verl/utils/dataset/rl_dataset.py) is responsible for reading these .parquet files.
  • It does not interact with an external environment; it only reads records from the .parquet files (a simplified sketch of the pattern follows below).
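
As a simplified illustration of that pattern (not verl's actual RLHFDataset, whose real code lives in verl/utils/dataset/rl_dataset.py), a parquet-backed dataset can be as small as the sketch below; the class name is hypothetical and the field names follow the record shown earlier.

# Simplified sketch of a parquet-backed prompt dataset. This is NOT verl's
# RLHFDataset; it only illustrates that every item comes from a static file.
import pandas as pd
from torch.utils.data import Dataset

class ParquetPromptDataset(Dataset):
    def __init__(self, parquet_file: str):
        # The whole split is read once from disk -- no environment is contacted.
        self.frame = pd.read_parquet(parquet_file)

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        return {
            "prompt": list(row["prompt"]),        # chat-style messages for the actor
            "reward_model": row["reward_model"],  # ground truth for the rule-based reward
        }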

3. DataLoader Creation in RayPPOTrainer

  • The _create_dataloader() method in verl/trainer/ppo/ray_trainer.py:
    self.train_loader = self._create_dataloader(self.config.data.train_files)
  • This method creates an RLHFDataset instance and wraps it in a DataLoader, ensuring that training batches come from the pre-existing files (see the sketch below).
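
Continuing the sketch above, wrapping that dataset in a standard PyTorch DataLoader mirrors conceptually what _create_dataloader() does; the batch size and collate function here are illustrative choices, not verl's defaults.

# Sketch: wrap the parquet-backed dataset in a DataLoader.
from torch.utils.data import DataLoader

def create_dataloader(parquet_file: str, batch_size: int = 8) -> DataLoader:
    dataset = ParquetPromptDataset(parquet_file)
    # Keep each batch as a plain list of records, since prompts are
    # variable-length message lists rather than fixed-size tensors.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      collate_fn=lambda batch: batch)

train_loader = create_dataloader("my_dataset/train.parquet")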

4. Training Process in RayPPOTrainer.fit()

  • The fit() method loads the dataset through the DataLoader and passes it into the training loop:
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
  • Key detail:
    • The agent (actor) generates responses during training.
    • However, the prompts for these generations come from pre-existing data, not from an external interactive environment.

Is There Any Environment Interaction?

  • Traditional RL environments (e.g., a game simulator) are not used.
  • The "environment" is implicitly defined by the dataset.
  • The agent learns by responding to prompts from the dataset and receiving scores from a rule-based reward function (an illustrative sketch follows).
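
Because the record's reward_model has "style": "rule" and carries the ground truth (numbers and target), the reward can be computed without any environment. The function below is an illustrative sketch of such a rule-based check, not TinyZero's actual reward code.

# Illustrative sketch of a rule-based Countdown reward (not TinyZero's actual
# reward function): extract the equation from <answer> tags, check that each
# provided number is used at most once, and verify it evaluates to the target.
import re

def countdown_reward(response: str, ground_truth: dict) -> float:
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()

    # Only digits, arithmetic operators, parentheses, and whitespace are allowed.
    if not re.fullmatch(r"[\d+\-*/() .]+", equation):
        return 0.0

    # Each number must come from the provided list and be used at most once.
    available = list(ground_truth["numbers"])
    for n in (int(tok) for tok in re.findall(r"\d+", equation)):
        if n in available:
            available.remove(n)
        else:
            return 0.0

    try:
        value = eval(equation, {"__builtins__": {}}, {})  # charset already restricted above
    except Exception:
        return 0.0
    return 1.0 if abs(value - ground_truth["target"]) < 1e-6 else 0.0

# Example: 79 + 17 - 60 = 36, so this response scores 1.0.
print(countdown_reward("<think>try sums</think><answer>79 + 17 - 60</answer>",
                       {"numbers": [79, 17, 60], "target": 36}))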

Conclusion

The training process is dataset-driven:

  • Prompts come from .parquet files (pre-existing dataset).
  • The agent generates responses but does not create new prompts dynamically.
  • No interaction with an external RL environment like a game or simulator.

🚀 To verify this behavior in the code, check:

  • Dataset paths in train_tiny_zero.sh (--data.train_files, --data.val_files).
  • RLHFDataset implementation (loads .parquet data).
  • RayPPOTrainer._create_dataloader() (confirms dataset loading).
  • RayPPOTrainer.fit() (agent generates responses based on dataset prompts).

📌 Bottom line: The agent learns from a static dataset rather than an evolving environment. 🎯
