Data Preparation - golololologol/LLM-Distillery GitHub Wiki

Data Preparation

The pipeline requires that your datasets adhere to a custom format. This standardization ensures consistent processing and simplifies the pipeline logic by reducing the need to handle multiple dataset formats.

Expected Dataset Format

Each dataset entry should be a JSON object with the following structure:

{
  "init": "Some system message",
  "conversations": [
    {"from": "human", "value": "User said hi"},
    {"from": "gpt", "value": "AI said hi in return"},
    {"from": "human", "value": "User makes a strong statement"}
  ],
  "source": "aaa_dataset",
  "tags": []
}

For completion tasks, you can use a simplified format. The speaker given in the "from" field does not affect processing in this mode, so any value is acceptable:

{
  "init": "",
  "conversations": [
    {"from": "human", "value": "Some completion text"}
  ],
  "source": "completion_dataset",
  "tags": ["completion"]
}
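
As a minimal sketch of how raw text could be wrapped into the completion entry shown above: the helper name make_completion_entry and its default source value are hypothetical and are not part of the repository.

```python
import json


def make_completion_entry(text: str, source: str = "completion_dataset") -> dict:
    """Wrap raw text in a completion-style entry (hypothetical helper)."""
    return {
        "init": "",  # no system message for completion data
        # The speaker is a placeholder: it does not affect processing
        # once the "completion" tag is present.
        "conversations": [{"from": "human", "value": text}],
        "source": source,
        "tags": ["completion"],
    }


entry = make_completion_entry("Some completion text")
print(json.dumps(entry, indent=2))
```

The printed object has the same fields and values as the completion example above.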

Field Descriptions

init:
This field holds the system message that sets the context for the conversation. It can be an empty string if no system prompt is needed.
conversations:
A list of conversation turns. Each turn is a JSON object that must include:
- from: Indicates the speaker, which should be either "human" or "gpt".
- value: Contains the actual text of the conversation turn.
  Note: For completion tasks, the speaker given in "from" does not affect processing; simply include the "completion" tag in the tags list.
source:
The name or identifier of the dataset from which the conversation originates. The pipeline does not currently use this field, but it is useful for filtering or tracing entries when multiple datasets are combined.
tags:
Tags provide additional metadata that determine how the conversation is processed:
- Without any tags, conversations are treated as instruct-based.
- Including the "completion" tag signals that the conversation should be processed as a completion task. In this mode, no extra prompt formatting is applied and the system message in init is left out, even if one is provided.
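
The field rules above can be checked before handing a dataset to the pipeline. The following is only a sketch under stated assumptions: validate_entry is a hypothetical helper, not a function from the repository, and requiring at least one conversation turn is an assumption rather than a documented rule. The source field is not checked because the pipeline does not use it.

```python
from typing import List

VALID_SPEAKERS = ("human", "gpt")


def validate_entry(entry: dict) -> List[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    if not isinstance(entry.get("init"), str):
        problems.append("init must be a string (it may be empty)")

    tags = entry.get("tags")
    if not isinstance(tags, list):
        problems.append("tags must be a list")
        tags = []
    is_completion = "completion" in tags

    turns = entry.get("conversations")
    if not isinstance(turns, list) or not turns:
        # Assumption: an entry without any turns is not useful.
        problems.append("conversations must be a non-empty list")
        turns = []

    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} must be a JSON object")
            continue
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: value must be a string")
        if "from" not in turn:
            problems.append(f"turn {i}: missing from")
        elif not is_completion and turn["from"] not in VALID_SPEAKERS:
            # The speaker only matters for instruct-style entries.
            problems.append(f"turn {i}: from must be 'human' or 'gpt'")
    return problems


example = {
    "init": "Some system message",
    "conversations": [
        {"from": "human", "value": "User said hi"},
        {"from": "gpt", "value": "AI said hi in return"},
    ],
    "source": "aaa_dataset",
    "tags": [],
}
print(validate_entry(example))
```

Returning a list of messages instead of raising on the first failure makes it easier to report every problem in a large dataset in one pass.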

Additional Resources

For added convenience, take a look at the utils/dataset_converter.py file. It may already include a function to convert your existing dataset into the required format.
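
If no existing converter fits, a small custom one is usually enough. The sketch below is not taken from utils/dataset_converter.py; it assumes a hypothetical input layout in which each conversation is a list of {"role", "content"} messages, and the function name convert_messages is invented for illustration.

```python
from typing import Dict, List, Optional

# Assumed mapping from the hypothetical input roles to the speakers
# expected by the pipeline.
ROLE_TO_FROM = {"user": "human", "assistant": "gpt"}


def convert_messages(
    messages: List[Dict[str, str]],
    source: str,
    tags: Optional[List[str]] = None,
) -> dict:
    """Convert role/content messages into one pipeline-format entry."""
    init = ""
    conversations = []
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            # Assumption: keep only the first non-empty system message.
            if not init:
                init = content
        elif role in ROLE_TO_FROM:
            conversations.append({"from": ROLE_TO_FROM[role], "value": content})
        else:
            raise ValueError(f"unsupported role: {role!r}")
    return {
        "init": init,
        "conversations": conversations,
        "source": source,
        "tags": list(tags or []),
    }


converted = convert_messages(
    [
        {"role": "system", "content": "Some system message"},
        {"role": "user", "content": "User said hi"},
        {"role": "assistant", "content": "AI said hi in return"},
    ],
    source="aaa_dataset",
)
print(converted)
```

Raising on an unknown role, rather than silently dropping the message, keeps malformed input from producing conversations with missing turns.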