08. Data Preprocessing - AmirYunus/finetune_LLM GitHub Wiki

8.1 Load Dataset

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split="train")

In this initial step, we load the dataset with the load_dataset function from the Hugging Face datasets library. The first argument names the dataset on the Hugging Face Hub, "mlabonne/FineTome-100k", and split="train" selects its training split.

8.2 Convert the Dataset to Hugging Face Format

from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)

The FineTome-100k dataset stores conversations in the ShareGPT format, where each turn has a "from" key (the speaker) and a "value" key (the text). Chat templates expect the Hugging Face format instead, where each message has a "role" and a "content" key, so the dataset must be converted before a template can be applied.

Example conversion from ShareGPT format:

{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}

To Hugging Face format:

{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}

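Once the Llama 3.1 chat template is applied in the later steps, each conversation in the Hugging Face format is rendered as a single string delimited by special tokens:

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>[system message]<|eot_id|>
<|start_header_id|>user<|end_header_id|>[user message]<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>[assistant message]<|eot_id|>

The line breaks above are for readability only. As the output in 8.5 shows, the rendered text is one continuous string that ends with the final <|eot_id|>.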

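This conversion is performed by the standardize_sharegpt function from Unsloth's chat_templates module. ShareGPT here refers to a data format, not a model: the function renames the "from"/"value" keys to "role"/"content" and maps the speaker names "human" and "gpt" to the roles "user" and "assistant". Having every conversation in the same structure is what allows a single chat template to be applied consistently across the whole dataset.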
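To make the key and role renaming concrete, here is a minimal pure-Python sketch of the mapping shown in the example above. The helper name sharegpt_to_messages and the ROLE_MAP table are hypothetical and only illustrate the idea; this is not the implementation of standardize_sharegpt, which operates on a whole dataset.

```python
# Hypothetical illustration of the ShareGPT -> role/content mapping.
# Not the library's implementation; names here are assumptions.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(turns):
    """Convert a list of ShareGPT-style turns into role/content messages."""
    messages = []
    for turn in turns:
        role = ROLE_MAP[turn["from"]]  # raises KeyError on an unexpected speaker
        messages.append({"role": role, "content": turn["value"]})
    return messages


sharegpt = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
messages = sharegpt_to_messages(sharegpt)
print(messages)
```

Failing loudly on an unknown speaker is a deliberate choice in this sketch: a silently mislabelled role would be hard to spot once the conversations are rendered to text.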

8.3 Configure the Tokenizer to Use Llama 3.1 Instruct Chat Template

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

Following the standardisation, we configure the tokeniser to use the "llama-3.1" chat template. The get_chat_template function returns the tokeniser with this template attached. A chat template defines how a list of role/content messages is serialised into a single string, including the special tokens that mark where each message starts and ends. Using the template that matches the Llama 3.1 Instruct format matters because the model expects conversations delimited in exactly this way during both training and inference.
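As a rough illustration of what a chat template does, the sketch below serialises role/content messages using the Llama 3.1 special tokens listed in 8.2. The function render_llama31 is hypothetical and deliberately simplified: it is not the template that get_chat_template installs, and it omits the default system block with the date lines that the real template adds.

```python
def render_llama31(messages):
    """Serialise role/content messages with Llama 3.1-style special tokens.

    Simplified sketch: no default system block, no tool handling.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "It's 4."},
]
text = render_llama31(messages)
print(text)
```

Each message becomes a header naming the role, a blank line, the content, and a closing <|eot_id|>, which is the same pattern visible in the real output in 8.5.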

8.4 Format the Dataset for Training

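def formatting_prompts_func(examples):
    # With batched=True, examples["conversations"] is a list of conversations
    convos = examples["conversations"]
    # tokenize=False returns strings; no trailing assistant header is appended
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}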

dataset = dataset.map(formatting_prompts_func, batched=True)

The next phase of preprocessing formats the dataset for training. The custom function formatting_prompts_func takes a batch of examples, reads their "conversations" column, and applies the chat template to each conversation. Setting tokenize=False makes apply_chat_template return formatted strings rather than token IDs, and add_generation_prompt=False omits the trailing assistant header, since every conversation already contains the assistant's reply. The returned strings are stored in a new "text" column. Passing batched=True to dataset.map hands the function many examples at once, which is more efficient than processing them one at a time.
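The batch contract is easy to get wrong, so the sketch below emulates it in plain Python: the function receives a dictionary of column name to list of values and returns a dictionary of new columns of the same length. The helper map_batched is hypothetical and is not how the datasets library is implemented, and the formatting function uses a simple "role: content" join as a stand-in for tokenizer.apply_chat_template so that the example runs without a model.

```python
def map_batched(rows, fn, batch_size=2):
    """Apply fn to column-wise batches of rows and merge the returned columns."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Row-wise list of dicts -> column-wise dict of lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            out.append(merged)
    return out


def formatting_prompts_func(examples):
    # Stand-in for the chat template: join "role: content" lines per conversation.
    convos = examples["conversations"]
    texts = ["\n".join(f"{m['role']}: {m['content']}" for m in convo) for convo in convos]
    return {"text": texts}


rows = [
    {"conversations": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]},
    {"conversations": [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "4"}]},
    {"conversations": [{"role": "user", "content": "Bye"}]},
]
result = map_batched(rows, formatting_prompts_func)
print(result[0]["text"])
```

The original "conversations" column is kept and "text" is added alongside it, which matches what the verification step in 8.5 relies on when it reads both fields from the same entry.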

8.5 Verify the Formatting of the Dataset

dataset[5]["conversations"]

Output:

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
'role': 'user'},
{'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analysing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
'role': 'assistant'}]

dataset[5]["text"]

Output:

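Note: Llama 3.1 Instruct's default chat template adds a system block containing "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024", which is why the output below starts with a system header even though the conversation itself has no system message.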

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analysing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'

Finally, we verify the formatting by inspecting a single entry, here index 5. The "conversations" field holds the role/content messages, and the "text" field holds the same conversation rendered with the Llama 3.1 template. If both messages appear in order, each preceded by its role header and closed by <|eot_id|>, the preprocessing has worked as intended and the data is ready for the next stages of training.
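Inspecting one entry by eye works for a spot check; the same checks can also be written as a small function and run over several entries. The sketch below is one possible way to do that. The helper check_formatted is hypothetical, and the checks it performs (start token, end token, messages in order) are assumptions drawn from the output shown above rather than an official validation routine.

```python
def check_formatted(example):
    """Return a list of problems found in one formatted example (empty if none)."""
    text = example["text"]
    problems = []
    if not text.startswith("<|begin_of_text|>"):
        problems.append("missing <|begin_of_text|> at start")
    if not text.endswith("<|eot_id|>"):
        problems.append("does not end with <|eot_id|>")
    position = 0
    for message in example["conversations"]:
        header = f"<|start_header_id|>{message['role']}<|end_header_id|>"
        header_at = text.find(header, position)
        if header_at == -1:
            problems.append(f"no header for role {message['role']!r}")
            continue
        # Compare stripped content, in case the template trims whitespace.
        content = message["content"].strip()
        content_at = text.find(content, header_at)
        if content_at == -1:
            problems.append(f"content of {message['role']!r} message not found")
            continue
        position = content_at + len(content)
    return problems


good = {
    "conversations": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "It's 4."},
    ],
    "text": (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\nWhat is 2+2?<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\nIt's 4.<|eot_id|>"
    ),
}
bad = dict(good, text=good["text"].replace("<|begin_of_text|>", ""))
print(check_formatted(good))
print(check_formatted(bad))
```

On the real dataset this could be called as check_formatted(dataset[i]) for a handful of indices; the default system block added by the template does not affect the result, because the search starts from each message's own role header.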
