09. Fine‐Tuning - AmirYunus/finetune_LLM GitHub Wiki
In this section, we initialise the SFTTrainer, which is designed for supervised fine-tuning of language models. The trainer requires several parameters to be set, including the model, tokeniser, and training dataset.
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,  # the keyword argument is spelled "tokenizer"
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
    ),
)
The SFTTrainer is initialised with several key components. The model parameter specifies the pre-trained model that will be fine-tuned on the dataset. The tokenizer parameter supplies the tokeniser, which converts text into token IDs that the model can process. The train_dataset is the dataset used for training, which should be structured appropriately for the task. The dataset_text_field indicates which field in the dataset contains the text to be processed, while max_seq_length sets the maximum input length in tokens so that no sequence exceeds it.
Additionally, the data_collator prepares batches of data for training. Here it is DataCollatorForSeq2Seq, which pads the input IDs and labels in each batch to a common length. The dataset_num_proc parameter sets how many processes are used to preprocess the dataset, which can speed up tokenisation. The packing option, when set to True, concatenates shorter sequences into a single training sequence to improve efficiency; it is set to False here, so each example is kept as its own sequence.
The TrainingArguments class encapsulates various hyperparameters and settings for the training process. This includes per_device_train_batch_size, which defines the number of samples processed in one iteration per device (e.g., GPU), and gradient_accumulation_steps, which specifies the number of steps to accumulate gradients before updating the model weights. The warmup_steps parameter determines the number of steps during which the learning rate increases gradually, while max_steps sets the total number of training steps to perform.
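The interaction between these settings can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a single GPU (num_devices = 1 is not stated on this page) and uses the values from the TrainingArguments above.

```python
# Values taken from the TrainingArguments on this page.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

# Assumption: training runs on a single device.
num_devices = 1

# Samples that contribute to each weight update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total samples consumed over the whole run.
samples_seen = effective_batch_size * max_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Samples seen in {max_steps} steps: {samples_seen}")
```

With these numbers, each optimiser step averages gradients over 8 samples while only 2 samples need to fit in GPU memory at once.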
The learning_rate parameter specifies the peak learning rate for the optimiser, which is reached at the end of warm-up, and the flags fp16 and bf16 enable mixed precision training based on hardware capabilities: bf16 is used when the GPU supports it, and fp16 otherwise. The logging_steps parameter controls how often training metrics are logged (here, every step), and the optim parameter specifies the optimiser to use, in this case an 8-bit version of AdamW that reduces optimiser memory. The weight_decay parameter applies regularisation to help prevent overfitting, and lr_scheduler_type defines the type of learning rate scheduler to use. Finally, the seed parameter makes results reproducible, while output_dir specifies the directory where model outputs (like checkpoints) will be saved. The report_to parameter selects the logging/reporting integration, and is set to "none" to disable external reporting.
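To illustrate how warmup_steps, max_steps, learning_rate and the "linear" scheduler fit together, here is a minimal sketch of a linear warm-up followed by linear decay. The function name linear_schedule_lr is hypothetical, and the formula is an assumption about the schedule's shape rather than the library's exact code; it does, however, reproduce the first three learning rates in the example training log further down this page (4e-05, 8e-05, 0.00012).

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate after `step` scheduler steps (hypothetical helper)."""
    if step < warmup_steps:
        # Warm-up: grow linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: shrink linearly from base_lr down to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / (max_steps - warmup_steps)


for step in (1, 2, 3, 5, 30, 60):
    print(step, linear_schedule_lr(step))
```

The rate peaks at 2e-4 on step 5 and falls back to zero by step 60.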
To enhance the model's ability to generate contextually relevant replies, we focus the training on responses only. This method isolates the training process, allowing for more targeted fine-tuning.
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
The code modifies the trainer object so that the loss is computed only on the assistant's responses. The user's prompts remain in the input as context, but the model is not trained to reproduce them, which is useful when the goal is to improve the quality of the assistant's outputs.
The function train_on_responses_only is invoked with the current trainer instance and two parameters: instruction_part and response_part.
- instruction_part: This parameter is set to the special token sequence "<|start_header_id|>user<|end_header_id|>\n\n", the header that opens each user turn. It lets the function identify which part of the data is user-generated.
- response_part: This parameter is set to the special token sequence "<|start_header_id|>assistant<|end_header_id|>\n\n", the header that opens each assistant turn. Like instruction_part, it lets the function locate where the assistant's output begins.
By using these parameters, the train_on_responses_only function masks every token outside the assistant's responses by setting its label to -100, so those positions are ignored when the loss is computed. The model therefore learns only from the assistant's outputs during fine-tuning, which improves its ability to generate contextually relevant, high-quality responses.
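The masking idea can be sketched with a toy example. This is not the library's implementation: the function mask_non_response and the token IDs are hypothetical, and each header is reduced to a single marker token for simplicity.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_non_response(token_ids, instruction_id, response_id):
    """Keep labels only for tokens that follow a response marker."""
    labels = []
    in_response = False
    for tok in token_ids:
        if tok == instruction_id:
            in_response = False          # a user turn starts: stop learning
            labels.append(IGNORE_INDEX)
        elif tok == response_id:
            in_response = True           # an assistant turn starts
            labels.append(IGNORE_INDEX)  # the marker itself is masked
        else:
            labels.append(tok if in_response else IGNORE_INDEX)
    return labels


# Hypothetical IDs: 1 = user header, 2 = assistant header, 9 = end of turn.
token_ids = [1, 10, 11, 9, 2, 20, 21, 9]
labels = mask_non_response(token_ids, instruction_id=1, response_id=2)
print(labels)  # [-100, -100, -100, -100, -100, 20, 21, 9]
```

Only the assistant's tokens and its end-of-turn token keep real labels, which matches the shape of the decoded labels shown later on this page.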
We can decode a sample input from the training dataset to verify its formatting. This step ensures that the data is correctly structured for training.
The following line of code uses the tokeniser to decode the input IDs from the training dataset at index 5. The input IDs represent the tokenised version of the text that the model will use for training. Decoding these IDs will convert them back into a human-readable string format, allowing us to inspect the actual text that corresponds to the token IDs stored in the dataset.
decoded_input = tokenizer.decode(trainer.train_dataset[5]["input_ids"])
decoded_input
Output:
'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analysing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'
The trainer object manages the training process, and train_dataset is the dataset being used for training. The index [5] specifies that we are interested in the sixth entry of the dataset (as indexing starts from 0). The "input_ids" key accesses the tokenised input text for that specific entry.
By decoding the input IDs, we can verify the content of the training data and ensure that it is formatted correctly for the model's training. This step is important for debugging and ensuring that the model receives the correct input format during training.
Similarly, we decode the labels to confirm their correctness. This is crucial for ensuring that the model learns from the right outputs during training.
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
decoded_labels = tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
decoded_labels
Output:
' \n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analysing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'
The first line retrieves the input ID for a space character from the tokeniser. This is done without adding any special tokens, as we only want the standard space character representation. The space token ID will be used later to replace any masked positions in the labels.
Next, we decode the labels from the training dataset at index 5. The labels are token IDs, and we need to convert them back to a human-readable format. However, some label values are -100, which marks positions that are ignored when the loss is computed. Because -100 is not a valid token ID, it cannot be decoded directly.
We use a list comprehension to iterate through the labels. For each label, we check if it is -100. If it is, we replace it with the space token ID we retrieved earlier. Otherwise, we keep the original token ID. This swaps each -100 placeholder for a decodable token while preserving the rest of the labels.
Finally, we decode the modified list of token IDs back into a string, which shows the actual text of the labels with the masked positions rendered as spaces. In the output above, only the assistant's response remains, confirming that the model learns from the appropriate outputs during training.
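To see why positions labelled -100 do not influence training, the following sketch computes a masked cross-entropy loss in plain Python. The function masked_cross_entropy is a hypothetical stand-in; it mirrors the default behaviour of PyTorch's cross-entropy loss, which skips targets equal to -100 and averages over the rest.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not -100."""
    total, counted = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[label]
        counted += 1
    return total / counted if counted else 0.0


# Two positions over a 3-token vocabulary; the first position is masked.
labels = [IGNORE_INDEX, 1]
loss_a = masked_cross_entropy([[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]], labels)
loss_b = masked_cross_entropy([[0.0, 9.0, 0.0], [0.0, 0.0, 0.0]], labels)
print(loss_a == loss_b)  # True: the masked row never affects the loss
```

Changing the logits at the masked position leaves the loss unchanged, so no gradient flows from the prompt tokens.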
Monitoring GPU memory usage is essential during training. This section retrieves and displays the current memory statistics of the GPU being used.
import torch

gpu_stats = torch.cuda.get_device_properties(0)  # properties of the first GPU
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)  # peak reserved so far, in GB
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)  # total GPU memory, in GB
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Example Output:
GPU = NVIDIA GeForce RTX 4070 Ti SUPER. Max memory = 15.695 GB.
2.635 GB of memory reserved.
The code snippet is responsible for monitoring GPU memory usage during the training process of a model using PyTorch.
First, the code retrieves the properties of the GPU device at index 0 (the first GPU in the system). The gpu_stats variable will hold various attributes of the GPU, including its name and total memory capacity.
Next, the code calculates the peak amount of GPU memory reserved by the process so far. The torch.cuda.max_memory_reserved() function returns the peak memory reserved by PyTorch's caching allocator, in bytes. Dividing by 1024 three times converts bytes to gigabytes (strictly, gibibytes), and the result is rounded to three decimal places for clarity.
Then, the code retrieves the total memory available on the GPU from the gpu_stats object. Similar to the previous calculation, the total memory is converted from bytes to gigabytes by dividing it by 1024 three times and rounded to three decimal places.
The code then prints the name of the GPU and its maximum memory capacity in gigabytes. This information is useful for users to quickly understand the hardware being utilised for training.
Finally, the code prints the amount of memory reserved before training starts. This baseline, which mainly reflects the loaded model, is later subtracted from the peak after training to estimate how much memory the training run itself required, and it helps to anticipate out-of-memory errors.
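The repeated division by 1024 can be wrapped in a small helper. This is a sketch only; the name bytes_to_gib is hypothetical and does not appear in the original code.

```python
def bytes_to_gib(num_bytes, ndigits=3):
    """Convert a byte count to gibibytes (1 GiB = 1024**3 bytes)."""
    return round(num_bytes / 1024**3, ndigits)


print(bytes_to_gib(16 * 1024**3))    # 16.0
print(bytes_to_gib(1536 * 1024**2))  # 1.5
```

With such a helper, the two conversions above would read bytes_to_gib(torch.cuda.max_memory_reserved()) and bytes_to_gib(gpu_stats.total_memory).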
The training process is initiated here. The train() method of the trainer object is called to start the fine-tuning process.
trainer_stats = trainer.train()
Example Output:
{'loss': 1.0564, 'grad_norm': 0.20524080097675323, 'learning_rate': 4e-05, 'epoch': 0.0}
{'loss': 1.1463, 'grad_norm': 0.33220812678337097, 'learning_rate': 8e-05, 'epoch': 0.0}
{'loss': 0.9626, 'grad_norm': 0.28336429595947266, 'learning_rate': 0.00012, 'epoch': 0.0}
Output is truncated...
The line trainer.train() calls the train() method of the trainer object. This method is crucial as it starts the fine-tuning process for the model. The trainer object is typically pre-configured with various essential parameters, including the dataset to be used for training, the hyperparameters that dictate the training behaviour (like learning rate, batch size, etc.), and the architecture of the model itself.
The train() method encapsulates the entire training loop. This includes:
- Forward Passes: The model processes the input data to make predictions.
- Loss Calculation: The difference between the model's predictions and the actual target values is computed to determine how well the model is performing.
- Backpropagation: The gradients of the loss with respect to the model parameters are calculated, allowing the model to learn from its mistakes.
- Optimisation Steps: The model parameters are updated based on the computed gradients to minimise the loss.
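The four stages listed above can be sketched with a deliberately tiny example: a one-weight model fitted to y = 3x by hand. It is an illustration of the loop's structure under simplifying assumptions (plain gradient descent, no scheduler, made-up data), not what the trainer runs internally.

```python
# Toy data for y = 3x; the "model" is a single weight w.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0
learning_rate = 0.01
accumulation_steps = 4  # micro-batches accumulated per optimiser step
max_steps = 60

for step in range(max_steps):
    grad_sum = 0.0
    for x, y in data[:accumulation_steps]:
        pred = w * x                    # forward pass
        loss = (pred - y) ** 2          # loss calculation
        grad_sum += 2 * (pred - y) * x  # backpropagation: d(loss)/d(w)
    # Optimisation step: apply the averaged, accumulated gradient once.
    w -= learning_rate * grad_sum / accumulation_steps

print(f"Learned weight: {w:.4f}")
```

After 60 steps the weight ends up very close to 3, the value that generated the data.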
The output of the training process is stored in the trainer_stats variable. This is a TrainOutput object with three fields:
- Global Step: The number of optimisation steps completed.
- Training Loss: The average training loss over the run.
- Metrics: A dictionary of run statistics, such as train_runtime (the training duration in seconds), which is used later in this section. No evaluation metrics appear here because no evaluation dataset was passed to the trainer.
After training, we display the final memory usage and training time statistics. This information is vital for understanding the resource consumption during the training process.
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
Example Output:
124.4697 seconds used for training.
2.07 minutes used for training.
Peak reserved memory = 3.676 GB.
Peak reserved memory for training = 1.041 GB.
Peak reserved memory % of max memory = 23.421 %.
Peak reserved memory for training % of max memory = 6.633 %.
This code snippet calculates the peak amount of GPU memory reserved by the process using the max_memory_reserved function from the torch.cuda module, which returns the peak reserved memory in bytes. The value is converted to gigabytes (GB) by dividing by 1024 three times (bytes to kilobytes to megabytes to gigabytes). The result is rounded to three decimal places for clarity.
Next, the code estimates the additional GPU memory required by the LoRA (Low-Rank Adaptation) training run by subtracting the memory reserved before training (start_gpu_memory) from the peak reserved memory (used_memory). This value is also rounded to three decimal places.
The peak usage as a percentage of total GPU memory is calculated by dividing used_memory by the max_memory available on the GPU and multiplying by 100. The result is rounded to three decimal places.
Similarly, the code computes the percentage of total GPU memory attributable to training by dividing used_memory_for_lora by max_memory and multiplying by 100, rounding the result to three decimal places.
The total time taken for training is printed in seconds and retrieved from the trainer_stats object, which contains metrics related to the training session. Additionally, the total training time is printed in minutes, converting seconds to minutes by dividing by 60 and rounding to two decimal places for a more human-readable format.
The peak reserved memory in gigabytes is printed, giving an overview of the maximum memory usage during the training process. The additional peak memory reserved for training is also printed in gigabytes, showing how much memory the LoRA fine-tuning run consumed beyond the baseline.
Finally, the peak reserved memory is expressed as a percentage of the maximum memory available on the GPU, which indicates how much headroom remains. The additional memory reserved for training is also printed as a percentage of the maximum memory, giving insight into the overhead introduced by LoRA training.
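As a check on the arithmetic, the sketch below recomputes the memory figures from the example outputs on this page without needing a GPU. The function summarise_memory is a hypothetical helper; the three input values are copied from the example outputs above.

```python
def summarise_memory(start_gb, peak_gb, total_gb):
    """Derive the training-only memory and both percentages."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


# Values from the example outputs: 2.635 GB reserved before training,
# 3.676 GB peak after training, 15.695 GB total on the GPU.
summary = summarise_memory(start_gb=2.635, peak_gb=3.676, total_gb=15.695)
print(summary)
```

The results agree with the printed example: 1.041 GB for training, 23.421 % of total memory at peak, and 6.633 % attributable to training.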