06. Load Pre-Trained Model and Tokenizer - AmirYunus/finetune_LLM GitHub Wiki

Here, we will explore the process of loading a pre-trained language model and its associated tokeniser using the FastLanguageModel class from the unsloth library. This functionality is crucial for leveraging existing models that have already been trained on large datasets, allowing us to utilise their capabilities without the need to start from scratch. The model we will be loading is specifically named "unsloth/Llama-3.2-3B-Instruct," which is designed for instruction-based tasks.

```python
from unsloth import FastLanguageModel

max_seq_length = 2048  # maximum number of tokens per input
dtype = None  # auto-detect the optimal data type for the hardware
load_in_4bit = True  # enable 4-bit quantisation to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
```

6.1 Maximum Sequence Length

The maximum sequence length is a critical parameter that defines how many tokens the model can process in a single input. In our implementation, we set this value to 2048 tokens, which balances context capacity against resource cost: a longer limit lets the model see more context, but memory and compute requirements grow with sequence length. The ability to process longer sequences is particularly beneficial in natural language processing tasks, where context can significantly influence the quality of the output.
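To make the limit concrete, here is a minimal sketch of why it matters: any input longer than the maximum sequence length must be truncated (or split into chunks) before the model can process it. The whitespace-free token list below is purely illustrative; a real tokeniser maps text to subword IDs.

```python
MAX_SEQ_LENGTH = 2048

def truncate_tokens(tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Keep at most max_seq_length tokens, discarding the overflow."""
    return tokens[:max_seq_length]

tokens = [f"tok{i}" for i in range(3000)]  # pretend tokenised input
truncated = truncate_tokens(tokens)
print(len(truncated))  # 2048
```

In practice the tokeniser handles this for us when we pass `max_length` and `truncation` arguments, but the effect is the same: tokens beyond the limit never reach the model.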

6.2 Data Type

The data type used for model computations can greatly affect performance and memory usage. In our setup, we have opted to set the data type to None, which allows the system to automatically detect the optimal data type based on the hardware capabilities. This flexibility is essential for maximising the efficiency of the model, especially when running on different types of hardware, such as GPUs or TPUs, which may have varying support for different data types.
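As a rough sketch of what this auto-detection does: newer NVIDIA GPUs (Ampere and later, compute capability 8.0+) support bfloat16, while older ones fall back to float16. The capability tuple below is a hypothetical stand-in for what a call like `torch.cuda.get_device_capability()` returns; the exact detection logic inside the library may differ.

```python
def pick_dtype(compute_capability):
    """Choose a floating-point format from a (major, minor) GPU capability."""
    major, _minor = compute_capability
    # Ampere (8.x) and newer GPUs have hardware bfloat16 support.
    return "bfloat16" if major >= 8 else "float16"

print(pick_dtype((8, 6)))  # e.g. an RTX 3090 -> bfloat16
print(pick_dtype((7, 5)))  # e.g. a Tesla T4 -> float16
```

Leaving `dtype = None` delegates this choice to the library, so the same notebook runs unchanged on a free-tier T4 or a modern workstation GPU.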

6.3 4-bit Quantisation

To further optimise the model's performance, we enable 4-bit quantisation. This technique significantly reduces the memory footprint of the model—by approximately 75%—making it feasible to run on consumer-grade GPUs while still maintaining a high level of model quality. The quantisation process involves converting the model weights to a lower precision format, which allows for faster computations and reduced memory usage without a substantial loss in accuracy. This is particularly advantageous in scenarios where computational resources are limited.

6.4 FastLanguageModel Class

The FastLanguageModel class serves as the backbone for loading and utilising pre-trained models. It is optimised for efficient training and inference of large language models (LLMs). By using this class, we can easily load the specified model and tokeniser, which are essential for processing input data and generating outputs. The class abstracts away many of the complexities involved in model management, allowing developers to focus on higher-level tasks such as fine-tuning and application development.

6.5 Pre-trained Model Selection

Selecting the appropriate pre-trained model is a vital step in the process. In our case, we have chosen the "unsloth/Llama-3.2-3B-Instruct" model, which is tailored for instruction-based tasks. This model has been trained on a diverse dataset, enabling it to understand and generate human-like text effectively. The choice of model can significantly impact the performance of the application, making it essential to consider the specific requirements of the task at hand when making this selection.