07. Adding LoRA Adapters - AmirYunus/finetune_LLM GitHub Wiki

from unsloth import FastLanguageModel  # Unsloth's fast fine-tuning wrapper

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
    use_rslora = False,
    loftq_config = None,
)

7.1 Rank of LoRA Adapters

The rank of LoRA (Low-Rank Adaptation) adapters is a crucial parameter: it sets the inner dimension of the two low-rank update matrices and therefore the number of trainable parameters added during fine-tuning. Here the rank is set to 16, a commonly suggested value for a wide range of tasks. This rank strikes a balance between adaptation capacity and computational efficiency, letting the model adapt effectively without overwhelming the available resources. Higher ranks permit more expressive updates but demand more compute and memory; lower ranks are cheaper but may underfit complex tasks.
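To make the cost of a given rank concrete, note that LoRA replaces a frozen weight W with W + B·A, where A is (r × d_in) and B is (d_out × r), so one adapted linear layer gains r × (d_in + d_out) trainable parameters. The sketch below (not part of the wiki's code; the 4096 dimension is a hypothetical projection size, as in many ~7B models) illustrates the arithmetic:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    A has shape (r, d_in) and B has shape (d_out, r),
    so the total is r * (d_in + d_out).
    """
    return r * d_in + d_out * r

# Hypothetical 4096 x 4096 projection (actual sizes depend on the base model).
full = 4096 * 4096                       # frozen weight parameters
lora = lora_param_count(4096, 4096, r=16)
print(full, lora, lora / full)           # LoRA trains well under 1% of the layer
```

Doubling the rank doubles the adapter's parameter count, which is why r is the main dial for trading capacity against memory.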

7.2 Target Model Components

When applying LoRA adapters, it is essential to specify the target model components to which these adapters will be applied. In the FastLanguageModel, the target modules include various projection layers that play critical roles in the model's architecture. The following is a detailed explanation of each target module:

  • q_proj: This is the query projection layer. It transforms the input data into a query representation that is used in the attention mechanism. The quality of the query representation is crucial for determining how well the model can focus on relevant parts of the input when generating responses.

  • k_proj: The key projection layer is responsible for transforming the input data into key representations. These keys are compared against the query representations to determine the relevance of different parts of the input. The effectiveness of the attention mechanism heavily relies on the quality of the key representations.

  • v_proj: This is the value projection layer, which transforms the input data into value representations. The values are the actual information that will be retrieved based on the attention scores calculated from the queries and keys. Properly projecting the values is essential for ensuring that the model can generate accurate and contextually relevant outputs.

  • o_proj: The output projection layer takes the results of the attention mechanism and transforms them into the final output representation. This layer is crucial for ensuring that the model's outputs are in the correct format and contain the necessary information derived from the input data.

  • gate_proj: The gate projection layer controls the flow of information within the model. It acts as a mechanism to determine how much information should be passed through to the next layers, effectively regulating the model's capacity to learn and adapt based on the input data.

  • up_proj: This layer is responsible for increasing the dimensionality of the input data. It allows the model to expand the representation space, which can be beneficial for capturing more complex patterns and relationships within the data.

  • down_proj: Conversely, the down projection layer reduces the dimensionality of the input data. This is important for compressing the information and making it more manageable for subsequent processing steps. It helps in maintaining a balance between model complexity and computational efficiency.

By selectively applying LoRA to these components, the model can achieve efficient fine-tuning, focusing on the most critical parts of the architecture while minimising the overall computational burden. This targeted approach allows for effective adaptation to specific tasks, enhancing the model's performance without the need to retrain the entire architecture.
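Conceptually, applying LoRA to one of these projections leaves the frozen weight untouched and adds a scaled low-rank correction: y = W·x + (alpha / r)·B·(A·x). The following pure-Python sketch is illustrative only (real implementations such as Unsloth and PEFT use PyTorch tensors and batched operations):

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, lora_alpha=16, r=16):
    """y = W x + (alpha / r) * B (A x).

    W is the frozen (d_out x d_in) weight of e.g. q_proj; only
    A (r x d_in) and B (d_out x r) are trained.  Illustrative
    sketch, not Unsloth's actual implementation.
    """
    base = matvec(W, x)                    # frozen path
    update = matvec(B, matvec(A, x))       # low-rank adapter path
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Tiny example: 2x2 identity weight, rank-1 adapter, alpha = r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]             # (r=1) x (d_in=2)
B = [[0.5], [0.5]]           # (d_out=2) x (r=1)
print(lora_forward(W, A, B, [2.0, 3.0], lora_alpha=1, r=1))  # [4.5, 5.5]
```

Because only A and B receive gradients, the frozen weight never changes, which is what keeps LoRA fine-tuning cheap relative to full retraining.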

7.3 LoRA Alpha

The LoRA alpha parameter serves as a scaling factor for the updates made by the LoRA adapters. In the FastLanguageModel, this value is set to 16, which controls the contribution of the LoRA parameters to the overall model output. A well-chosen alpha value can enhance the model's performance by ensuring that the updates from the LoRA adapters are appropriately weighted in relation to the original model parameters. This balance is vital for maintaining the integrity of the model's learned representations while allowing for effective adaptation to new tasks.
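In standard LoRA the adapter update is multiplied by alpha / r before being added to the frozen layer's output, so with alpha = 16 and r = 16 the update passes through unscaled. A minimal illustration:

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Standard LoRA scales the adapter update by alpha / r."""
    return lora_alpha / r

# With the wiki's settings (alpha = 16, r = 16) the update is unscaled.
print(lora_scaling(16, 16))  # 1.0
# Doubling alpha while keeping r fixed doubles the adapter's influence.
print(lora_scaling(32, 16))  # 2.0
```

This coupling is why alpha is often set equal to (or a small multiple of) the rank: changing r alone would otherwise silently rescale the adapter's contribution.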

7.4 LoRA Dropout

Dropout is a regularisation technique used to prevent overfitting during training. Here the dropout rate is set to 0, meaning no dropout is applied to the LoRA layers; this setting is optimised for performance, allowing the model to leverage all adapter parameters during training. While dropout can be beneficial when overfitting is a concern, disabling it here reflects a focus on maximising the model's capacity to learn from the training data without introducing additional noise.

7.5 Bias

The handling of bias in LoRA layers is another important consideration. In our example, the bias is set to "none", which is optimised for performance. This means that the LoRA adapters do not introduce additional bias terms, simplifying the model's architecture and potentially improving training efficiency. However, it is essential to note that other options for bias handling are available, and the choice should be guided by the specific requirements of the task at hand.

7.6 Gradient Checkpointing

Gradient checkpointing is a memory-saving technique that allows for larger batch sizes during training by reducing the memory footprint of the model. In this example, the option for gradient checkpointing is set to "unsloth", which is specifically optimised to use 30% less VRAM compared to standard methods. This optimisation is particularly beneficial for training large models on consumer-grade GPUs, as it enables more efficient use of available resources while maintaining model performance.

7.7 Random State

Setting a random state is crucial for ensuring reproducibility in machine learning experiments. In this example, the random state is fixed at 42, which allows for consistent results across different runs of the training process. This practice is essential for validating the effectiveness of the model and for comparing results across various configurations and datasets.
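The principle is generic: seeding a pseudo-random generator makes its draws repeatable across runs. The standard-library sketch below illustrates the idea (Unsloth's `random_state` seeds the framework's own RNGs rather than Python's `random` module):

```python
import random

def sample_run(seed: int, n: int = 5):
    """Draw n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Two runs with the same seed produce identical draws...
assert sample_run(42) == sample_run(42)
# ...while a different seed yields a different sequence.
assert sample_run(42) != sample_run(7)
```

Fixing the seed pins down weight initialisation of the adapters and data shuffling, which is what makes configuration comparisons meaningful.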

7.8 Rank Stabilised LoRA (RSLoRA)

Rank Stabilised LoRA (RSLoRA) is an advanced technique designed to maintain stability during the training of LoRA adapters. In our case, the option to use RSLoRA is set to False, meaning the standard LoRA approach is employed. While RSLoRA can provide benefits in certain scenarios, particularly at higher ranks, the choice to disable it reflects a preference for simplicity and the expectation that the standard approach is sufficient for the tasks being addressed.
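The practical difference is the scaling rule: rank-stabilised LoRA divides the adapter update by the square root of the rank rather than the rank itself, so the update's magnitude decays far more slowly as r grows. A small comparison (illustrative arithmetic, not library code):

```python
import math

def standard_scaling(alpha: float, r: int) -> float:
    """Standard LoRA: scale the adapter update by alpha / r."""
    return alpha / r

def rslora_scaling(alpha: float, r: int) -> float:
    """Rank-stabilised LoRA: scale by alpha / sqrt(r) instead."""
    return alpha / math.sqrt(r)

for r in (16, 64, 256):
    print(r, standard_scaling(16, r), rslora_scaling(16, r))
# As r grows, the standard scale shrinks towards zero while the
# rank-stabilised scale decays much more slowly.
```

At the modest rank used here (r = 16) the two rules differ little in practice, which supports leaving `use_rslora = False`; the gap widens mainly at large ranks.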

7.9 LoftQ Configuration

The LoftQ configuration parameter allows for additional customisation of the LoRA adapters. In this case, this parameter is set to None, which means that the model will not apply any specific LoftQ configurations during training. This choice may be appropriate for many use cases, but users should consider their specific requirements and the potential benefits of customising this parameter based on their training objectives.