11. Save the LoRA Adapter - AmirYunus/finetune_LLM GitHub Wiki
The following lines of code are responsible for saving the fine-tuned model and tokeniser to a local directory. This is essential for preserving the changes made during the training process, allowing for later use without needing to retrain the model from scratch.
```python
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```

Output:

```
('lora_model/tokenizer_config.json', 'lora_model/special_tokens_map.json', 'lora_model/tokenizer.json')
```
This code saves the model's weights and configuration to the specified directory "lora_model". It is important to note that this method saves only the fine-tuning changes (the LoRA adapter weights), not the complete model. Therefore, to utilise these saved weights later, the original base model must be available.
Additionally, the tokeniser's configuration and vocabulary are saved to the same directory "lora_model". The tokeniser is crucial for converting text to tokens and vice versa, ensuring that the model can properly interpret input data and generate output in a human-readable format.
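To see why the adapter directory is so much smaller than a full checkpoint, consider the parameter counts involved. The sketch below uses hypothetical figures (a hidden size and LoRA rank not taken from this notebook) to compare a full dense weight update against the two low-rank matrices LoRA stores instead:

```python
# Illustrative only: parameter counts for a full weight update versus a
# LoRA update of rank r. The dimensions below are assumed, not from the notebook.
d = 4096   # hidden size of one linear layer (hypothetical)
r = 16     # LoRA rank (hypothetical)

full_update_params = d * d            # a dense d x d update matrix
lora_update_params = d * r + r * d    # LoRA factors A (d x r) and B (r x d)

print(full_update_params)                        # 16777216
print(lora_update_params)                        # 131072
print(full_update_params // lora_update_params)  # 128
```

For these (assumed) dimensions, the adapter stores 128 times fewer parameters per layer than a full update would, which is why only the small adapter files are written to disk while the base model must be kept separately.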
```python
from transformers import TextStreamer  # streams tokens to stdout as they are generated

def generate_response(content, model, tokenizer):
    messages = [{"role": "user", "content": content}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    _ = model.generate(
        input_ids=inputs,
        streamer=text_streamer,
        max_new_tokens=128,
        use_cache=True,
        temperature=1.5,
        min_p=0.1,
    )
```

This function generates a response from the model based on the provided user input. It takes three parameters: content, the input text from the user; model, the language model used for generating responses; and tokenizer, which converts text into tokens for the model.
The function begins by creating a list of messages that includes the user's input. The role indicates the type of message (in this case, from the user), and the content contains the actual text input. Next, the input for the model is prepared by applying the chat template using the tokeniser. This process includes tokenisation and formatting of the input for the model.
The apply_chat_template method of the tokeniser is called with parameters that specify tokenisation, the addition of a generation prompt, and the return of inputs as PyTorch tensors. The resulting tensors are then moved to the GPU for faster processing.
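To illustrate what a chat template does, here is a simplified, hand-written approximation of the Llama-3-style format (the special-token markers match the `<|eot_id|>` tokens visible in the outputs below). The real template comes from the tokeniser's configuration and may differ in detail; this is a sketch, not the actual implementation:

```python
# Illustrative only: a simplified Llama-3-style chat template. The real
# template is defined in the tokeniser's configuration.
def format_chat(messages, add_generation_prompt=True):
    text = "<|begin_of_text|>"
    for m in messages:
        text += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write its reply next.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

prompt = format_chat([{"role": "user", "content": "Hello!"}])
print(prompt)
```

Note how add_generation_prompt appends an empty assistant header at the end: the model then continues the text from that point, which is exactly what the generation prompt in apply_chat_template achieves.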
A TextStreamer instance is initialised to allow for real-time output visualisation, enabling the generated text to be streamed as it is produced by the model. The model's generate method is then called to produce a response based on the prepared input, limited to a maximum of 128 new tokens, with caching enabled to speed up generation. Two sampling parameters shape the output: temperature=1.5 increases randomness by flattening the token distribution, while min_p=0.1 discards candidate tokens whose probability is below 10% of the most likely token's probability.
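The effect of min_p can be sketched in plain Python on a toy next-token distribution (illustrative numbers, not actual model output): tokens whose probability falls below min_p times the top token's probability are filtered out before sampling.

```python
# Illustrative only: min_p filtering on a toy next-token distribution.
probs = {"Paris": 0.60, "London": 0.25, "tower": 0.10, "zebra": 0.05}
min_p = 0.1

# Keep only tokens whose probability is at least min_p * P(top token).
threshold = min_p * max(probs.values())   # 0.1 * 0.60 = 0.06
kept = {tok: p for tok, p in probs.items() if p >= threshold}
print(kept)   # {'Paris': 0.6, 'London': 0.25, 'tower': 0.1}
```

Unlike a fixed probability cutoff, the threshold scales with the model's confidence: when the top token is very likely, more low-probability alternatives are pruned, which keeps a high temperature like 1.5 from producing incoherent text.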
Finally, an example usage of the generate_response function is provided, demonstrating how to call the function with a specific prompt about a tall tower in the capital of France.
```python
generate_response("Describe a tall tower in the capital of France.", model, tokenizer)
```

Example output:

```
There is no specific mention of a "tall tower" in the capital of France, as there are numerous towers throughout the country.<|eot_id|>
```
Here, we demonstrate how to load a pre-trained language model that has been fine-tuned using Low-Rank Adaptation (LoRA), prepare it for inference, and generate a response based on a user prompt.
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
```

The FastLanguageModel.from_pretrained function is utilised to load the model and tokeniser. The model_name parameter specifies the name of the model to be loaded, which in this case is "lora_model". The max_seq_length parameter sets the maximum number of tokens that the model can process in a single input, ensuring that inputs do not exceed this length. The dtype parameter defines the data type for the model's parameters, which can impact performance and memory usage. Additionally, the load_in_4bit flag allows the model to be loaded in 4-bit precision, significantly reducing memory requirements while maintaining performance.
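To make the memory saving from 4-bit loading concrete, here is a back-of-the-envelope calculation. The 8-billion-parameter figure is an assumption for illustration, and the result covers weight storage only (activations, the KV cache, and optimiser state add more):

```python
# Illustrative only: approximate weight-storage memory for a hypothetical
# 8B-parameter model at different precisions.
params = 8_000_000_000

fp16_gb = params * 2 / 1024**3     # 16-bit floats: 2 bytes per parameter
int4_gb = params * 0.5 / 1024**3   # 4-bit quantisation: 0.5 bytes per parameter

print(round(fp16_gb, 1))   # 14.9
print(round(int4_gb, 1))   # 3.7
```

Roughly a 4x reduction in weight memory, which is what makes it possible to fit such a model on a single consumer GPU.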
```python
FastLanguageModel.for_inference(model)
```

After loading the model, the for_inference method is called on the FastLanguageModel class. This method optimises the loaded model for inference, enabling it to run at twice the speed compared to its standard mode. This optimisation is particularly beneficial for real-time applications where quick responses are essential.
```python
generate_response("Describe a tall tower in the capital of France.", model, tokenizer)
```

Example output:

```
There are many tall towers in various capitals, but I think you may be referring to the Eiffel Tower in Paris, the capital of France. The Eiffel Tower is a 324-meter-tall (1,063 ft) iron lattice tower built for the 1889 World's Fair. It is a UNESCO World Heritage site and is considered an iconic symbol of France and Paris. The tower is used for observation, broadcasting, and other purposes, and is a popular tourist attraction.<|eot_id|>
```
The generate_response function is then invoked with a specific prompt asking for a description of a tall tower in the capital of France. This function leverages the loaded model and tokeniser to process the input prompt and generate a coherent response based on the model's training.
The output of this function call provides a detailed description of the Eiffel Tower, highlighting its significance as an iconic symbol of France and its historical context. This example illustrates the effectiveness of the LoRA adapter in enhancing the model's ability to generate contextually relevant and informative responses.