10. Inference - AmirYunus/finetune_LLM GitHub Wiki
10.1 Set Model for Inference
The following line enables a faster inference mode for the FastLanguageModel. This mode is optimised to run inference at up to twice the speed of the standard mode. Calling the for_inference method on the FastLanguageModel class with the model as an argument prepares the model for efficient inference, which is particularly useful when generating responses or predictions in real-time applications.
FastLanguageModel.for_inference(model)
10.2 Create an Input for Inference
The following code defines a list of messages to be sent to the model for processing. Each message is represented as a dictionary with two keys: 'role' and 'content'. The 'role' indicates who is sending the message (in this case, the user), and the 'content' contains the actual text of the message.
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
Here, the user is asking the model to continue the Fibonacci sequence, which is a well-known mathematical series where each number is the sum of the two preceding ones. The sequence starts with 1, 1, 2, 3, 5, 8, and the user is requesting the next number(s) in the series.
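The continuation the model is expected to produce can be verified with a short standalone snippet (not part of the inference pipeline, just a sanity check):

```python
def continue_fibonacci(seq, n_more):
    """Extend a Fibonacci-style sequence by n_more terms."""
    seq = list(seq)
    for _ in range(n_more):
        # Each new term is the sum of the two preceding ones.
        seq.append(seq[-1] + seq[-2])
    return seq

print(continue_fibonacci([1, 1, 2, 3, 5, 8], 3))  # [1, 1, 2, 3, 5, 8, 13, 21, 34]
```

This matches the example output shown later, so a correct model response should begin with 13, 21, 34.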
10.3 Prepare the Input for the Model
The following line of code prepares the input for the model by applying a chat template to the messages. This preparation is crucial for ensuring that the model can effectively understand and process the input data.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")
The tokenizer.apply_chat_template(...) method is called on the tokeniser object, which is responsible for converting text into a format that the model can understand. This method applies the model's chat formatting template to the messages, ensuring they are structured correctly for the model's input requirements.
The parameters used in this method are essential for the input preparation. The messages parameter is the list of chat messages, each structured as a dictionary with a key indicating the role (e.g., "user" or "assistant") and a key holding the content of the message.
The tokenize=True flag indicates that the messages should be tokenised, which is the process of converting text into tokens (numerical representations) that the model can process. By setting this to True, the tokeniser will break down the messages into the appropriate tokens.
The add_generation_prompt=True parameter appends the tokens that mark the start of the assistant's turn, signalling to the model that it should generate a response to the provided messages. Without this prompt, the model may simply continue the user's text rather than produce a reply.
The return_tensors="pt" specifies that the output of the apply_chat_template method should be returned as PyTorch tensors, which is necessary for compatibility with the model.
Finally, the .to("cuda") method call moves the resulting tensors to the GPU, significantly speeding up inference. Note that this call assumes a CUDA device is available and will raise an error otherwise. Utilising the GPU for processing is essential for real-time applications where quick responses are necessary.
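Conceptually, the chat template turns the list of role/content dictionaries into a single formatted string before tokenisation. The exact template is model-specific; the sketch below uses a hypothetical `<|role|>` tag format purely to illustrate the role markers and the trailing generation prompt:

```python
def format_chat(messages, add_generation_prompt=True):
    """Toy illustration of chat templating (real templates are model-specific)."""
    text = ""
    for msg in messages:
        # Wrap each turn in a role marker followed by its content.
        text += f"<|{msg['role']}|>\n{msg['content']}\n"
    if add_generation_prompt:
        # Cue the model that the assistant's turn comes next.
        text += "<|assistant|>\n"
    return text

prompt = format_chat([{"role": "user", "content": "Continue: 1, 1, 2, 3, 5, 8,"}])
print(prompt)
```

The real apply_chat_template additionally tokenises this string (tokenize=True) and returns PyTorch tensors (return_tensors="pt").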
10.4 Create a TextStreamer Object for Real-Time Output
The TextStreamer class is responsible for managing the output of the model as it generates text. This class allows for a more interactive experience by streaming the output in real-time, meaning that tokens are displayed one by one rather than waiting for the entire response to be generated before showing any output.
text_streamer = TextStreamer(
    tokenizer,
    skip_prompt=True
)
The tokeniser is passed as an argument to the TextStreamer. This tokeniser is crucial because it converts the generated tokens (which are numerical representations) back into human-readable text. This conversion is essential for users to understand the output generated by the model.
The skip_prompt parameter is set to True. This indicates that the initial prompt text should be skipped in the output. When this parameter is enabled, the TextStreamer will not display the prompt text that initiated the generation process. Instead, it will focus solely on the content generated by the model, providing a cleaner and more focused output for the user.
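The skip_prompt behaviour can be illustrated with a toy streamer in plain Python (a hypothetical sketch, not the TextStreamer implementation; the real class detects the prompt tokens itself, whereas here the caller flags them explicitly):

```python
class ToyStreamer:
    """Toy illustration of skip_prompt: suppress the prompt, stream the rest."""

    def __init__(self, skip_prompt=True):
        self.skip_prompt = skip_prompt
        self.shown = []  # chunks that would be printed to the user

    def put(self, text, is_prompt=False):
        if is_prompt and self.skip_prompt:
            return  # suppress the prompt text
        self.shown.append(text)  # stream this chunk immediately

streamer = ToyStreamer(skip_prompt=True)
streamer.put("Continue the Fibonacci sequence: ...", is_prompt=True)
streamer.put("13, ")
streamer.put("21, 34")
print("".join(streamer.shown))  # 13, 21, 34
```

With skip_prompt=False, the prompt chunk would be shown as well, which is why enabling it yields cleaner output.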
10.5 Generate Inference from the Model
The following line of code generates outputs from the model using the provided input IDs. This is a crucial step in the inference process, where the model produces predictions based on the input data.
_ = model.generate(
    input_ids = inputs,
    streamer = text_streamer,
    max_new_tokens = 128,
    use_cache = True,
    temperature = 1.5,
    min_p = 0.1
)
Example output:
Here is the continued Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34
The generate method of the model is called with several parameters to control the generation process. The input_ids parameter represents the tokenised messages that the model will process. The streamer parameter is an instance of TextStreamer, which allows for real-time visualisation of the output as it is generated. The max_new_tokens parameter limits the number of new tokens that can be generated in the output, ensuring that the response does not exceed a specified length.
The use_cache flag, when set to True, enables the model to reuse previously computed key/value states, which can significantly speed up the generation process. The temperature parameter controls the randomness of the output; a higher value, such as 1.5, results in more diverse and creative outputs, while a lower value leads to more deterministic and focused responses. Lastly, the min_p parameter sets a probability threshold relative to the most likely token: only tokens whose probability is at least min_p times that of the top token are considered for sampling. These parameter values are based on empirical testing; see:
https://x.com/menhguin/status/1826132708508213629
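How temperature and min_p interact can be sketched in plain Python (an illustrative toy, not the library's internals): temperature rescales the logits before the softmax, and min_p then keeps only tokens whose probability clears a threshold relative to the top token.

```python
import math

def sample_candidates(logits, temperature=1.5, min_p=0.1):
    """Return the token indices that survive temperature scaling and min_p filtering."""
    # Temperature divides the logits: higher values flatten the distribution.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # min_p keeps tokens whose probability is at least min_p times
    # the probability of the most likely token.
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

print(sample_candidates([4.0, 3.0, 0.0, -2.0]))  # [0, 1]
```

With these toy logits, only the two strongest tokens clear the min_p cutoff; raising min_p or lowering temperature shrinks the candidate set further, which is why the combination of a high temperature with min_p gives diverse yet coherent output.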