VSCode Extension for On-Device Code Assistance - Model Integration Guide
QAIRT Code is a VSCode extension that helps developers write code with an AI code assistant. It works with a Large Language Model for chat (Llama v3) deployed on the local device, and the model runs entirely on the NPU through QAIRT.
Generate the Llama 3.1 8B (and optionally Llama 2 7B) models by following the instructions below:
https://github.com/quic/ai-hub-apps/tree/main/tutorials/llm_on_genie
https://github.com/quic/ai-hub-models/tree/main/qai_hub_models/models/llama_v3_1_8b_chat_quantized
Brief commands to run:
Log in to the Hugging Face CLI:
huggingface-cli login
Log in to your AI Hub account:
qai-hub configure --api_token <token>
Install qai_hub_models:
python3.10 -m venv llm_on_genie_venv
source llm_on_genie_venv/bin/activate
pip install -U "qai_hub_models[llama-v3-1-8b-chat-quantized]"
Export the Llama 3.1 models (expected time: ~2 hours):
mkdir -p genie_bundle
python -m qai_hub_models.models.llama_v3_1_8b_chat_quantized.export --device "Snapdragon X Elite CRD" --skip-inferencing --skip-profiling --output-dir genie_bundle
After the context binary generation completes, you will find the following QNN model files:
genie_bundle
|_ llama_v3_1_8b_chat_quantized_part_1_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_2_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_3_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_4_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_5_of_5.bin
Download qairt-code-completion-1.5.2.vsix from the latest release.
Install the extension: open VSCode, navigate to "Extensions" in the left-side menu, click the "..." (three dots) at the top, choose "Install from VSIX...", and select the qairt-code-completion-x.y.z.vsix file.
Copy the model files generated in step #1 to the qairt-code-completion extension folder. Copy all 5 Llama 3.1 8B files to the C:\Users\<USER>\.vscode\extensions\quic.qairt-code-completion-1.5.2\out\server\models\llama-v3p1 folder, as shown below.
llama-v3p1
|_ llama_v3_1_8b_chat_quantized_part_1_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_2_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_3_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_4_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_5_of_5.bin

With our VSCode extension, users can use state-of-the-art Large Language Models (LLMs) directly within their code development environment.
The workflow of the extension is shown in the image. Users select a model and submit queries through the extension's front end, which communicates with the C++ backend via an HTTP API. The backend then connects to the GENIE SDK, enabling large language models to run directly on edge devices.
In our extension, all models run on the NPU, which consumes less power than the CPU. This allows us to utilize powerful LLMs on-device without worrying about battery drain.

QAIRT Code utilizes Qualcomm's Gen AI Inference Extensions (GENIE) to run large language models directly on edge devices, ensuring fast response times and data privacy.
Below is a brief explanation of the backend code, illustrating its flow, key features, and how we use the Genie SDK to run the language models.
- The program begins by reading the model configuration file in JSON format. It utilizes GenieDialogConfig_createFromJson to generate a dialog configuration from this JSON. To create a dialog for a model, a configuration file similar to the one below is required:
{
  "dialog": {
    "version": 1,
    "type": "basic",
    "context": {
      "version": 1,
      "size": 4096,
      "n-vocab": 128256,
      "bos-token": 128006,
      "eos-token": 128009
    },
    "sampler": {
      "version": 1,
      "seed": 42,
      "temp": 0.8,
      "top-k": 40,
      "top-p": 0.95
    },
    "tokenizer": {
      "version": 1,
      "path": "models/llama-v3p1/tokenizer.json"
    },
    "engine": {
      "version": 1,
      "n-threads": 3,
      "backend": {
        "version": 1,
        "type": "QnnHtp",
        "QnnHtp": {
          "version": 1,
          "use-mmap": false,
          "spill-fill-bufsize": 320000000,
          "mmap-budget": 0,
          "rope-theta": 500000,
          "poll": false,
          "pos-id-dim": 64,
          "cpu-mask": "0xe0",
          "kv-dim": 128,
          "allow-async-init": false
        },
        "extensions": "configs/htp_backend_ext_config.json"
      },
      "model": {
        "version": 1,
        "type": "binary",
        "binary": {
          "version": 1,
          "ctx-bins": [
            "models/llama-v3p1/8B-FT/weight_sharing_model_1_of_5.serialized.bin",
            "models/llama-v3p1/8B-FT/weight_sharing_model_2_of_5.serialized.bin",
            "models/llama-v3p1/8B-FT/weight_sharing_model_3_of_5.serialized.bin",
            "models/llama-v3p1/8B-FT/weight_sharing_model_4_of_5.serialized.bin",
            "models/llama-v3p1/8B-FT/weight_sharing_model_5_of_5.serialized.bin"
          ]
        }
      }
    }
  }
}
- Setting Up the HTTP Server: The program initializes an HTTP server using the httplib library and sets up API endpoints to handle incoming requests: /api/health (GET), /api/generate (POST), and /api/generate_stream (POST). (A minimal server sketch follows this list.)
- Streaming Responses: The /api/generate_stream endpoint lets real-time text be sent to the client as it is generated. It resets the dialog and processes the query in chunks using a callback function; the response is streamed piece by piece to the client instead of waiting for the full response.
- Handling Client Cancellations: If the client cancels a request midway, the server detects it and gracefully stops sending data.
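To make this flow concrete, here is a minimal sketch of how such an httplib server could be wired up. It is illustrative only, not the extension's actual implementation: the port number and the runFullQuery/nextStreamChunk helpers are hypothetical stand-ins for the Genie-backed generation logic described in this section.

```cpp
#include <string>
#include "httplib.h"  // cpp-httplib, the HTTP library used by the backend

// Hypothetical stand-ins for the Genie-backed generation logic.
static std::string runFullQuery(const std::string& requestBody) {
    return "generated response for: " + requestBody;
}
static std::string nextStreamChunk() {
    static int remaining = 3;
    return (remaining-- > 0) ? "chunk " : "";  // empty string signals end of generation
}

int main() {
    httplib::Server server;

    // GET /api/health - simple liveness check.
    server.Get("/api/health", [](const httplib::Request&, httplib::Response& res) {
        res.set_content("OK", "text/plain");
    });

    // POST /api/generate - run the whole query, then return the complete response.
    server.Post("/api/generate", [](const httplib::Request& req, httplib::Response& res) {
        res.set_content(runFullQuery(req.body), "text/plain");
    });

    // POST /api/generate_stream - stream the response piece by piece.
    server.Post("/api/generate_stream", [](const httplib::Request&, httplib::Response& res) {
        res.set_chunked_content_provider(
            "text/plain",
            [](size_t /*offset*/, httplib::DataSink& sink) {
                if (!sink.is_writable()) {
                    return false;  // client cancelled mid-way; stop sending
                }
                const std::string chunk = nextStreamChunk();
                if (chunk.empty()) {
                    sink.done();   // generation finished
                    return true;
                }
                sink.write(chunk.data(), chunk.size());
                return true;
            });
    });

    server.listen("localhost", 8080);  // port chosen only for this sketch
    return 0;
}
```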
Several functions from the Genie library are used to manage the lifecycle of the dialog. Here's a brief explanation of Genie library functions we used:
- GenieDialogConfig_createFromJson: Creates a dialog configuration object from a JSON string. The configuration defines parameters and behaviors for the dialog system.
- GenieDialog_create: Initializes a new dialog instance using the created configuration.
- GenieDialog_reset: Resets the dialog to its initial state, clearing any previous context. This is essential to ensure that each user query is handled independently.
- GenieDialog_query: Sends the user's input to the dialog system and retrieves a response.

- GenieDialog_free: Releases the resources allocated for the dialog instance, ensuring efficient memory management.
- GenieDialogConfig_free: Releases the resources allocated for the dialog configuration object.
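Putting these calls together, the sketch below shows the lifecycle the backend follows: build a configuration from JSON, create the dialog, reset it per query, run the query with a streaming callback, then free everything. This is a minimal sketch, not the extension's actual code; the handle types, GENIE_STATUS_SUCCESS, the GENIE_DIALOG_SENTENCE_COMPLETE sentence code, and the callback shape are assumptions based on the Genie SDK's GenieDialog.h header and should be checked against the header shipped with your QAIRT/Genie SDK.

```cpp
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include "GenieDialog.h"  // from the QAIRT/Genie SDK (assumed header name)

// Streaming callback: Genie invokes this with each generated piece of text.
// Signature assumed from GenieDialog_QueryCallback_t in GenieDialog.h.
static void onResponse(const char* response,
                       const GenieDialog_SentenceCode_t sentenceCode,
                       const void* userData) {
    (void)sentenceCode;
    (void)userData;
    std::printf("%s", response);  // in the backend, this is where chunks are streamed to the client
}

int main() {
    // Read the model configuration JSON (see the example config above).
    std::ifstream file("configs/llama3p1-8b/llama3p1-8b-htp-windows.json");
    std::stringstream buffer;
    buffer << file.rdbuf();
    const std::string configJson = buffer.str();

    GenieDialogConfig_Handle_t config = nullptr;
    GenieDialog_Handle_t dialog = nullptr;

    // 1. Create a dialog configuration from the JSON string.
    if (GenieDialogConfig_createFromJson(configJson.c_str(), &config) != GENIE_STATUS_SUCCESS) return 1;

    // 2. Create the dialog instance from the configuration (loads the context binaries).
    if (GenieDialog_create(config, &dialog) != GENIE_STATUS_SUCCESS) return 1;

    // 3. Reset before each independent user query, then run the query with the callback.
    GenieDialog_reset(dialog);
    const std::string query = "...";  // model-specific prompt format
    GenieDialog_query(dialog, query.c_str(), GENIE_DIALOG_SENTENCE_COMPLETE, onResponse, nullptr);

    // 4. Release the dialog and configuration resources.
    GenieDialog_free(dialog);
    GenieDialogConfig_free(config);
    return 0;
}
```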

On the extension side panel, choose your preferred model from the available options.
Currently supported models:
- Llama v2 7b
- Llama v3.1 8b
Additionally, the connection status is shown on the VSCode Status Bar.
To check the connection manually, use the Check Connection button on the side panel.
- Open a New File: Launch VSCode and create a new file.
- Select a Profile: The supported profiles are listed below:
| Profile | Description |
| --- | --- |
| Code Generation | Generates code based on the request you made in the query. |
| Code Refactor | Refactors the existing code to improve its structure and readability. |
| Test Expert | Generates test cases for the code, covering edge cases and more. |
| Code Debugger | Identifies issues or potential errors in the code and suggests fixes. |
- Prepare Input: For the code generation profile, provide a simple query describing the code you want to generate. For other profiles, input the code you want to refactor, debug, or generate test cases for. For example:
write a python code to generate Fibonacci series
- Initiate Generation: Click on the "Generate" button in the QAIRT Code extension panel.
- AI Processing: The model will run using GENIE on your local device, leveraging the power of on-device AI to create the requested code.
- Review and Accept: Once generation is complete, review the suggested code. If you're satisfied, click the "Accept Code" button.
- Code Integration: The accepted code will be automatically inserted into your file, ready for further editing or execution.
- After using the model, click the "Stop Model" button to release NPU memory. If you don't stop the model and VS Code continues running in the background, the NPU memory will remain occupied.
For a clearer understanding of this process, refer to the GIF below, which demonstrates these steps in action:

The following steps guide you through integrating new AI models into the QAIRT Code Extension.
- Visual Studio 2022: Install Visual Studio 2022, adding the necessary Arm64 components during installation.
- Node.js: Install Node.js from https://nodejs.org/en/download.
- Visual Studio Code installed.
- Access to Qualcomm's AI Hub models.
- Basic familiarity with C++/TypeScript development.
- CMake & npm installed for rebuilding.
1.1 Source the Model
Verify model availability on AI Hub Models Directory.
1.2 Convert Model
Follow the AI Hub conversion guide as outlined earlier in this tutorial under "How to Install Extension > Step 1".
Place the .bin files in the installed extension path (default path):
C:\Users\<USER>\.vscode\extensions\quic.qairt-code-completion-<VERSION>\out\server\models\New_model_name\
2.1 Download Configuration
- Get the config file for the respective model from the Genie Repository.
- Store it in the extension config directory:
C:\Users\<USER>\.vscode\extensions\quic.qairt-code-completion-<VERSION>\out\server\configs\New_model_name\
2.2 Update Paths
- In the config, add the model paths. Include the paths for all of the model chunks, and make sure the chunks are listed in sequence.
For example:
"model": {
"version": 1,
"type": "binary",
"binary": {
"version": 1,
"ctx-bins": [
"models/llama-v3p1/8B-FT/weight_sharing_model_1_of_5.serialized.bin",
"models/llama-v3p1/8B-FT/weight_sharing_model_2_of_5.serialized.bin",
"models/llama-v3p1/8B-FT/weight_sharing_model_3_of_5.serialized.bin",
"models/llama-v3p1/8B-FT/weight_sharing_model_4_of_5.serialized.bin",
"models/llama-v3p1/8B-FT/weight_sharing_model_5_of_5.serialized.bin"
]
}
- Add tokenizer location as shown below:
"tokenizer": {
"version": 1,
"path": "models/llama-v3p1/tokenizer.json"
}
- Add any other model-specific parameters (optional).
git clone https://github.com/quic/wos-ai-plugins.git
File: plugins/vscode/qualcomm-code-gen/server/src/qcom-code-gen-server.cpp
// change 1: In line 34, add a new enum entry to the `Model` list
enum Model {
    LLAMA2_7B,
    LLAMA3_1_8B,
    New_model_name
};
// change 2: At line 50, update the `modelEnumToStringMap` dictionary
std::map<Model, std::string> modelEnumToStringMap = {
    {LLAMA2_7B, "llama-v2-7b"},
    {LLAMA3_1_8B, "llama-v3.1-8b"},
    {New_model_name, "new_model_string"}
};
// change 3: At line 71, in the getModelConfig function, configure the model config path
std::string getModelConfig(Model model) {
    if (model == LLAMA2_7B) {
        return "./configs/llama2-7b/llama2-7b-htp-windows.json";
    }
    else if (model == LLAMA3_1_8B) {
        return "./configs/llama3p1-8b/llama3p1-8b-htp-windows.json";
    }
    else if (model == New_model_name) {
        return "./new_model_json_path";
    }
}
// change 4: At line 329, the get_query function stores the query format string for each model.
// You can find the new model's query format in the tokenizer.json file on Hugging Face, under the "Files and versions" section.
std::string get_query(Model model, std::string system_prompt, std::string prompt) {
    if (model == LLAMA3_1_8B) {
        return std::format("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{}\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>", system_prompt, prompt);
    }
    else if (model == New_model_name) {
        return std::format("New model query format supported by GENIE");
    }
    return std::format("<s>[INST]\n<<SYS>>\n{}\n<</SYS>>\n\n{}[/INST]", system_prompt, prompt);
}
// change 5: Document the new model support in the printUsage() function (line 127)
std::cout << std::setw(width) << " --model model";
std::cout << "Model to use. Possible values: llama-v2-7b, llama-v3.1-8b, New_model_name\n" << std::endl;
File 1: plugins\vscode\qualcomm-code-gen\package.json
// change 1: At line 187, add New_model_name to the model list
"enum": ["existing_models", "New_model_name"]
File 2: plugins\vscode\qualcomm-code-gen\shared\model.ts
// change 1: Update the ModelId and ModelName enums with the new model
enum ModelId {
    LLAMA_CHAT_V2_7B = "llama-v2-7b",
    LLAMA_CHAT_V3_1_8B = "llama-v3.1-8b",
    New_model = "new_model_id"
}
export enum ModelName {
    LLAMA_CHAT_V2_7B = "llama-v2-7b",
    LLAMA_CHAT_V3_1_8B = "llama-v3.1-8b",
    New_model = "new_model_id"
}
// change 2: Update the MODEL_NAME_TO_ID_MAP dictionary
export const MODEL_NAME_TO_ID_MAP: Record<ModelName, ModelId> = {
    [ModelName.LLAMA_CHAT_V2_7B]: ModelId.LLAMA_CHAT_V2_7B,
    [ModelName.LLAMA_CHAT_V3_1_8B]: ModelId.LLAMA_CHAT_V3_1_8B,
    [ModelName.New_model]: ModelId.New_model
};
// change 3: Configure supported features
export const MODEL_SUPPORTED_FEATURES: Record<ModelName, Features[]> = {
    [ModelName.LLAMA_CHAT_V2_7B]: [Features.CODE_COMPLETION],
    [ModelName.LLAMA_CHAT_V3_1_8B]: [Features.CODE_COMPLETION, Features.FIM],
    [ModelName.New_model]: [/* new model feature list */]
};
File 3: plugins\vscode\qualcomm-code-gen\side-panel-ui\src\components\sections\ServerSection\ModelSelect\ModelSelect.tsx
// change 1: Add the new option to the model selection component
const options: SelectOptionProps<ModelName>[] = [
    { value: ModelName.LLAMA_CHAT_V2_7B },
    { value: ModelName.LLAMA_CHAT_V3_1_8B },
    { value: ModelName.New_model }
];
Note: Maintain model naming consistency between the backend and frontend code updates.
Create a QAIRT Code-Gen VSCode extension by running the build script:
./build.ps1 -qnn_sdk_root "C:\Qualcomm\AIStack\QAIRT\2.28.2.241116"
You will find the qairt-code-completion-1.5.2.vsix extension file in the project directory, which can be installed in VSCode.
- Install the rebuilt VSIX package in VSCode.
- Restart VSCode after installation.
- Verify that the new model appears in the UI and check its functionality to ensure it can generate output.
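In addition to the UI check, you can exercise the backend's HTTP API directly once the extension has started the server. The sketch below is illustrative only: the port and the JSON body format are assumptions, not the extension's documented protocol.

```cpp
#include <iostream>
#include <string>
#include "httplib.h"  // cpp-httplib

int main() {
    // Port is an assumption; use whatever port the extension's backend listens on.
    httplib::Client client("localhost", 8080);

    // Liveness check against /api/health.
    auto health = client.Get("/api/health");
    if (!health) {
        std::cout << "backend not reachable" << std::endl;
        return 1;
    }
    std::cout << "health: " << health->status << " " << health->body << std::endl;

    // Exercise /api/generate with the new model. The body format here is illustrative only.
    const std::string body = R"({"model": "new_model_string", "prompt": "write a hello world program"})";
    if (auto res = client.Post("/api/generate", body, "application/json")) {
        std::cout << "generate: " << res->status << "\n" << res->body << std::endl;
    }
    return 0;
}
```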
Current performance metrics:
---
config:
  xyChart:
    width: 500
    height: 200
    tickLength: 2
    showTitle: False
---
xychart-beta horizontal
    title "Model Performance Comparison"
    x-axis ["Llama v3.1 8b", "Llama v2 7b"]
    y-axis "Tokens/Second" 1 --> 15
    bar [10, 12]
These rates ensure responsive code assistance while maintaining the benefits of on-device processing.