VSCode Extension for On-Device Code Assistance - Model Integration Guide
QAIRT Code is a VSCode extension that helps developers write code with an AI code assistant. It works with a Large Language Model for chat (Llama v3) deployed on the local device, and the model runs entirely on the NPU through QAIRT.
Generate the Llama 3.1 8B (and optionally Llama 2 7B) models by following the instructions below:
https://github.com/quic/ai-hub-apps/tree/main/tutorials/llm_on_genie
https://github.com/quic/ai-hub-models/tree/main/qai_hub_models/models/llama_v3_1_8b_chat_quantized
Brief commands to run:
Log in to the Hugging Face CLI:
huggingface-cli login
Log in to your AI Hub account:
qai-hub configure --api_token <token>
Install qai_hub_models:
python3.10 -m venv llm_on_genie_venv
source llm_on_genie_venv/bin/activate
pip install -U "qai_hub_models[llama-v3-1-8b-chat-quantized]"
Export the Llama 3.1 models (expected time: ~2 hours):
mkdir -p genie_bundle
python -m qai_hub_models.models.llama_v3_1_8b_chat_quantized.export --device "Snapdragon X Elite CRD" --skip-inferencing --skip-profiling --output-dir genie_bundle
After the context binary generation completes, you will find the following QNN model files:
genie_bundle
|_ llama_v3_1_8b_chat_quantized_part_1_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_2_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_3_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_4_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_5_of_5.bin
Download qairt-code-completion-1.5.2.vsix from the latest release.
Install the extension: open VSCode, navigate to "Extensions" in the left-side menu, click the "..." (three dots) at the top, choose "Install from VSIX...", and select the qairt-code-completion-x.y.z.vsix file.
Copy the model files generated in step #1 to the qairt-code-completion extension folder. Copy all 5 Llama 3.1 8B files to the C:\Users\<USER>\.vscode\extensions\quic.qairt-code-completion-1.5.2\out\server\models\llama-v3p1 folder, as shown below.
llama-v3p1
|_ llama_v3_1_8b_chat_quantized_part_1_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_2_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_3_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_4_of_5.bin
|_ llama_v3_1_8b_chat_quantized_part_5_of_5.bin

With our VSCode extension, users can use state-of-the-art Large Language Models (LLMs) directly within their code development environment.
The workflow of the extension is shown in the image. Users select a model and submit queries through the extension's front end, which communicates with the C++ backend via an HTTP API. The backend then connects to the GENIE SDK, enabling large language models to run directly on edge devices.
In our extension, all models run on the NPU, which consumes less power than the CPU. This allows us to utilize powerful LLMs on-device without worrying about battery drain.

QAIRT Code utilizes Qualcomm's Gen AI Inference Extensions (GENIE) to run large language models directly on edge devices, ensuring fast response times and data privacy.
Below is a brief explanation of the backend code, illustrating its flow, key features, and how we use the Genie SDK to run the language models.
- The program begins by reading the model configuration file in JSON format. It utilizes GenieDialogConfig_createFromJson to generate a dialog configuration from this JSON. To create a dialog for a model, a configuration file similar to the one below is required:
{
  "dialog": {
    "version": 1,
    "type": "basic",
    "context": {
      "version": 1,
      "size": 4096,
      "n-vocab": 128256,
      "bos-token": 128006,
      "eos-token": 128009
    },
    "sampler": {
      "version": 1,
      "seed": 42,
      "temp": 0.8,
      "top-k": 40,
      "top-p": 0.95
    },
    "tokenizer": {
      "version": 1,
      "path": "models/llama-v3p1/tokenizer.json"
    },
    "engine": {
      "version": 1,
      "n-threads": 3,
      "backend": {
        "version": 1,
        "type": "QnnHtp",
        "QnnHtp": {
          "version": 1,
          "use-mmap": false,
          "spill-fill-bufsize": 320000000,
          "mmap-budget": 0,
          "rope-theta": 500000,
          "poll": false,
          "pos-id-dim": 64,
          "cpu-mask": "0xe0",
          "kv-dim": 128,
          "allow-async-init": false
        },
        "extensions": "configs/htp_backend_ext_config.json"
      },
      "model": {
        "version": 1,
        "type": "binary",
        "binary": {
          "version": 1,
          "ctx-bins": [
            "models/llama-v3p1/8B-FT/weight_sharing_model_1_of_5.serialized.bin",
            "models/llama-v3p1/8B-FT/weight_sharing_model_2_of_5.serialized.bin",
            "models/llama-v3p1/8B-FT/weight_sharing_model_3_of_5.serialized.bin",
            "models/llama-v3p1/8B-FT/weight_sharing_model_4_of_5.serialized.bin",
            "models/llama-v3p1/8B-FT/weight_sharing_model_5_of_5.serialized.bin"
          ]
        }
      }
    }
  }
}
- Setting Up the HTTP Server: The program initializes an HTTP server using the httplib library and sets up API endpoints to handle incoming requests: /api/health (GET), /api/generate (POST), and /api/generate_stream (POST). (A minimal server sketch follows this list.)
- Streaming Responses: The /api/generate_stream endpoint lets real-time text be sent to the client as it is generated. It resets the dialog and processes the query in chunks using a callback function; the response is streamed piece by piece to the client instead of waiting for the full response.
- Handling Client Cancellations: If the client cancels a request midway, the server detects it and gracefully stops sending data.
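To make this flow concrete, here is a minimal sketch of how such an httplib server could be wired up. It is illustrative only, not the extension's actual implementation: the port number and the runFullQuery/nextStreamChunk helpers are hypothetical stand-ins for the Genie-backed generation logic described in this section.

```cpp
#include <string>
#include "httplib.h"  // cpp-httplib, the HTTP library used by the backend

// Hypothetical stand-ins for the Genie-backed generation logic.
static std::string runFullQuery(const std::string& requestBody) {
    return "generated response for: " + requestBody;
}
static std::string nextStreamChunk() {
    static int remaining = 3;
    return (remaining-- > 0) ? "chunk " : "";  // empty string signals end of generation
}

int main() {
    httplib::Server server;

    // GET /api/health - simple liveness check.
    server.Get("/api/health", [](const httplib::Request&, httplib::Response& res) {
        res.set_content("OK", "text/plain");
    });

    // POST /api/generate - run the whole query, then return the complete response.
    server.Post("/api/generate", [](const httplib::Request& req, httplib::Response& res) {
        res.set_content(runFullQuery(req.body), "text/plain");
    });

    // POST /api/generate_stream - stream the response piece by piece.
    server.Post("/api/generate_stream", [](const httplib::Request&, httplib::Response& res) {
        res.set_chunked_content_provider(
            "text/plain",
            [](size_t /*offset*/, httplib::DataSink& sink) {
                if (!sink.is_writable()) {
                    return false;  // client cancelled mid-way; stop sending
                }
                const std::string chunk = nextStreamChunk();
                if (chunk.empty()) {
                    sink.done();   // generation finished
                    return true;
                }
                sink.write(chunk.data(), chunk.size());
                return true;
            });
    });

    server.listen("localhost", 8080);  // port chosen only for this sketch
    return 0;
}
```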
Several functions from the Genie library are used to manage the lifecycle of the dialog. Here's a brief explanation of Genie library functions we used:
- GenieDialogConfig_createFromJson: Creates a dialog configuration object from a JSON string. The configuration defines parameters and behaviors for the dialog system.
- GenieDialog_create: Initializes a new dialog instance using the created configuration.
- GenieDialog_reset: Resets the dialog to its initial state, clearing any previous context. This is essential to ensure that each user query is handled independently.
- GenieDialog_query: Sends the user's input to the dialog system and retrieves a response.

- GenieDialog_free: Releases the resources allocated for the dialog instance, ensuring efficient memory management.
- GenieDialogConfig_free: Releases the resources allocated for the dialog configuration object.
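Putting these calls together, the sketch below shows the lifecycle the backend follows: build a configuration from JSON, create the dialog, reset it per query, run the query with a streaming callback, then free everything. This is a minimal sketch, not the extension's actual code; the handle types, GENIE_STATUS_SUCCESS, the GENIE_DIALOG_SENTENCE_COMPLETE sentence code, and the callback shape are assumptions based on the Genie SDK's GenieDialog.h header and should be checked against the header shipped with your QAIRT/Genie SDK.

```cpp
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include "GenieDialog.h"  // from the QAIRT/Genie SDK (assumed header name)

// Streaming callback: Genie invokes this with each generated piece of text.
// Signature assumed from GenieDialog_QueryCallback_t in GenieDialog.h.
static void onResponse(const char* response,
                       const GenieDialog_SentenceCode_t sentenceCode,
                       const void* userData) {
    (void)sentenceCode;
    (void)userData;
    std::printf("%s", response);  // in the backend, this is where chunks are streamed to the client
}

int main() {
    // Read the model configuration JSON (see the example config above).
    std::ifstream file("configs/llama3p1-8b/llama3p1-8b-htp-windows.json");
    std::stringstream buffer;
    buffer << file.rdbuf();
    const std::string configJson = buffer.str();

    GenieDialogConfig_Handle_t config = nullptr;
    GenieDialog_Handle_t dialog = nullptr;

    // 1. Create a dialog configuration from the JSON string.
    if (GenieDialogConfig_createFromJson(configJson.c_str(), &config) != GENIE_STATUS_SUCCESS) return 1;

    // 2. Create the dialog instance from the configuration (loads the context binaries).
    if (GenieDialog_create(config, &dialog) != GENIE_STATUS_SUCCESS) return 1;

    // 3. Reset before each independent user query, then run the query with the callback.
    GenieDialog_reset(dialog);
    const std::string query = "...";  // model-specific prompt format
    GenieDialog_query(dialog, query.c_str(), GENIE_DIALOG_SENTENCE_COMPLETE, onResponse, nullptr);

    // 4. Release the dialog and configuration resources.
    GenieDialog_free(dialog);
    GenieDialogConfig_free(config);
    return 0;
}
```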

On the extension side panel, choose your preferred model from the available options.
Currently supported models:
- Llama v2 7b
- Llama v3.1 8b
Additionally, the connection status is shown on the VSCode Status Bar.
To check the connection manually, use the Check Connection button on the side panel.
- Open a New File: Launch VSCode and create a new file.
- Select a Profile: The supported profiles are listed below:
| Profile | Description |
| --- | --- |
| Code Generation | Generates code based on the request you made in the query. |
| Code Refactor | Refactors the existing code to improve its structure and readability. |
| Test Expert | Generates test cases for the code, covering edge cases and more. |
| Code Debugger | Identifies issues or potential errors in the code and suggests fixes. |
- Prepare Input: For the code generation profile, provide a simple query describing the code you want to generate. For other profiles, input the code you want to refactor, debug, or generate test cases for. For example:
write a python code to generate Fibonacci series
- Initiate Generation: Click on the "Generate" button in the QAIRT Code extension panel.
- AI Processing: The model will run using GENIE on your local device, leveraging the power of on-device AI to create the requested code.
- Review and Accept: Once generation is complete, review the suggested code. If you're satisfied, click the "Accept Code" button.
- Code Integration: The accepted code will be automatically inserted into your file, ready for further editing or execution.
- After using the model, click the "Stop Model" button to release NPU memory. If you don't stop the model and VS Code continues running in the background, the NPU memory will remain occupied.
For a clearer understanding of this process, refer to the GIF below, which demonstrates these steps in action:

The following steps guide you through integrating new AI models into the QAIRT Code Extension.
- Visual Studio 2022: Install Visual Studio 2022, adding the necessary Arm64 components during installation.
- Node.js: Install Node.js from https://nodejs.org/en/download.
- Visual Studio Code installed.
- Access to Qualcomm's AI Hub models.
- Basic familiarity with C++/TypeScript development.
- CMake & npm installed for rebuilding.
1.1 Source the Model
Verify model availability on AI Hub Models Directory.
1.2 Convert Model
Follow the AI Hub conversion guide as outlined earlier in this tutorial under "How to Install Extension > Step 1".
Place the .bin files in the installed extension path (default path):
C:\Users\<USER>\.vscode\extensions\quic.qairt-code-completion-<VERSION>\out\server\models\New_model_name\
2.1 Download Configuration
- Get the config file for the respective model from the Genie Repository.
- Store it in the extension config directory:
C:\Users\<USER>\.vscode\extensions\quic.qairt-code-completion-<VERSION>\out\server\configs\New_model_name\
2.2 Update Paths
- In the config, add the model paths. Include the paths for all of the model chunks, and make sure the chunks are listed in sequence.
For example:
"model": {
"version": 1,
"type": "binary",
"binary": {
"version": 1,
"ctx-bins": [
"models/llama-v3p1/8B-FT/weight_sharing_model_1_of_5.serialized.bin",
"models/llama-v3p1/8B-FT/weight_sharing_model_2_of_5.serialized.bin",
"models/llama-v3p1/8B-FT/weight_sharing_model_3_of_5.serialized.bin",
"models/llama-v3p1/8B-FT/weight_sharing_model_4_of_5.serialized.bin",
"models/llama-v3p1/8B-FT/weight_sharing_model_5_of_5.serialized.bin"
]
}
- Add tokenizer location as shown below:
"tokenizer": {
"version": 1,
"path": "models/llama-v3p1/tokenizer.json"
}
- Add any other model-specific parameters (optional).
git clone https://github.com/quic/wos-ai-plugins.git
File: plugins/vscode/qualcomm-code-gen/server/src/qcom-code-gen-server.cpp
// change 1: In line 34, add a new enum entry to the `Model` list
enum Model {
    LLAMA2_7B,
    LLAMA3_1_8B,
    New_model_name
};
// change 2: At line 50, update the `modelEnumToStringMap` dictionary
std::map<Model, std::string> modelEnumToStringMap = {
    {LLAMA2_7B, "llama-v2-7b"},
    {LLAMA3_1_8B, "llama-v3.1-8b"},
    {New_model_name, "new_model_string"}
};
// change 3: At line 71, in the getModelConfig function, configure the model config path
std::string getModelConfig(Model model) {
    if (model == LLAMA2_7B) {
        return "./configs/llama2-7b/llama2-7b-htp-windows.json";
    }
    else if (model == LLAMA3_1_8B) {
        return "./configs/llama3p1-8b/llama3p1-8b-htp-windows.json";
    }
    else if (model == New_model_name) {
        return "./new_model_json_path";
    }
}
// change 4: At line 329, the get_query function stores the query format string for each model.
// You can find the new model's query format in the tokenizer.json file on Hugging Face, under the "Files and versions" section.
std::string get_query(Model model, std::string system_prompt, std::string prompt) {
    if (model == LLAMA3_1_8B) {
        return std::format("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{}\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>", system_prompt, prompt);
    }
    else if (model == New_model_name) {
        return std::format("New model query format supported by GENIE");
    }
    return std::format("<s>[INST]\n<<SYS>>\n{}\n<</SYS>>\n\n{}[/INST]", system_prompt, prompt);
}
// change 5: Document the new model support in the printUsage() function (line 127)
std::cout << std::setw(width) << " --model model";
std::cout << "Model to use. Possible values: llama-v2-7b, llama-v3.1-8b, New_model_name\n" << std::endl;
File 1: plugins\vscode\qualcomm-code-gen\package.json
// change 1: At line 187, add New_model_name to the model list
"enum": ["existing_models", "New_model_name"]
File 2: plugins\vscode\qualcomm-code-gen\shared\model.ts
// change 1: Update the ModelId and ModelName enums with the new model
enum ModelId {
    LLAMA_CHAT_V2_7B = "llama-v2-7b",
    LLAMA_CHAT_V3_1_8B = "llama-v3.1-8b",
    New_model = "new_model_id"
}
export enum ModelName {
    LLAMA_CHAT_V2_7B = "llama-v2-7b",
    LLAMA_CHAT_V3_1_8B = "llama-v3.1-8b",
    New_model = "new_model_id"
}
// change 2: Update the MODEL_NAME_TO_ID_MAP dictionary
export const MODEL_NAME_TO_ID_MAP: Record<ModelName, ModelId> = {
    [ModelName.LLAMA_CHAT_V2_7B]: ModelId.LLAMA_CHAT_V2_7B,
    [ModelName.LLAMA_CHAT_V3_1_8B]: ModelId.LLAMA_CHAT_V3_1_8B,
    [ModelName.New_model]: ModelId.New_model
};
// change 3: Configure supported features
export const MODEL_SUPPORTED_FEATURES: Record<ModelName, Features[]> = {
    [ModelName.LLAMA_CHAT_V2_7B]: [Features.CODE_COMPLETION],
    [ModelName.LLAMA_CHAT_V3_1_8B]: [Features.CODE_COMPLETION, Features.FIM],
    [ModelName.New_model]: [/* new model feature list */]
};
File 3: plugins\vscode\qualcomm-code-gen\side-panel-ui\src\components\sections\ServerSection\ModelSelect\ModelSelect.tsx
// change 1: Add the new option to the model selection component
const options: SelectOptionProps<ModelName>[] = [
    { value: ModelName.LLAMA_CHAT_V2_7B },
    { value: ModelName.LLAMA_CHAT_V3_1_8B },
    { value: ModelName.New_model }
];
Note: Maintain model naming consistency between the backend and frontend code updates.
Create a QAIRT Code-Gen VSCode extension by running the build script:
./build.ps1 -qnn_sdk_root "C:\Qualcomm\AIStack\QAIRT\2.28.2.241116"
You will find the qairt-code-completion-1.5.2.vsix extension file in the project directory, which can be installed in VSCode.
- Install the rebuilt VSIX package in VSCode.
- Restart VSCode after installation.
- Verify that the new model appears in the UI and check its functionality to ensure it can generate output.
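In addition to the UI check, you can exercise the backend's HTTP API directly once the extension has started the server. The sketch below is illustrative only: the port and the JSON body format are assumptions, not the extension's documented protocol.

```cpp
#include <iostream>
#include <string>
#include "httplib.h"  // cpp-httplib

int main() {
    // Port is an assumption; use whatever port the extension's backend listens on.
    httplib::Client client("localhost", 8080);

    // Liveness check against /api/health.
    auto health = client.Get("/api/health");
    if (!health) {
        std::cout << "backend not reachable" << std::endl;
        return 1;
    }
    std::cout << "health: " << health->status << " " << health->body << std::endl;

    // Exercise /api/generate with the new model. The body format here is illustrative only.
    const std::string body = R"({"model": "new_model_string", "prompt": "write a hello world program"})";
    if (auto res = client.Post("/api/generate", body, "application/json")) {
        std::cout << "generate: " << res->status << "\n" << res->body << std::endl;
    }
    return 0;
}
```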
Current performance metrics:
---
config:
  xyChart:
    width: 500
    height: 200
    tickLength: 2
    showTitle: False
---
xychart-beta horizontal
    title "Model Performance Comparison"
    x-axis ["Llama v3.1 8b", "Llama v2 7b"]
    y-axis "Tokens/Second" 1 --> 15
    bar [10, 12]
These rates ensure responsive code assistance while maintaining the benefits of on-device processing.