Milvus - DrAlzahraniProjects/csusb_fall2024_cse6550_team2 GitHub Wiki

Contents

  1. Installation
  2. Configuration
  3. Implementation
  4. Usage
  5. Troubleshooting

Note: This documentation accurately describes the Milvus Lite setup, collection management, and troubleshooting steps. Descriptions of vector types, embeddings, and connection handling are correct. Periodically verify that code snippets and links, such as Milvus Lite Documentation, correspond to the latest Milvus versions and documentation specifics, as these may be updated.

1. Installation

  • Purpose of Milvus Lite: Milvus Lite is integrated into our academic chatbot to enable fast similarity search on user queries, such as finding relevant articles or resources based on vector embeddings. This ensures users receive accurate responses from a large collection of educational data with minimal latency.
  • About Milvus Lite: Milvus Lite is the lightweight version of Milvus, an open-source vector database that powers AI applications with vector embeddings and similarity search.
  • Requirements:
    • Python: Ensure you have Python installed.
    • Virtual Environment: Set up a virtual environment for your project to manage dependencies.

2. Configuration

Using Docker to Install Packages

  • Purpose: This Dockerfile snippet installs essential Python libraries using Mamba and Pip.

  • Installed Libraries:

    • pymilvus: For connecting to Milvus.
    • langchain: For managing workflows.
    • streamlit: For UI support.

2.1 Installation Commands

Step 1: Installing Dependencies with Mamba

The mamba package manager is used to handle dependencies efficiently. Use the following command to install all the necessary packages listed in requirements.txt and clean up afterwards:

RUN mamba install --yes --file requirements.txt && mamba clean --all -f -y

Step 2: Installing Python Libraries with Pip

Use pip to install additional required Python libraries for the project. Below is the command to install all the necessary libraries:

RUN pip install pymilvus[model] langchain langchain_community langchain_huggingface langchain_milvus beautifulsoup4 requests nltk langchain_mistralai sentence-transformers scipy

List of Installed Libraries

  • pymilvus[model]
    For working with Milvus, a vector database.

  • langchain
    For building language model applications.

  • langchain_community
    Additional LangChain components.

  • langchain_huggingface
    Hugging Face integrations for LangChain.

  • langchain_milvus
    Integration between LangChain and Milvus.

  • beautifulsoup4
    For web scraping and parsing HTML/XML documents.

  • requests
    For handling HTTP requests.

  • nltk
    The Natural Language Toolkit for text processing.

  • langchain_mistralai
    Integrations for Mistral AI.

  • sentence-transformers
    For encoding sentences into embeddings.

  • scipy
    For scientific and numerical computing routines.

3. Implementation

Milvus Implementations in Chatbot

The chatbot system uses two distinct implementations of Milvus to retrieve and process data based on user queries. Together they balance performance and flexibility while keeping search results relevant.

3.1. Milvus Lite Integration (Main App)

In the main application, Milvus Lite is used in conjunction with LangChain to enable efficient search and retrieval of data from a pre-existing collection. The integration provides a seamless connection to the Milvus instance, allowing the system to directly search within the collection and retrieve the best similarity results for a given query. This implementation ensures fast and lightweight search capabilities within the core application, making it ideal for handling user interactions in real-time.

3.2. Milvus Hybrid Search (Jupyter Implementation)

In the Jupyter environment, a more advanced setup is used where a Milvus index is first created, followed by the implementation of Milvus Hybrid Search. This approach enables the system to perform more complex queries by combining vector search with traditional keyword-based search, yielding highly accurate and relevant results. The hybrid search implementation provides a deeper level of flexibility and customization, allowing for more sophisticated processing of the data to optimize the search outcomes.

By employing these two complementary Milvus configurations, the chatbot is able to handle a variety of queries with high efficiency and accuracy, ensuring an enhanced user experience.

  • Start Milvus Lite: Launch the Milvus Database and connect to it using Python.
  • Create Collection: Define a schema for your wiki articles and create a collection.
  • Insert, Search, and Query Data: Add documents to the collection and retrieve them based on criteria.
  • Close Connection: Once finished, close the connection.

4. Usage

Connect using PyMilvus

  • Setting up Milvus Lite Environment:

    • Creates a directory named milvus_lite to store local database files.
  • Defining the Connection:

    • Sets the MILVUS_URI to point to a local database file, enabling local storage for Milvus.
  • Connecting to Milvus:

    • Uses the initialize_milvus function, which connects to Milvus with the pymilvus connections.connect method.
    • Connects via the specified MILVUS_URI to interact with the vector database stored in this local file.
  • Purpose of This Setup:

    • Provides a lightweight Milvus environment, ideal for development and testing.
    • Allows users to work with Milvus without needing a full server deployment.
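The connection steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `milvus_lite_uri` and `connect_to_milvus` are hypothetical, and joining the `milvus_lite` directory with a `milvus_vector.db` file name is an assumption. A URI that is a local `.db` file path is what makes pymilvus use Milvus Lite.

```python
import os


def milvus_lite_uri(base_dir: str = "milvus_lite",
                    filename: str = "milvus_vector.db") -> str:
    """Create the local storage directory and return the database file path.

    The directory/file layout here is an assumption made for illustration.
    """
    os.makedirs(base_dir, exist_ok=True)
    return os.path.join(base_dir, filename)


def connect_to_milvus(uri: str, alias: str = "default") -> None:
    """Open a pymilvus connection to the local Milvus Lite database file."""
    # Imported lazily so the path helper above can be used on its own.
    from pymilvus import connections

    connections.connect(alias=alias, uri=uri)
```

A typical call sequence would be `MILVUS_URI = milvus_lite_uri()` followed by `connect_to_milvus(MILVUS_URI)`.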

Screenshot

The screenshot below shows a Python code snippet that contains the following elements:

corpus_source Variable:

The variable corpus_source is assigned the value "https://www.csusb.edu", which appears to be the base URL for the corpus of data being scraped or accessed.

start_url Variable:

The start_url is dynamically generated by formatting the corpus_source with the /cse path, resulting in the value "https://www.csusb.edu/cse". This could be used as the starting URL for crawling or scraping content from the CSUSB website's computer science and engineering department pages.

MILVUS_URI Variable:

The variable MILVUS_URI is set to "milvus_vector.db", which likely refers to the database or storage location used for Milvus to store vector data.
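As a plain-text stand-in for the screenshot, the three assignments described above might look like this. Only the values come from the description; the use of an f-string to build `start_url` is an assumption.

```python
# Base site the corpus is collected from.
corpus_source = "https://www.csusb.edu"

# Starting page for the crawl: the base URL plus the /cse path.
# (The exact formatting call is an assumption; only the result is documented.)
start_url = f"{corpus_source}/cse"

# Local file Milvus Lite uses to store vector data.
MILVUS_URI = "milvus_vector.db"
```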


Create Collection

  • Collection Creation and Loading:

    • A collection named Academic_Webpages is loaded if it already exists; otherwise, it is created.
  • Defining the Schema:

    • The schema includes:
      • Primary Key: doc_id
      • Dense and Sparse Vectors: For embedding data
      • Text Field: For storing textual content
    • CollectionSchema defines each field’s data type, length, and properties to ensure proper structure for storing academic webpage data.
  • Instantiating the Collection:

    • A new collection instance is created with strong consistency to maintain data reliability.
    • A confirmation message is printed once the collection is set up.

Screenshots

Initializing Milvus for Main App

The screenshot shows a Python function, initialize_milvus(data), which appears to handle the initialization of a Milvus collection and insert data into it.

This function is responsible for:

  • Creating or replacing a Milvus collection with a schema that includes fields for id, embedding, text_content, and url.
  • Creating an index on the embedding field for efficient similarity search.
  • Encoding text data into vector embeddings using a sentence transformer model.
  • Inserting the embeddings, along with the text and URL, into the collection.
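The four responsibilities listed above could be sketched roughly as below. This is an illustration under stated assumptions, not the screenshot's code: the collection name `main_app_collection`, the `FLAT` index with the `IP` metric, the `max_length` values, the `encode` callable (for example a sentence transformer's `encode` method), and the helper `to_columns` are all hypothetical choices.

```python
from typing import Callable, Dict, List, Sequence


def to_columns(data: Sequence[Dict[str, str]],
               embeddings: Sequence[Sequence[float]]) -> List[list]:
    """Arrange rows as columns in schema order, skipping the auto-generated id."""
    if len(data) != len(embeddings):
        raise ValueError("each document needs exactly one embedding")
    return [
        [[float(x) for x in vec] for vec in embeddings],  # embedding column
        [doc["text_content"] for doc in data],            # text_content column
        [doc["url"] for doc in data],                     # url column
    ]


def initialize_milvus(data, encode: Callable, dim: int,
                      uri: str = "milvus_vector.db",
                      name: str = "main_app_collection"):
    """Create (or replace) the collection, index it, and insert encoded data."""
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections, utility)

    connections.connect(alias="default", uri=uri)
    if utility.has_collection(name):
        utility.drop_collection(name)  # "replace" modeled as drop and recreate

    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
        FieldSchema(name="text_content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="url", dtype=DataType.VARCHAR, max_length=2048),
    ]
    collection = Collection(name=name, schema=CollectionSchema(fields))

    # Index the embedding field so similarity search can run on it.
    collection.create_index(
        field_name="embedding",
        index_params={"index_type": "FLAT", "metric_type": "IP", "params": {}},
    )

    # Encode the text, then insert embeddings with their text and URL.
    embeddings = encode([doc["text_content"] for doc in data])
    collection.insert(to_columns(data, embeddings))
    collection.flush()
    return collection
```

Here `data` is assumed to be a list of dicts with `text_content` and `url` keys, and `dim` must equal the output dimension of the embedding model.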

Initializing Milvus in Jupyter Notebook

Screenshot 1: Collection Name Definition

This screenshot defines a variable collection_name with the value Academic_Webpages. This variable represents the name of the collection where data related to academic webpages will be stored in the Milvus database.


Screenshot 2: Check and Load Existing Collection

In this section of the code, the script checks whether the collection Academic_Webpages already exists in the Milvus database using utility.has_collection(collection_name).

If the collection exists, it loads the collection using Collection(name=collection_name) and prints a message indicating that the collection already exists. The function then returns the existing collection object.


Screenshot 3: Create New Collection Schema and Insert Data

Here, if the collection Academic_Webpages doesn't exist, the script creates a new collection.

It prints a message indicating the creation of the collection. The schema is defined with several fields:

  • pk_field: Primary key field (doc_id), which is of type VARCHAR.
  • dense_field: A dense vector field of type FLOAT_VECTOR with a defined dimension (dense_dim).
  • sparse_field: A sparse vector field of type SPARSE_FLOAT_VECTOR.
  • text_field: A text field of type VARCHAR for storing webpage text.
  • The schema is passed to CollectionSchema, and the collection is created using the Collection class with the defined schema and consistency level.
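Screenshots 2 and 3 together describe a load-or-create pattern, which might be sketched as follows. It assumes a connection is already open. The field names `dense_vector` and `sparse_vector` follow the search section of this page; the text field name `text`, the `max_length` values, `auto_id=True`, and the function name `load_or_create_collection` are assumptions.

```python
# Field names used by the sketch; "text" is an assumed name.
FIELD_NAMES = {
    "pk": "doc_id",
    "dense": "dense_vector",
    "sparse": "sparse_vector",
    "text": "text",
}


def load_or_create_collection(dense_dim: int,
                              collection_name: str = "Academic_Webpages"):
    """Return the existing collection, or create it with the hybrid schema."""
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, utility)

    if utility.has_collection(collection_name):
        print(f"Collection '{collection_name}' already exists.")
        return Collection(name=collection_name)

    print(f"Creating collection '{collection_name}'.")
    fields = [
        FieldSchema(name=FIELD_NAMES["pk"], dtype=DataType.VARCHAR,
                    is_primary=True, auto_id=True, max_length=100),
        FieldSchema(name=FIELD_NAMES["dense"], dtype=DataType.FLOAT_VECTOR,
                    dim=dense_dim),
        # Sparse vector fields are declared without a dimension.
        FieldSchema(name=FIELD_NAMES["sparse"], dtype=DataType.SPARSE_FLOAT_VECTOR),
        FieldSchema(name=FIELD_NAMES["text"], dtype=DataType.VARCHAR,
                    max_length=65535),
    ]
    schema = CollectionSchema(fields=fields)
    return Collection(name=collection_name, schema=schema,
                      consistency_level="Strong")
```

`dense_dim` has no default because it must match the output size of whichever dense embedding model is in use.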


Create Index

  • Schema and Collection Creation:

    • Initializes a schema and a collection named Academic_Webpages.
    • Sets a strong consistency level to ensure reliable data handling.
  • Index Definition and Application:

    • Dense Vector Field: Configured with a FLAT index type.
    • Sparse Vector Field: Configured with a SPARSE_INVERTED_INDEX.
    • Both indexes use the IP (Inner Product) metric for similarity search.
  • Index Confirmation:

    • A confirmation message is displayed once indexes are applied to their respective fields.
  • Persisting Changes:

    • The collection is flushed to save all updates, ensuring data persistence.
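The index step described above can be sketched like this, assuming `collection` is a pymilvus `Collection` with the hybrid schema. The function names and the default field names `dense_vector` and `sparse_vector` are illustrative assumptions; the index types and the metric are the ones listed in the bullets.

```python
def hybrid_index_params():
    """Return the (dense, sparse) index settings described above."""
    dense_index = {"index_type": "FLAT", "metric_type": "IP"}
    sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
    return dense_index, sparse_index


def apply_indexes(collection,
                  dense_field: str = "dense_vector",
                  sparse_field: str = "sparse_vector") -> None:
    """Create one index per vector field, then flush the collection."""
    dense_index, sparse_index = hybrid_index_params()
    collection.create_index(field_name=dense_field, index_params=dense_index)
    collection.create_index(field_name=sparse_field, index_params=sparse_index)
    print(f"Indexes created on '{dense_field}' and '{sparse_field}'.")
    collection.flush()  # persist pending changes
```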

Dense Vector Collection


from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

# Define the schema for your collection
fields = [
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),  # Dimensionality of the dense vector
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
]
schema = CollectionSchema(fields)

# Create collection (requires an open connection, see "Connect using PyMilvus")
collection = Collection(name="dense_vector_collection", schema=schema)

# Create the index
index_params = {
    "index_type": "HNSW",  # Other options include "FLAT", "IVF_FLAT", and "IVF_PQ"
    "metric_type": "L2",   # or "IP" (inner product), which matches cosine similarity only for normalized vectors
    "params": {"M": 16, "efConstruction": 200}  # HNSW-specific parameters
}
collection.create_index(field_name="embedding", index_params=index_params)

Sparse Vector Collection


from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

# Define schema for sparse vectors
fields = [
    FieldSchema(name="sparse_embedding", dtype=DataType.SPARSE_FLOAT_VECTOR),  # Sparse vector fields take no dim
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
]
schema = CollectionSchema(fields)

# Create collection
collection = Collection(name="sparse_vector_collection", schema=schema)

# Create index for sparse vector
index_params = {
    "index_type": "SPARSE_INVERTED_INDEX",  # Sparse fields need a sparse index type, not IVF_FLAT
    "metric_type": "IP",                    # Inner product, as in the Create Index section
    "params": {}
}

collection.create_index(field_name="sparse_embedding", index_params=index_params)

Insert Vectors into Collection

  • Initializing Entities List:

    • Starts with an empty list called entities to store vector data.
  • Generating Vectors:

    • For each item in text_contents, dense and sparse vectors are generated using embedding functions.
    • These vectors are appended to the entities list.
  • Checking for Existing Data:

    • Evaluates collection.num_entities to determine if the collection already contains data.
    • If the collection is empty, the new entities are inserted.
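    • If data already exists, insertion is skipped and a message is printed to indicate no insertion was necessary.

# `collection` is the sparse_vector_collection created earlier; `sparse_vectors` is
# assumed to be a list of sparse embeddings (e.g. dicts mapping index to value).
# Generate corresponding IDs for the vectors
ids = list(range(len(sparse_vectors)))

# Prepare the data to be inserted; columns must follow the schema's field order
data = [
    sparse_vectors,
    ids
]

# Insert the sparse vectors into the collection
collection.insert(data)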
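The snippet above does not show the `num_entities` check from the bullets. A minimal sketch of that flow follows, assuming `dense_embed` and `sparse_embed` are callables that turn one text into one vector. Those callables, the function name `insert_if_empty`, and the default field names are hypothetical.

```python
def insert_if_empty(collection, text_contents, dense_embed, sparse_embed,
                    dense_field: str = "dense_vector",
                    sparse_field: str = "sparse_vector",
                    text_field: str = "text") -> int:
    """Build entities for every text and insert them only into an empty collection.

    Returns the number of inserted entities (0 when insertion is skipped).
    """
    entities = []
    for text in text_contents:
        entities.append({
            dense_field: dense_embed(text),
            sparse_field: sparse_embed(text),
            text_field: text,
        })

    if collection.num_entities == 0:
        collection.insert(entities)  # row-based insert: one dict per entity
        collection.flush()
        return len(entities)

    print("Collection already contains data; skipping insertion.")
    return 0
```

The sketch follows the order given in the bullets. Checking `num_entities` before generating vectors would avoid computing embeddings that are never inserted.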

Load Collection

  • Loading the Collection into Memory:

    • The collection.load() function loads the specified collection into memory in Milvus.
  • Purpose:

    • This step is essential for optimizing search performance.
    • Ensures data is readily accessible in memory for fast and efficient query execution.

collection.load() Usage

The collection.load() function loads a collection into memory so that Milvus can query it efficiently. Call it before running a search.

collection.load()

Search using Milvus Hybrid Search

  • Hybrid Search Definition:

    • The retriever function performs a hybrid search using Milvus, combining dense and sparse embeddings for improved retrieval accuracy.
  • Preprocessing and Initialization:

    • Preprocesses text_contents and initializes embedding functions:
      • Dense Embeddings: Generated with Hugging Face models.
      • Sparse Embeddings: Generated with BM25 embeddings.
  • Configuring Search Parameters:

    • After loading the Milvus collection, configures search parameters:
      • Dense Vector Field: dense_vector with IP (Inner Product) metric.
      • Sparse Vector Field: sparse_vector with IP (Inner Product) metric.
  • Instantiating the Hybrid Retriever:

    • The MilvusCollectionHybridSearchRetriever is instantiated with:
      • Specified fields, embeddings, search parameters, and a weighted ranking scheme.
    • Combines dense and sparse results to return the configured hybrid retriever.
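A sketch of how such a retriever might be assembled is shown below. It is based on the `langchain_milvus` hybrid retriever interface and is not the project's `retriever()` function: the name `build_retriever`, the equal weights, `top_k=3`, and the text field name `text` are assumptions. The Hugging Face model name is left as a required argument because the page does not state it, and the preprocessing of `text_contents` is omitted.

```python
def hybrid_search_params():
    """Return (dense, sparse) search parameters, both using the IP metric."""
    dense_search_params = {"metric_type": "IP", "params": {}}
    sparse_search_params = {"metric_type": "IP", "params": {}}
    return dense_search_params, sparse_search_params


def build_retriever(collection, text_contents, model_name: str,
                    top_k: int = 3, weights=(0.5, 0.5)):
    """Combine dense and sparse search over one Milvus collection."""
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever
    from langchain_milvus.utils.sparse import BM25SparseEmbedding
    from pymilvus import WeightedRanker

    # Dense embeddings from a Hugging Face model, sparse ones from BM25.
    dense_embedding = HuggingFaceEmbeddings(model_name=model_name)
    sparse_embedding = BM25SparseEmbedding(corpus=text_contents)

    dense_params, sparse_params = hybrid_search_params()
    collection.load()  # the collection must be in memory before searching

    return MilvusCollectionHybridSearchRetriever(
        collection=collection,
        rerank=WeightedRanker(*weights),  # weighted ranking of both result lists
        anns_fields=["dense_vector", "sparse_vector"],
        field_embeddings=[dense_embedding, sparse_embedding],
        field_search_params=[dense_params, sparse_params],
        top_k=top_k,
        text_field="text",
    )
```

The BM25 embedding should be fitted on the same corpus that was used when the sparse vectors were inserted, otherwise query and document vectors will not be comparable.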

Screenshot

Milvus Hybrid search

The code snippet in the screenshot below defines the retriever() function, which is responsible for initializing the necessary components to retrieve relevant data using both dense and sparse vectors from a Milvus collection.


Milvus Collection Search

The code snippet defines the function retrieve_context() which is responsible for retrieving relevant context from a Milvus collection based on a query embedding.
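A function of this kind could look roughly like the sketch below. The field names `embedding`, `text_content`, and `url` follow the main-app schema described earlier; the `IP` metric, the `top_k` default, and the shape of the returned dicts are assumptions. The metric passed at search time has to match the one the index was built with.

```python
def retrieve_context(collection, query_embedding, top_k: int = 3):
    """Return the top_k most similar entries for one query embedding."""
    results = collection.search(
        data=[query_embedding],              # one query vector
        anns_field="embedding",              # vector field to search
        param={"metric_type": "IP", "params": {}},
        limit=top_k,
        output_fields=["text_content", "url"],
    )
    # results[0] holds the hits for the first (and only) query vector.
    return [
        {
            "text": hit.entity.get("text_content"),
            "url": hit.entity.get("url"),
            "score": hit.distance,
        }
        for hit in results[0]
    ]
```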


Close Connection

  • Disconnecting from Milvus Server:

    • This code snippet demonstrates how to disconnect from a Milvus server using pymilvus.
  • Importing the Connections Module:

    • Imports the connections module from pymilvus, which manages connections to Milvus instances.
  • Disconnecting the Active Connection:

    • Uses connections.disconnect("default") to disconnect the active connection labeled "default".
    • This safely closes the session with the Milvus server, releasing resources and ending communication.

from pymilvus import connections


# Close the connection
connections.disconnect("default")

5. Troubleshooting

  • Connection Error:

    • If you encounter a ConnectionError, ensure that:
      • The URI is correct.
      • Milvus Lite is running.
  • Vector Type Mismatch Error:

    • This error occurs when vector types differ between the query and the collection schema.
    • Solution: Ensure that the vector type in your query matches the collection schema (e.g., both should be FLOAT_VECTOR or both SPARSE_FLOAT_VECTOR).

    This error appears when the vector types do not match while inserting documents into the collection.

  • Embedding Errors:

    • If embeddings are not generated correctly for inputs like "Hi," consider:
      • Validating the query input.
      • Using a fallback response for unsupported inputs.
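One way to apply both suggestions is to validate the query before it reaches the embedding step. The sketch below is illustrative only: the two-word threshold, the fallback message, and the names `is_searchable` and `answer` are assumptions, and `retrieve` stands for whatever retrieval function the application uses.

```python
import re

# Hypothetical fallback text shown when a query is too short to search on.
FALLBACK_RESPONSE = (
    "I can help with questions about CSE at CSUSB. "
    "Could you ask a more specific question?"
)


def is_searchable(query: str, min_words: int = 2) -> bool:
    """Treat a query as searchable only if it has at least min_words words."""
    words = re.findall(r"[A-Za-z0-9]+", query or "")
    return len(words) >= min_words


def answer(query: str, retrieve):
    """Return the fallback for unsupported input, else the retrieval result."""
    if not is_searchable(query):
        return FALLBACK_RESPONSE
    return retrieve(query)
```

With these settings, an input such as "Hi," contains a single word and therefore receives the fallback response instead of being embedded.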