Milvus - DrAlzahraniProjects/csusb_fall2024_cse6550_team2 GitHub Wiki
Note: This documentation accurately describes the Milvus Lite setup, collection management, and troubleshooting steps. Descriptions of vector types, embeddings, and connection handling are correct. Periodically verify that code snippets and links, such as Milvus Lite Documentation, correspond to the latest Milvus versions and documentation specifics, as these may be updated.
- Purpose of Milvus Lite: Milvus Lite is integrated into our academic chatbot to enable fast similarity search on user queries, such as finding relevant articles or resources based on vector embeddings. This ensures users receive accurate responses from a large collection of educational data with minimal latency.
- About Milvus Lite: Milvus Lite is the lightweight version of Milvus, an open-source vector database that powers AI applications with vector embeddings and similarity search.
- Requirements:
  - Python: Ensure you have Python installed.
  - Virtual Environment: Set up a virtual environment for your project to manage dependencies.
- Purpose: This Dockerfile snippet installs essential Python libraries using Mamba and Pip.
- Installed Libraries:
  - pymilvus: For connecting to Milvus.
  - langchain: For managing workflows.
  - streamlit: For UI support.
- The mamba package manager is used to handle dependencies efficiently. Use the following command to install all the necessary packages listed in requirements.txt and clean up afterwards:
RUN mamba install --yes --file requirements.txt && mamba clean --all -f -y
Use pip to install additional required Python libraries for the project. Below is the command to install all the necessary libraries:
RUN pip install pymilvus[model] langchain langchain_community langchain_huggingface langchain_milvus beautifulsoup4 requests nltk langchain_mistralai sentence-transformers scipy
- pymilvus[model]: For working with Milvus, a vector database.
- langchain: For building language model applications.
- langchain_community: Additional LangChain components.
- langchain_huggingface: Hugging Face integrations for LangChain.
- langchain_milvus: Integration between LangChain and Milvus.
- beautifulsoup4: For web scraping and parsing HTML/XML documents.
- requests: For handling HTTP requests.
- nltk: The Natural Language Toolkit for text processing.
- langchain_mistralai: Integrations for Mistral AI.
- sentence-transformers: For encoding sentences into embeddings.
Milvus Implementations in Chatbot
The chatbot uses two distinct Milvus implementations to retrieve and process data for user queries: one in the main application and one in the Jupyter environment.
In the main application, Milvus Lite is used with LangChain to search a pre-existing collection. The integration connects to the Milvus instance, searches the collection directly, and returns the most similar results for a given query. This keeps search fast and lightweight, which suits real-time user interactions.
In the Jupyter environment, a more advanced setup is used: a Milvus index is created first, and Milvus Hybrid Search is then built on top of it. Hybrid search combines dense vector search with sparse, keyword-based search, which supports more complex queries and allows finer control over how results are ranked.
Together, these two configurations let the chatbot handle a variety of queries efficiently and accurately.
- Start Milvus Lite: Launch the Milvus Database and connect to it using Python.
- Create Collection: Define a schema for your wiki articles and create a collection.
- Insert, Search, and Query Data: Add documents to the collection and retrieve them based on criteria.
- Close Connection: Once finished, close the connection.
- Setting up Milvus Lite Environment:
  - Creates a directory named milvus_lite to store local database files.
- Defining the Connection:
  - Sets the MILVUS_URI to point to a local database file, enabling local storage for Milvus.
- Connecting to Milvus:
  - Uses the initialize_milvus function, which connects to Milvus with the pymilvus connections.connect method.
  - Connects via the specified MILVUS_URI to interact with the vector database stored in this local file.
- Purpose of This Setup:
  - Provides a lightweight Milvus environment, ideal for development and testing.
  - Allows users to work with Milvus without needing a full server deployment.
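The setup steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names build_milvus_uri and connect_to_milvus are hypothetical, and placing the database file inside the milvus_lite directory is an assumption. The only pymilvus call used is connections.connect, which treats a local path ending in .db as a Milvus Lite database.

```python
import os

# Names taken from the text: the "milvus_lite" directory and the
# "milvus_vector.db" file. Joining them into one path is an assumption.
MILVUS_DIR = "milvus_lite"
DB_FILE = "milvus_vector.db"


def build_milvus_uri(base_dir: str = MILVUS_DIR, db_file: str = DB_FILE) -> str:
    """Create the local storage directory and return the database file path."""
    os.makedirs(base_dir, exist_ok=True)
    return os.path.join(base_dir, db_file)


def connect_to_milvus(uri: str, alias: str = "default") -> None:
    """Connect with pymilvus; a local path ending in .db selects Milvus Lite."""
    # Imported here so the path helper above works without pymilvus installed.
    from pymilvus import connections

    connections.connect(alias=alias, uri=uri)
```

A typical call sequence would be MILVUS_URI = build_milvus_uri() followed by connect_to_milvus(MILVUS_URI).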
Screenshot
The screenshot below shows a Python code snippet that contains the following elements:
- corpus_source variable: Assigned the value "https://www.csusb.edu", which appears to be the base URL for the corpus of data being scraped or accessed.
- start_url variable: Generated by appending the /cse path to corpus_source, resulting in "https://www.csusb.edu/cse". This could be used as the starting URL for crawling the CSUSB Computer Science and Engineering department pages.
- MILVUS_URI variable: Set to "milvus_vector.db", which likely refers to the local file Milvus uses to store vector data.
- Collection Creation and Loading:
  - A collection named Academic_Webpages is loaded if it already exists; otherwise, it is created.
- Defining the Schema:
  - The schema includes:
    - Primary Key: doc_id
    - Dense and Sparse Vectors: For embedding data
    - Text Field: For storing textual content
  - CollectionSchema defines each field's data type, length, and properties to ensure proper structure for storing academic webpage data.
- Instantiating the Collection:
  - A new collection instance is created with strong consistency to maintain data reliability.
  - A confirmation message is printed once the collection is set up.
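A load-or-create routine along these lines might look like the sketch below. It assumes an active pymilvus connection; the function name get_or_create_collection, the vector and text field names, and the max_length values are illustrative assumptions rather than the project's confirmed settings. Only doc_id, the field types, and the strong consistency level come from this page.

```python
def get_or_create_collection(name: str, dense_dim: int, max_text_length: int = 65535):
    """Return the existing collection, or create it with a four-field schema."""
    # Imported lazily; requires pymilvus and an active connection.
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, utility

    if utility.has_collection(name):
        print(f"Collection '{name}' already exists; loading it.")
        return Collection(name=name)

    print(f"Creating collection '{name}'.")
    fields = [
        # Primary key supplied by the caller (max_length of 100 is an assumption).
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, is_primary=True,
                    auto_id=False, max_length=100),
        # Dense embedding; its dimension depends on the embedding model.
        FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
        # Sparse embedding; sparse fields do not declare a dimension.
        FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
        # Raw text of the webpage chunk.
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=max_text_length),
    ]
    schema = CollectionSchema(fields=fields, description="Academic webpage content")
    return Collection(name=name, schema=schema, consistency_level="Strong")
```

For example, get_or_create_collection("Academic_Webpages", dense_dim=384) would create the collection on the first run and load it on later runs; 384 is only a placeholder dimension.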
Screenshots
Initializing Milvus for Main App
The screenshot shows a Python function, initialize_milvus(data), which appears to handle the initialization of a Milvus collection and insert data into it.
This function is responsible for:
- Creating or replacing a Milvus collection with a schema that includes fields for id, embedding, text_content, and url.
- Creating an index on the embedding field for efficient similarity search.
- Encoding text data into vector embeddings using a sentence transformer model.
- Inserting the embeddings, along with the text and URL, into the collection.
Initializing Milvus in Jupyter Notebook
Screenshot 1: Collection Name Definition
This screenshot defines a variable collection_name with the value Academic_Webpages. This variable represents the name of the collection where data related to academic webpages will be stored in the Milvus database.
Screenshot 2: Check and Load Existing Collection
In this section of the code, the script checks whether the collection Academic_Webpages already exists in the Milvus database using utility.has_collection(collection_name).
If the collection exists, it loads the collection using Collection(name=collection_name) and prints a message indicating that the collection already exists. The function then returns the existing collection object.
Screenshot 3: Create New Collection Schema and Insert Data
Here, if the collection Academic_Webpages doesn't exist, the script creates a new collection.
It prints a message indicating the creation of the collection. The schema is defined with several fields:
- pk_field: Primary key field (doc_id), which is of type VARCHAR.
- dense_field: A dense vector field of type FLOAT_VECTOR with a defined dimension (dense_dim).
- sparse_field: A sparse vector field of type SPARSE_FLOAT_VECTOR.
- text_field: A text field of type VARCHAR for storing webpage text.
The schema is passed to CollectionSchema, and the collection is created using the Collection class with the defined schema and consistency level.
- Schema and Collection Creation:
  - Initializes a schema and a collection named Academic_Webpages.
  - Sets a strong consistency level to ensure reliable data handling.
- Index Definition and Application:
  - Dense Vector Field: Configured with a FLAT index type.
  - Sparse Vector Field: Configured with a SPARSE_INVERTED_INDEX.
  - Both indexes use the IP (Inner Product) metric for similarity search.
- Index Confirmation:
  - A confirmation message is displayed once indexes are applied to their respective fields.
- Persisting Changes:
  - The collection is flushed to save all updates, ensuring data persistence.
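The indexing steps listed above could be expressed as in the sketch below. The index types and the IP metric come from this page; the function name apply_indexes, the field names dense_vector and sparse_vector as defaults, and the empty params dictionaries are assumptions. The collection argument is expected to be a pymilvus Collection.

```python
# Index settings as described above: FLAT for dense, SPARSE_INVERTED_INDEX
# for sparse, both using the inner product (IP) metric.
DENSE_INDEX = {"index_type": "FLAT", "metric_type": "IP", "params": {}}
SPARSE_INDEX = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP", "params": {}}


def apply_indexes(collection, dense_field="dense_vector", sparse_field="sparse_vector"):
    """Create both vector indexes, confirm, and flush the collection."""
    collection.create_index(field_name=dense_field, index_params=DENSE_INDEX)
    collection.create_index(field_name=sparse_field, index_params=SPARSE_INDEX)
    print(f"Indexes created on '{dense_field}' and '{sparse_field}'.")
    # Flush so that pending changes are persisted to storage.
    collection.flush()
```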
Dense Vector Collection
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

# Define the schema for your collection
fields = [
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),  # Dimensionality of the dense vector
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
]
schema = CollectionSchema(fields)

# Create collection (requires an active connection)
collection = Collection(name="dense_vector_collection", schema=schema)

# Create the index
index_params = {
    "index_type": "HNSW",  # Alternatives include "FLAT", "IVF_FLAT", and "IVF_PQ"
    "metric_type": "L2",  # Euclidean distance; use "IP" for inner product or "COSINE" for cosine similarity
    "params": {"M": 16, "efConstruction": 200},  # HNSW-specific build parameters
}
collection.create_index(field_name="embedding", index_params=index_params)
Sparse Vector Collection
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

# Define schema for sparse vectors
fields = [
    FieldSchema(name="sparse_embedding", dtype=DataType.SPARSE_FLOAT_VECTOR),  # Sparse vector field; no dim is declared
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
]
schema = CollectionSchema(fields)

# Create collection (requires an active connection)
collection = Collection(name="sparse_vector_collection", schema=schema)

# Create index for sparse vector
index_params = {
    "index_type": "SPARSE_INVERTED_INDEX",  # Sparse fields need a sparse index type, not IVF_FLAT
    "metric_type": "IP",  # Inner product, the metric this guide uses for sparse vectors
    "params": {},
}
collection.create_index(field_name="sparse_embedding", index_params=index_params)
- Initializing Entities List:
  - Starts with an empty list called entities to store vector data.
- Generating Vectors:
  - For each item in text_contents, dense and sparse vectors are generated using embedding functions.
  - These vectors are appended to the entities list.
- Checking for Existing Data:
  - Evaluates collection.num_entities to determine if the collection already contains data.
  - If the collection is empty, the new entities are inserted.
  - If data already exists, insertion is skipped and a message is printed to indicate no insertion was necessary.
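# num_vectors and sparse_vectors are assumed to be defined earlier

# Generate corresponding IDs for the vectors
ids = list(range(num_vectors))

# Prepare the data to be inserted: one list per field, in schema order
# (sparse_embedding first, then id, as declared in the sparse schema above)
data = [
    sparse_vectors,
    ids,
]

# Insert the sparse vectors into the collection
collection.insert(data)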
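The entity-building and insert-if-empty logic described in the list above might be sketched as follows. The function names, the dictionary keys, and the use of the list position as doc_id are assumptions for illustration; dense_embed and sparse_embed stand in for whatever embedding functions the project uses. Passing a list of row dictionaries to collection.insert relies on the row-based insert supported by recent pymilvus versions.

```python
def build_entities(text_contents, dense_embed, sparse_embed):
    """Build one row per text with its dense and sparse vectors."""
    entities = []
    for i, text in enumerate(text_contents):
        entities.append({
            "doc_id": str(i),  # hypothetical ID scheme: position in the list
            "dense_vector": dense_embed(text),
            "sparse_vector": sparse_embed(text),
            "text": text,
        })
    return entities


def insert_if_empty(collection, entities) -> bool:
    """Insert only when the collection holds no data; return True if inserted."""
    if collection.num_entities == 0:
        collection.insert(entities)
        return True
    print("Collection already contains data; skipping insertion.")
    return False
```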
- Loading the Collection into Memory:
  - The collection.load() function loads the specified collection into memory in Milvus.
- Purpose:
  - This step is essential for optimizing search performance.
  - Ensures data is readily accessible in memory for fast and efficient query execution.
Load the collection before running any search or query:
collection.load()
- Hybrid Search Definition:
  - The retriever function performs a hybrid search using Milvus, combining dense and sparse embeddings for improved retrieval accuracy.
- Preprocessing and Initialization:
  - Preprocesses text_contents and initializes embedding functions:
    - Dense Embeddings: Generated with Hugging Face models.
    - Sparse Embeddings: Generated with BM25 embeddings.
- Configuring Search Parameters:
  - After loading the Milvus collection, configures search parameters:
    - Dense Vector Field: dense_vector with IP (Inner Product) metric.
    - Sparse Vector Field: sparse_vector with IP (Inner Product) metric.
- Instantiating the Hybrid Retriever:
  - The MilvusCollectionHybridSearchRetriever is instantiated with:
    - Specified fields, embeddings, search parameters, and a weighted ranking scheme.
  - Combines dense and sparse results to return the configured hybrid retriever.
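A retriever configured as described above could be assembled roughly like this. The sketch assumes a langchain_milvus release that still ships MilvusCollectionHybridSearchRetriever and a pymilvus release that provides WeightedRanker. The function name build_hybrid_retriever, the text field name, top_k, and the equal weights are illustrative assumptions, not the project's confirmed values.

```python
# Field names and metrics as listed above.
DENSE_FIELD = "dense_vector"
SPARSE_FIELD = "sparse_vector"
DENSE_SEARCH_PARAMS = {"metric_type": "IP", "params": {}}
SPARSE_SEARCH_PARAMS = {"metric_type": "IP", "params": {}}


def build_hybrid_retriever(collection, dense_embedding, sparse_embedding,
                           text_field="text", top_k=3, weights=(0.5, 0.5)):
    """Return a hybrid retriever that reranks dense and sparse hits by weight."""
    # Imported lazily; both packages must be installed for this call to work.
    from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever
    from pymilvus import WeightedRanker

    return MilvusCollectionHybridSearchRetriever(
        collection=collection,
        rerank=WeightedRanker(*weights),  # one weight per vector field
        anns_fields=[DENSE_FIELD, SPARSE_FIELD],
        field_embeddings=[dense_embedding, sparse_embedding],
        field_search_params=[DENSE_SEARCH_PARAMS, SPARSE_SEARCH_PARAMS],
        top_k=top_k,
        text_field=text_field,
    )
```

The order of anns_fields, field_embeddings, and field_search_params must line up, since each position describes one vector field.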
Screenshot
Milvus Hybrid search
The code snippet in the screenshot below defines the retriever() function, which is responsible for initializing the necessary components to retrieve relevant data using both dense and sparse vectors from a Milvus collection.
Milvus Collection Search
The code snippet defines the function retrieve_context(), which is responsible for retrieving relevant context from a Milvus collection based on a query embedding.
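Since the screenshot itself is not reproduced here, the following is a hedged sketch of what a retrieve_context() function of this kind might look like, not the project's actual code. It assumes the main-app field names mentioned earlier (embedding, text_content, url) and an IP metric; the metric must match whatever the index on the embedding field was built with. The collection argument is expected to be a loaded pymilvus Collection.

```python
def retrieve_context(collection, query_embedding, top_k=3, metric_type="IP"):
    """Search the embedding field and return text, URL, and score per hit."""
    results = collection.search(
        data=[query_embedding],  # a single query vector
        anns_field="embedding",
        param={"metric_type": metric_type, "params": {}},
        limit=top_k,
        output_fields=["text_content", "url"],
    )
    hits = results[0]  # one list of hits per query vector
    return [
        {
            "text": hit.entity.get("text_content"),
            "url": hit.entity.get("url"),
            "score": hit.distance,
        }
        for hit in hits
    ]
```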
- Disconnecting from Milvus Server:
  - This code snippet demonstrates how to disconnect from a Milvus server using pymilvus.
- Importing the Connections Module:
  - Imports the connections module from pymilvus, which manages connections to Milvus instances.
- Disconnecting the Active Connection:
  - Uses connections.disconnect("default") to disconnect the active connection labeled "default".
  - This safely closes the session with the Milvus server, releasing resources and ending communication.
from pymilvus import connections
# Close the connection
connections.disconnect("default")
- Connection Error:
  - If you encounter a ConnectionError, ensure that:
    - The URI is correct.
    - Milvus Lite is running.
- Vector Type Mismatch Error:
  - This error occurs when vector types differ between the query and the collection schema.
  - Solution: Ensure that the vector type in your query matches the collection schema (e.g., both should be FLOAT_VECTOR, or both SPARSE_FLOAT_VECTOR).
  - The same error appears when the vector type of inserted documents does not match the collection schema.
- Embedding Errors:
  - If embeddings are not generated correctly for inputs like "Hi," consider:
    - Validating the query input.
    - Using a fallback response for unsupported inputs.
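One possible way to apply the two suggestions for embedding errors is sketched below. Everything here is hypothetical: the two-word threshold, the fallback message, and the function names are illustrative choices, and search_fn stands in for whatever function embeds the query and searches Milvus.

```python
import re

# Hypothetical fallback text and minimum query length.
FALLBACK_RESPONSE = "Please ask a more specific question about the CSE department."
MIN_WORDS = 2


def is_searchable(query: str, min_words: int = MIN_WORDS) -> bool:
    """Return True when the query has enough word tokens to embed meaningfully."""
    words = re.findall(r"[A-Za-z0-9]+", query or "")
    return len(words) >= min_words


def answer_or_fallback(query: str, search_fn):
    """Run search_fn only for valid queries; otherwise return the fallback text."""
    if not is_searchable(query):
        return FALLBACK_RESPONSE
    try:
        return search_fn(query)
    except Exception:
        # Embedding or search failed; fall back instead of crashing the chat.
        return FALLBACK_RESPONSE
```

With these defaults, a greeting such as "Hi" is answered with the fallback text and never reaches the embedding step.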