VDK and LlamaIndex

Integrating Versatile Data Kit (VDK) with LlamaIndex can open up new possibilities for enhancing Retrieval-Augmented Generation (RAG) applications by leveraging the strengths of both platforms.

LlamaIndex's extensive integration capabilities with various data sources provide a robust foundation for data ingestion and indexing, which might make VDK's role in preprocessing and data pipeline management seem redundant. However, integrating VDK with LlamaIndex can still offer significant value by enhancing data quality, efficiency, orchestration, and operational scalability.

LlamaIndex's data processing

Note: In LlamaIndex, "pipeline" refers to a sequence of processes designed to handle data systematically.

  • Query pipelines: LlamaIndex features a declarative query API that enables chaining together different modules (such as LLMs, prompts, retrievers, and other pipelines) to orchestrate workflows over data. This QueryPipeline abstraction facilitates the creation of simple-to-advanced workflows, improving code readability and integration with low-code/no-code solutions. It supports common RAG-related tasks such as query rewriting, retrieval, reranking, and response synthesis. For instance, in a customer service bot, it can reformulate user queries into formats better understood by LLMs, retrieve relevant information, and synthesize responses.
  • Ingestion pipeline: The IngestionPipeline in LlamaIndex focuses on transforming input data and inserting it into a vector database, if one is configured. It applies a series of transformations (e.g., sentence splitting, title extraction, embedding generation) to the input data, and the resulting nodes are either returned for further processing or inserted into a vector database. This pipeline is designed to enhance data quality and efficiency by preparing data for indexing and retrieval tasks.

Hypothetical scenario with LlamaIndex:

from llama_hub.confluence import ConfluenceReader
from llama_index.ingestion import IngestionPipeline
from llama_index.query_pipeline import QueryPipeline
from your_transformations import YourSentenceSplitter, YourEmbeddingGenerator
from your_query_modules import YourRetriever, YourResponseGenerator


oauth2_dict = {
    "client_id": "<client_id>",
    "token": {
        "access_token": "<access_token>",
        "token_type": "<token_type>"
    }
}
base_url = "https://yoursite.atlassian.com/wiki"
cql = 'type="page" AND label="devops"'

reader = ConfluenceReader(base_url=base_url, oauth2=oauth2_dict)
documents = reader.load_data(cql=cql, max_num_results=5)

# Initialize the Ingestion Pipeline with transformations
ingestion_pipeline = IngestionPipeline(transformations=[
    YourSentenceSplitter(),
    YourEmbeddingGenerator(),
])

# Process the fetched documents; run() applies each transformation in order
processed_nodes = ingestion_pipeline.run(documents=documents)

# Assuming the resulting nodes are now indexed in a vector store for retrieval

# Setting up a QueryPipeline that chains retrieval and response generation
query_pipeline = QueryPipeline(chain=[
    YourRetriever(),  # Custom module to retrieve documents based on a query
    YourResponseGenerator(),  # Custom module to generate responses from retrieved documents
])

# Executing a query with the QueryPipeline
query_result = query_pipeline.run(query="How to implement CI/CD in DevOps projects?")

# Output the query result
print(query_result)

Problems found while exploring the above scenario:

  • Lack of direct change tracking: When indexing content from Confluence, there is often no direct way to track changes in the documents. The tool indexes the documents at a specific point in time, and any changes made post-indexing are not immediately reflected in the index, which can lead to discrepancies between the indexed content and the actual content. Solutions typically involve setting up periodic re-indexing to capture changes, but this can be resource-intensive and might not be suitable for environments where documents are updated frequently. (VDK's scheduled data jobs can address this; see the sketch after this list.)
  • In-memory load: Indexing tools often load significant portions of the data into memory to facilitate fast searching and retrieval. While this is great for performance, it can become a problem when dealing with a large number of documents. High memory usage can lead to increased costs for hardware or cloud services. Moreover, if the system runs out of memory, it can lead to performance degradation or even crashes, affecting availability and user experience.
  • Security: LlamaIndex offers modules to connect indexes to external vector stores for storing embeddings. It is worth noting that each vector store has its own privacy policies and practices, and LlamaIndex does not assume responsibility for how they handle or use your data.
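
The change-tracking problem above maps naturally onto a scheduled VDK data job. Below is a minimal sketch, assuming the same placeholder transformations as in the scenario; the step file name, the property keys (confluence_url, confluence_oauth2, last_indexed_at), and the incremental CQL filter are all illustrative, not a prescribed VDK or LlamaIndex API.

# 10_reindex_confluence.py: a step in a VDK data job (hypothetical file name).
# VDK discovers step files in the job directory and calls their run() function.
from datetime import datetime, timezone

from llama_hub.confluence import ConfluenceReader
from llama_index.ingestion import IngestionPipeline
from vdk.api.job_input import IJobInput

from your_transformations import YourSentenceSplitter, YourEmbeddingGenerator


def run(job_input: IJobInput) -> None:
    # Read connection details from job properties instead of hard-coding them.
    base_url = job_input.get_property("confluence_url")
    oauth2_dict = job_input.get_property("confluence_oauth2")

    # Fetch only pages modified since the last successful run, so each
    # scheduled execution stays incremental and cheap (illustrative property).
    last_run = job_input.get_property("last_indexed_at", "1970-01-01 00:00")
    cql = f'type="page" AND label="devops" AND lastmodified > "{last_run}"'

    reader = ConfluenceReader(base_url=base_url, oauth2=oauth2_dict)
    documents = reader.load_data(cql=cql, max_num_results=50)

    pipeline = IngestionPipeline(transformations=[
        YourSentenceSplitter(),
        YourEmbeddingGenerator(),
    ])
    nodes = pipeline.run(documents=documents)

    # Persist the nodes to your vector store here, then advance the watermark
    # so the next scheduled run picks up only newer changes.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    job_input.set_all_properties(
        {**job_input.get_all_properties(), "last_indexed_at": now}
    )

The schedule itself lives in the job's config.ini (for example, schedule_cron = 0 */6 * * * for a run every six hours), so the index is refreshed without manual re-indexing.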

Check: a data job that uses LlamaIndex to retrieve Confluence data and load it into a vector database

Concept: While LlamaIndex excels at connecting to and ingesting data from a multitude of sources, VDK can be employed to implement advanced data transformation, quality enhancement, and orchestration layers that preprocess data in complex ways before it is indexed by LlamaIndex.

How VDK can complement LlamaIndex

VDK can enhance the capabilities of LlamaIndex by providing advanced data pipeline management, scheduling, and deployment features that are not explicitly covered by LlamaIndex's focus on data ingestion, indexing, and query processing.

  • Advanced data pipeline orchestration: VDK excels at orchestrating complex data pipelines, allowing for the automation of data ingestion, processing, and transformation workflows. By integrating VDK with LlamaIndex, developers can automate the preprocessing of data before it is ingested into LlamaIndex for indexing. This could include cleaning, normalization, enrichment, and transformation of data from disparate sources, ensuring that the data fed into LlamaIndex is of high quality and ready for efficient indexing and retrieval (e.g., by packaging the entire workflow as a single data job or as a DAG of jobs).
  • Scheduled and scalable data processing: One of VDK's core strengths is its ability to schedule data pipelines to run at specific intervals, ensuring data freshness and relevance for RAG applications. This scheduled processing is crucial for applications that rely on up-to-date information to provide accurate and contextually relevant responses. Furthermore, VDK's architecture is designed so that, as data volumes grow, the data pipelines can scale accordingly, maintaining performance and reliability without manual intervention.
  • Data quality and reliability: With VDK's data quality checks and monitoring features, data pipelines can be configured so that only data meeting specific quality criteria is indexed by LlamaIndex (quality checks can be added as data jobs in a DAG, as steps within a single job, or as reusable templates; see the sketch after this list). This not only improves the reliability of the RAG application but also enhances the user experience by providing more accurate and relevant responses. VDK's monitoring capabilities also allow for real-time tracking of data pipeline health, enabling quick identification and resolution of issues: a job that does not pass its quality check simply fails, and the bad data never reaches the index.
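
To make the quality gate concrete, here is a minimal sketch of such a check as its own step that runs before the indexing step. The check_quality criteria and the load_pending_documents helper are hypothetical; only the def run(job_input) step convention comes from VDK.

# 20_quality_gate.py: hypothetical VDK step that fails the job on bad data.
from vdk.api.job_input import IJobInput

from your_transformations import load_pending_documents  # hypothetical helper


def check_quality(text: str) -> bool:
    # Illustrative criteria: non-empty content of a minimum length.
    return len(text.strip()) >= 50


def run(job_input: IJobInput) -> None:
    documents = load_pending_documents(job_input)
    bad = [d for d in documents if not check_quality(d.text)]
    if bad:
        # Raising an exception fails the whole data job, so low-quality data
        # never reaches the LlamaIndex indexing step that follows.
        raise ValueError(f"{len(bad)} documents failed the quality gate")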

Llama Packs

Llama Packs are built on top of LlamaIndex. These packs leverage LlamaIndex's infrastructure to streamline the development of LLM apps, offering pre-configured modules that can be directly integrated into projects. For instance, a RAG pipeline template within Llama Packs would use LlamaIndex's data indexing and query processing features to facilitate efficient information retrieval and augmentation for LLMs.
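
As a quick illustration of the developer experience, a published pack can be downloaded and run in a few lines. This is a sketch, assuming the llama_index 0.9-era import path and the SentenceWindowRetrieverPack that was available on LlamaHub at the time; treat both as illustrative:

from llama_index import SimpleDirectoryReader
from llama_index.llama_pack import download_llama_pack

# Download the pack's source code into ./sw_pack and get back the pack class.
SentenceWindowRetrieverPack = download_llama_pack(
    "SentenceWindowRetrieverPack", "./sw_pack"
)

# The pack bundles parsing, indexing, and a query engine behind one object.
documents = SimpleDirectoryReader("./data").load_data()
pack = SentenceWindowRetrieverPack(documents)

print(pack.run("How to implement CI/CD in DevOps projects?"))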

Limitations of Llama Packs:

  • Llama Packs focus primarily on accelerating LLM application development with specific modules and templates, such as RAG pipelines, without offering broader data pipeline orchestration and automation.
  • They do not provide scheduling capabilities for automated data updates or workflows.
  • Llama Packs lack built-in features for data quality checks and governance across the data lifecycle.
  • They are not designed for infrastructure management, making them less suited for handling the deployment and scalability challenges of large-scale applications.