Semantic Search Retrieval with Google AQA - gunpal5/Google_GenerativeAI GitHub Wiki

This guide walks through setting up semantic search retrieval with Google’s AQA model, using the Google_GenerativeAI and DocumentChunker NuGet packages. It starts with a short introduction to the key concepts.


Table of Contents

  1. Introduction
  2. Prerequisites
  3. Google Cloud Setup (One-Time)
  4. Project Setup (NuGet Packages)
  5. Authentication and Model Initialization
  6. Corpus Management
  7. Document and Chunking
  8. Question Answering
  9. Complete Code

Introduction: Understanding RAG and Semantic Search

Retrieval-Augmented Generation (RAG):

  • RAG combines the power of large language models (LLMs) with information retrieval.
  • Instead of just using the LLM’s internal knowledge, RAG systems first retrieve relevant information from a knowledge base (a corpus of documents).
  • Then, they augment the LLM’s response generation with this retrieved information.
  • This leads to:
    • More accurate and factual answers.
    • Up-to-date information (since you can update the corpus).
    • Ability to answer questions about specific documents.
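The retrieve-then-augment flow described above can be sketched in a few lines. This is an illustration only, not how AQA or the Google_GenerativeAI library works internally: the word-overlap scoring is a crude stand-in for real retrieval, and `OverlapScore`, `BuildPrompt`, and the prompt wording are hypothetical names chosen for this example. It is written as top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Score a passage by how many distinct question words it contains.
// This is a toy stand-in for a real retriever.
static int OverlapScore(string question, string passage)
{
    var separators = new[] { ' ', ',', '.', '?', '!' };
    var questionWords = question.ToLowerInvariant()
        .Split(separators, StringSplitOptions.RemoveEmptyEntries)
        .ToHashSet();
    return passage.ToLowerInvariant()
        .Split(separators, StringSplitOptions.RemoveEmptyEntries)
        .Distinct()
        .Count(w => questionWords.Contains(w));
}

// Step 1 (retrieve): pick the top-k passages.
// Step 2 (augment): place them in the prompt ahead of the question.
static string BuildPrompt(string question, IEnumerable<string> passages, int topK)
{
    var retrieved = passages
        .OrderByDescending(p => OverlapScore(question, p))
        .Take(topK);

    return "Answer using only these passages:\n"
        + string.Join("\n", retrieved.Select(p => "- " + p))
        + "\nQuestion: " + question;
}

var corpus = new[]
{
    "The Pharaon is a merchant ship from Marseille.",
    "Dantes was imprisoned in the Chateau d'If.",
};

string prompt = BuildPrompt("Where was Dantes imprisoned?", corpus, topK: 1);
Console.WriteLine(prompt);
```

The prompt that reaches the LLM now carries the passage about the Chateau d'If, so the model can answer from the corpus rather than from its internal knowledge alone.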

Semantic Search Retrieval:

  • Traditional search engines often rely on keyword matching.
  • Semantic search goes deeper. It understands the meaning and context of both the search query and the documents.
  • It uses techniques like embeddings (vector representations of text) to find documents that are semantically similar to the query, even if they don’t share the exact same keywords.
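To make the idea of "semantically similar" concrete, the sketch below ranks two passages by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and this guide does not state which similarity measure AQA uses, so treat cosine similarity here as an assumption. `CosineSimilarity` is a hypothetical helper written as a top-level local function.

```csharp
using System;
using System.Linq;

// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
// Returns NaN if either vector is all zeros.
static double CosineSimilarity(double[] a, double[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Made-up "embeddings" for a query and two passages.
var query = new double[] { 0.9, 0.1, 0.0 };
var passages = new (string Text, double[] Vector)[]
{
    ("Dantes is locked in the Chateau d'If.", new double[] { 0.8, 0.2, 0.1 }),
    ("Recipe for bouillabaisse.",             new double[] { 0.0, 0.1, 0.9 }),
};

// Rank passages by similarity to the query, best match first.
var ranked = passages
    .OrderByDescending(p => CosineSimilarity(query, p.Vector))
    .ToArray();

Console.WriteLine(ranked[0].Text);
```

The first passage ranks highest because its vector points in nearly the same direction as the query vector, regardless of whether the two texts share any keywords.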

Google AQA (Attributed Question-Answering):

  • This model is designed to answer questions from a supplied corpus and to attribute each answer to the passages it came from.
  • It’s good at:
    • Understanding the intent of a question.
    • Finding the most relevant passages within a corpus.
    • Providing an "Answerable Probability" score: the model’s estimate that its answer is correct and grounded in the retrieved passages.

This tutorial shows you how to build a simple RAG system using Google AQA.


Prerequisites

  1. C# Project: A new or existing .NET console application.
  2. NuGet Packages:
    • Google_GenerativeAI
    • DocumentChunker

1. Google Cloud Setup (One-Time)

Service Account:

  1. Open the Google Cloud Console.
  2. Go to “IAM & Admin” → “Service accounts”.
  3. Create a service account, or select an existing one.
  4. Create and download a JSON key.
  5. Environment Variable: Set Google_Service_Account_Json to the JSON key file’s full path.

2. Project Setup (NuGet Packages)

In your C# project, install the necessary NuGet packages:

dotnet add package Google_GenerativeAI
dotnet add package DocumentChunker

3. Authentication and Model Initialization

Get Service Account Path

Retrieve the path to your service account key.

// Read the key file path; the variable may be missing, so keep it nullable until checked.
string? serviceAccountKeyPath = Environment.GetEnvironmentVariable("Google_Service_Account_Json");
if (string.IsNullOrEmpty(serviceAccountKeyPath))
{
    throw new InvalidOperationException("The Google_Service_Account_Json environment variable is not set.");
}
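A wrong path or the wrong kind of JSON file is a common cause of authentication failures, so it can help to sanity-check the key before creating the authenticator. The sketch below is an optional extra, not part of the Google_GenerativeAI library: `LooksLikeServiceAccountKey` is a hypothetical helper that only checks for the `type`, `client_email`, and `private_key` fields that service account key files normally contain. It does not verify that the key is valid.

```csharp
using System;
using System.Text.Json;

// Returns true if the JSON text has the fields a service account key normally carries.
static bool LooksLikeServiceAccountKey(string json)
{
    try
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;
        return root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("type", out JsonElement typeProp)
            && typeProp.ValueKind == JsonValueKind.String
            && typeProp.GetString() == "service_account"
            && root.TryGetProperty("client_email", out _)
            && root.TryGetProperty("private_key", out _);
    }
    catch (JsonException)
    {
        return false; // Not valid JSON at all.
    }
}

// Placeholder content for demonstration; a real key file has more fields.
string sample = "{\"type\":\"service_account\",\"client_email\":\"demo@example.iam.gserviceaccount.com\",\"private_key\":\"placeholder\"}";

Console.WriteLine(LooksLikeServiceAccountKey(sample)); // True
Console.WriteLine(LooksLikeServiceAccountKey("{}"));   // False
```

In a real project you would pass `File.ReadAllText(serviceAccountKeyPath)` instead of the sample string.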

Create Authenticator

Create a GoogleServiceAccountAuthenticator.

var authenticator = new GoogleServiceAccountAuthenticator(serviceAccountKeyPath);

Initialize Retriever Model

Create a SemanticRetrieverModel for AQA.

var retrieverModel = new SemanticRetrieverModel(
    GoogleAIModels.Aqa,
    EnvironmentVariables.GOOGLE_API_KEY,
    authenticator: authenticator
);

4. Corpus Management

Get Corpora Manager

Access the CorporaManager.

var corporaManager = retrieverModel.CorporaManager;

Get or Create Corpus (Helper Function)

This function handles finding or creating your corpus.

private static async Task<Corpus> GetOrCreateCorpusAsync(CorporaManager corporaManager, string displayName)
{
    var corporaList = await corporaManager.ListCorporaAsync();
    Corpus? existingCorpus = corporaList?.FirstOrDefault(c => c.DisplayName == displayName);

    if (existingCorpus != null)
    {
        return existingCorpus;
    }
    else
    {
        return await corporaManager.CreateCorpusAsync(displayName);
    }
}

Call the Helper Function

Use the function to get your corpus.

string corpusDisplayName = "My Search Corpus";
Corpus corpus = await GetOrCreateCorpusAsync(corporaManager, corpusDisplayName);

5. Document and Chunking

Add Document and Chunks (Helper Function)

This function creates a document in the corpus, splits the source text into chunks, and uploads the chunks in batches.

private static async Task AddDocumentAndChunksAsync(
    CorporaManager corporaManager,
    string corpusName,
    string contentUrl,
    string documentName,
    string author)
{
    // Use DocumentChunker to split the text.
    var chunker = new PlainTextDocumentChunker(new ChunkerConfig(500, ChunkType.Paragraph));

    var document = await corporaManager.AddDocumentAsync(
        corpusName,
        documentName,
        new List<CustomMetadata> { new CustomMetadata() { Key = "Author", StringValue = author } }
    );

    await foreach (var textParts in chunker.ExtractChunksInPartsFromUrlAsync(contentUrl, 100))
    {
        var chunks = textParts.Select(text => new Chunk() { Data = new ChunkData() { StringValue = text } }).ToList();
        await corporaManager.AddChunksAsync(document.Name, chunks);
    }
}
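For intuition about what paragraph-based chunking and batched upload involve, here is a standalone sketch. It is not the DocumentChunker implementation: `ChunkByParagraph` is a hypothetical helper, and it assumes the size limit counts words, which this guide does not state. Batching uses `Enumerable.Chunk`, which requires .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Split text on blank lines, then pack whole paragraphs into chunks of at most maxWords words.
// A single paragraph longer than maxWords is kept whole as its own oversized chunk.
static List<string> ChunkByParagraph(string text, int maxWords)
{
    var whitespace = new[] { ' ', '\n', '\r', '\t' };
    var result = new List<string>();
    var current = new List<string>();
    int currentWords = 0;

    foreach (string paragraph in text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries))
    {
        int words = paragraph.Split(whitespace, StringSplitOptions.RemoveEmptyEntries).Length;

        // Start a new chunk when adding this paragraph would exceed the limit.
        if (current.Count > 0 && currentWords + words > maxWords)
        {
            result.Add(string.Join("\n\n", current));
            current.Clear();
            currentWords = 0;
        }
        current.Add(paragraph.Trim());
        currentWords += words;
    }

    if (current.Count > 0) result.Add(string.Join("\n\n", current));
    return result;
}

string sample = "one two three\n\nfour five\n\nsix seven eight nine";
List<string> chunks = ChunkByParagraph(sample, maxWords: 5);

// Group chunks into upload batches (batch size 1 here to keep the demo small).
int batchCount = chunks.Chunk(1).Count();
Console.WriteLine($"{chunks.Count} chunks in {batchCount} batches"); // 2 chunks in 2 batches
```

Keeping paragraphs whole preserves local context inside each chunk, and uploading in fixed-size batches keeps each request to the service bounded in size.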

Call the Helper Function

Add your document (from a URL in this example).

string documentUrl = "https://www.gutenberg.org/cache/epub/1184/pg1184.txt";
string documentDisplayName = "The Count of Monte Cristo";
string authorName = "Alexandre Dumas";

await AddDocumentAndChunksAsync(
    corporaManager,
    corpus.Name,
    documentUrl,
    documentDisplayName,
    authorName
);

6. Question Answering

Create Chat Session

Create a ChatSession for interacting with the corpus.

var chatSession = retrieverModel.CreateChatSession(corpus.Name, AnswerStyle.VERBOSE); 

Ask a Question

Provide your question.

string userQuestion = "What is Edmond Dantes imprisoned for?";

Generate Answer

Get the response from the model.

var answerResponse = await chatSession.GenerateAnswerAsync(userQuestion);

Process and Display the Response

Print the answer together with its answerable probability.

Console.WriteLine($"Question: {userQuestion}");
Console.WriteLine($"Answer: {answerResponse.GetAnswer()}");
Console.WriteLine($"Answerable Probability: {answerResponse.AnswerableProbability}");
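A low answerable probability suggests the corpus may not contain the answer, so an application will often want to gate the output on it. The sketch below shows one way to do that. `FormatAnswer` and the 0.5 threshold are assumptions made for this example, and the parameter types are guesses: check the actual types returned by `GetAnswer()` and `AnswerableProbability` in your version of the library.

```csharp
using System;
using System.Globalization;

// Decide what to show the user, given the model's answer and its answerable probability.
static string FormatAnswer(string? answer, double? answerableProbability, double threshold)
{
    if (string.IsNullOrWhiteSpace(answer) || answerableProbability is null)
        return "No answer was returned.";

    if (answerableProbability.Value < threshold)
        return "The corpus does not appear to contain a reliable answer to this question.";

    // Format with the invariant culture so the decimal separator is always a dot.
    string score = answerableProbability.Value.ToString("0.00", CultureInfo.InvariantCulture);
    return $"{answer} (answerable probability {score})";
}

Console.WriteLine(FormatAnswer("He was falsely accused of treason.", 0.92, threshold: 0.5));
Console.WriteLine(FormatAnswer("Unclear.", 0.12, threshold: 0.5));
```

The right threshold depends on how costly a wrong answer is in your application, so it is worth tuning against a set of questions whose answers you already know.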

Complete Code

using GenerativeAI;
using GenerativeAI.Authenticators;
using GenerativeAI.Clients;
using GenerativeAI.Types;
using DocumentChunker.Chunkers;
using DocumentChunker.Core;
using DocumentChunker.Enum;

public class Program
{
    public static async Task Main(string[] args)
    {
        // --- 1. Authentication and Model Initialization ---

        // Read the key file path; the variable may be missing, so keep it nullable until checked.
        string? serviceAccountKeyPath = Environment.GetEnvironmentVariable("Google_Service_Account_Json");
        if (string.IsNullOrEmpty(serviceAccountKeyPath))
        {
            throw new InvalidOperationException("The Google_Service_Account_Json environment variable is not set.");
        }
        var authenticator = new GoogleServiceAccountAuthenticator(serviceAccountKeyPath);
        var retrieverModel = new SemanticRetrieverModel(GoogleAIModels.Aqa, EnvironmentVariables.GOOGLE_API_KEY, authenticator: authenticator);

        // --- 2. Corpus Management ---

        var corporaManager = retrieverModel.CorporaManager;
        string corpusDisplayName = "My Search Corpus";
        Corpus corpus = await GetOrCreateCorpusAsync(corporaManager, corpusDisplayName);

        // --- 3. Document and Chunking ---

        string documentUrl = "https://www.gutenberg.org/cache/epub/1184/pg1184.txt"; // Example: Count of Monte Cristo
        string documentDisplayName = "The Count of Monte Cristo";
        string authorName = "Alexandre Dumas";

        await AddDocumentAndChunksAsync(corporaManager, corpus.Name, documentUrl, documentDisplayName, authorName);

        // --- 4. Question Answering ---

        var chatSession = retrieverModel.CreateChatSession(corpus.Name, AnswerStyle.VERBOSE);
        string userQuestion = "What is Edmond Dantes imprisoned for?";
        var answerResponse = await chatSession.GenerateAnswerAsync(userQuestion);

        Console.WriteLine($"Question: {userQuestion}");
        Console.WriteLine($"Answer: {answerResponse.GetAnswer()}");
        Console.WriteLine($"Answerable Probability: {answerResponse.AnswerableProbability}");
    }

    // --- Helper Functions ---

    private static async Task<Corpus> GetOrCreateCorpusAsync(CorporaManager corporaManager, string displayName)
    {
        var corporaList = await corporaManager.ListCorporaAsync();
        Corpus? existingCorpus = corporaList?.FirstOrDefault(c => c.DisplayName == displayName);

        if (existingCorpus != null)
        {
            return existingCorpus;
        }
        else
        {
            return await corporaManager.CreateCorpusAsync(displayName);
        }
    }

    private static async Task AddDocumentAndChunksAsync(CorporaManager corporaManager, string corpusName, string contentUrl, string documentName, string author)
    {
        var chunker = new PlainTextDocumentChunker(new ChunkerConfig(500, ChunkType.Paragraph));

        var document = await corporaManager.AddDocumentAsync(corpusName, documentName,
            new List<CustomMetadata> { new CustomMetadata() { Key = "Author", StringValue = author } });

        await foreach (var textParts in chunker.ExtractChunksInPartsFromUrlAsync(contentUrl, 100))
        {
            var chunks = textParts.Select(text => new Chunk() { Data = new ChunkData() { StringValue = text } }).ToList();
            await corporaManager.AddChunksAsync(document.Name, chunks);
        }
    }
}