adding_retriever - Sidies/MasterThesis-HubLink GitHub Wiki

title: Adding a new Retriever

This page explains how to add a new retriever to the SQA system. Given a query, a retriever selects the relevant documents or triples from a larger collection of documents or triples. The retriever is part of the RAG pipeline implemented in the SQA system.

Adding a new Retriever

Retrievers in the SQA system are located in the retrieval folder: ../blob/experiments/sqa-system/sqa_system/retrieval/implementations/. To add a new retriever, create a new folder inside the implementations folder and name it after the retriever. Inside that folder, add the Python implementation of the retriever. The entry point should be a main Python file containing a class that subclasses the Retriever class located in the implementations/base folder: ../blob/experiments/sqa-system/sqa_system/retrieval/implementations/base. You need to choose between two types of retrievers: a DocumentRetriever or a KnowledgeGraphRetriever.

The DocumentRetriever works with text documents as input sources, specifically the full text of scientific publications. The retriever may pre-process these documents before a question is submitted. At query time, it finds the relevant text passages in the documents. Unlike the KG variant, the DocumentRetriever is not bound to a specific knowledge base. Instead, the corresponding RetrievalConfig specifies the dataset configuration for the data that should be used for retrieval.
The KnowledgeGraphRetriever operates directly on a Knowledge Graph (KG). Instead of documents, it uses a connection to an existing KG. Similar to the document-based version, the retriever can index or prepare the graph before query time. When a question is submitted, the retriever has to find the relevant triples in the KG.

1. Implementation of the Retriever

Both retriever types are subclasses of the Retriever class and work similarly. The main differences are the input data type and the configuration class used. For either type, you should implement the retrieve() method. It is the main method that carries out the retrieval process and is called by the RetrievalPipeline: see ../blob/experiments/sqa-system/sqa_system/pipeline/retrieval_pipeline.py.
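
To make the shape of such a subclass concrete, here is a minimal, self-contained sketch. It does not use the real SQA classes: the Retriever base class, the Context result type, the retrieve() signature, and the KeywordDocumentRetriever name are all hypothetical stand-ins, so check the base folder for the actual interface before copying anything.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical result type; the real system defines its own."""
    text: str
    score: float


class Retriever(ABC):
    """Hypothetical stand-in for the base class in implementations/base."""

    @abstractmethod
    def retrieve(self, query_text: str) -> list[Context]:
        """Return the contexts judged relevant for the question."""


class KeywordDocumentRetriever(Retriever):
    """Toy retriever that ranks passages by word overlap with the question."""

    def __init__(self, passages: list[str], top_k: int = 2) -> None:
        self.passages = passages
        self.top_k = top_k

    def retrieve(self, query_text: str) -> list[Context]:
        query_words = set(query_text.lower().split())
        scored = []
        for passage in self.passages:
            # Count the words shared between the question and the passage.
            overlap = len(query_words & set(passage.lower().split()))
            if overlap > 0:
                scored.append(Context(text=passage, score=float(overlap)))
        # Highest overlap first, truncated to the requested number of results.
        scored.sort(key=lambda c: c.score, reverse=True)
        return scored[: self.top_k]


retriever = KeywordDocumentRetriever(
    passages=[
        "Knowledge graphs store facts as triples.",
        "Retrievers select relevant passages for a question.",
        "The weather was pleasant.",
    ]
)
results = retriever.retrieve("Which passages do retrievers select?")
```

A real retriever would typically replace the word-overlap scoring with an index built ahead of query time, but the division of work stays the same: preparation in the constructor, lookup in retrieve().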

A retriever usually has its own configuration parameters. To declare them, add them to the ADDITIONAL_CONFIG_PARAMS list in the retriever class. Each entry should be of type AdditionalConfigParameter, for example:

AdditionalConfigParameter(
    name="embedding_config",
    description="Configuration for the embeddings model.",
    param_type=EmbeddingConfig,
    available_values=[],
    default_value=_DEFAULT_EMBEDDING_CONFIG
)
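
The following sketch shows how such a list could sit inside a retriever class and how declared defaults could be combined with user-supplied values. The AdditionalConfigParameter dataclass below is a hypothetical stand-in that only mirrors the fields used in the example above, and the MyRetriever class, its parameters, and the override logic are assumptions for illustration, not the way the SQA system actually resolves its configuration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AdditionalConfigParameter:
    """Hypothetical stand-in mirroring the fields used in the example above."""
    name: str
    description: str
    param_type: type
    available_values: list = field(default_factory=list)
    default_value: Any = None


class MyRetriever:
    """Hypothetical retriever; a real one would subclass the system's base class."""

    ADDITIONAL_CONFIG_PARAMS = [
        AdditionalConfigParameter(
            name="top_k",
            description="Number of results to return.",
            param_type=int,
            default_value=5,
        ),
        AdditionalConfigParameter(
            name="distance_metric",
            description="Metric used to compare embeddings.",
            param_type=str,
            available_values=["cosine", "l2"],
            default_value="cosine",
        ),
    ]

    def __init__(self, **overrides: Any) -> None:
        # Start from the declared defaults, then apply user-supplied values.
        self.settings = {p.name: p.default_value for p in self.ADDITIONAL_CONFIG_PARAMS}
        for key, value in overrides.items():
            if key not in self.settings:
                raise KeyError(f"Unknown config parameter: {key}")
            self.settings[key] = value


custom = MyRetriever(top_k=10)
```

Declaring the parameters as data rather than as constructor arguments lets surrounding tooling list, validate, and document them without instantiating the retriever.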

2. Registration in the Factory

After implementing the retriever, you need to register it in either the DocumentRetrieverFactory or the KnowledgeGraphRetrieverFactory class, depending on the type of retriever you implemented. Both are located in ../blob/experiments/sqa-system/sqa_system/retrieval/factory/. The factory is responsible for creating instances of the retriever based on the configuration.

First, you need to add the retriever to the matching enum and give it a representative name, which will be used in the configuration:

class KnowledgeGraphRetrieverType(Enum):
    """
    An enum class that represents the different types of knowledge graph retrievers
    that are available in the system.
    """
    NAME = "name"
    ...

class DocumentRetrieverType(Enum):
    """
    This is the main enum that maps a string to the retriever type.
    It is used in the configuration files to specify exactly the 
    retriever to use.
    
    If new retrievers are added, they should be added here as well.
    """
    NAME = "name"
    ...
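
As a small illustration of how the string from a configuration file maps back to an enum member, consider the sketch below. The MY_RETRIEVER entry and the parse_retriever_type() helper are hypothetical and only demonstrate standard Enum value lookup. They are not part of the SQA factory.

```python
from enum import Enum


class DocumentRetrieverType(Enum):
    """Reduced copy of the enum with one hypothetical entry."""
    MY_RETRIEVER = "my_retriever"


def parse_retriever_type(value: str) -> DocumentRetrieverType:
    """Map a configuration string to its enum member (hypothetical helper)."""
    try:
        # Calling the enum with a value performs a lookup by value.
        return DocumentRetrieverType(value)
    except ValueError as e:
        supported = ", ".join(t.value for t in DocumentRetrieverType)
        raise ValueError(
            f"Unsupported retriever type '{value}'. Supported: {supported}"
        ) from e


selected = parse_retriever_type("my_retriever")
```

Because the lookup is done by value, the string on the right-hand side of the enum entry is what has to appear in the configuration, not the member name on the left.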

Next, you need to add the import of the retriever in the import_retriever() method of the factory class. This method is responsible for importing the retriever class based on the type provided. The import is done dynamically to avoid unnecessary dependencies if a retriever is not used.

@staticmethod
def import_retriever(retriever_type: str) -> type[DocumentRetriever]:
    """
    Imports the retriever with the specified type.
    This method dynamically imports the retriever class based on the type provided.

    We dynamically import the retriever class so that if a retriever has specific 
    requirements but is not used, the user does not need to install the dependencies.

    Args:
        retriever_type: The type of retriever to import
    Returns:
        The retriever class
    Raises:
        ImportError: If required dependencies are not installed
        ValueError: If retriever type is not supported
    """
    if retriever_type == DocumentRetrieverType.NAME.value:
        try:
            from sqa_system.retrieval.implementations.Document.document_retriever\
                import DocumentRetriever
            return DocumentRetriever
        except ImportError as e:
            raise ImportError(
                f"Document retriever requires additional dependencies: {e}"
            ) from e
    ...
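
The lazy-import idea can also be tried out in isolation. The sketch below is a table-driven variation built on importlib.import_module rather than the if/try chain shown above. The registry, its keys, and the standard-library targets are placeholders chosen only so that the example runs without the SQA system installed.

```python
import importlib

# Hypothetical registry: type string -> (module path, class name).
# Standard-library targets are used only so that the sketch runs anywhere.
_RETRIEVER_REGISTRY = {
    "ordered": ("collections", "OrderedDict"),
    "missing": ("package_that_is_not_installed", "SomeRetriever"),
}


def import_retriever(retriever_type: str) -> type:
    """Import the class registered for the given type only when requested."""
    if retriever_type not in _RETRIEVER_REGISTRY:
        raise ValueError(f"Retriever type '{retriever_type}' is not supported")
    module_path, class_name = _RETRIEVER_REGISTRY[retriever_type]
    try:
        # The module is loaded here, not when this file itself is imported.
        module = importlib.import_module(module_path)
    except ImportError as e:
        raise ImportError(
            f"Retriever '{retriever_type}' requires additional dependencies: {e}"
        ) from e
    return getattr(module, class_name)


retriever_cls = import_retriever("ordered")
```

Requesting the "missing" entry raises ImportError only at the moment that type is asked for, which is the behavior the factory relies on: users who never select a retriever never need its dependencies.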

🥳 That's it! You have successfully added a new retriever to the SQA system. You can now use it in your experiments and configurations.