adding_retriever - Sidies/MasterThesis-HubLink GitHub Wiki
This page explains how to add a new retriever to the SQA system. Given a query, a retriever returns the relevant documents or triples from a collection of documents or triples. The retriever is part of the RAG pipeline implemented in the SQA system.
Retrievers in the SQA system are located in the `retrieval` folder: ../blob/experiments/sqa-system/sqa_system/retrieval/implementations/. To add a new retriever, create a new folder in the `implementations` folder and name it after the retriever. Inside that folder, add the Python implementation of the retriever. The entry point should be a main Python file containing a subclass of the `Retriever` class located in the `implementations/base` folder: ../blob/experiments/sqa-system/sqa_system/retrieval/implementations/base. Choose between two types of retrievers: a `DocumentRetriever` or a `KnowledgeGraphRetriever`.
- The `DocumentRetriever` works with text documents as input sources, specifically the full text of scientific publications. The retriever may pre-process these documents before a question is submitted; at query time, it finds the relevant text passages in the documents. Unlike the KG variant, the `DocumentRetriever` is not bound to a specific knowledge base. Instead, the corresponding `RetrievalConfig` specifies the dataset configuration for the data that should be used for retrieval.
- The `KnowledgeGraphRetriever` operates directly on a Knowledge Graph (KG). Instead of documents, it uses a connection to an existing KG. As with the document-based version, the retriever can index or prepare the graph before query time. When a question is submitted, the retriever has to find the relevant triples in the KG.
Both retriever types are subclasses of the `Retriever` class and work similarly. The main differences are the input data type and the configuration class used. For either type, implement the `retrieve()` method, which carries out the retrieval and is called by the `RetrievalPipeline`: see ../blob/experiments/sqa-system/sqa_system/pipeline/retrieval_pipeline.py.
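To make the subclassing step concrete, here is a minimal sketch of a document-based retriever. It is self-contained, so `Retriever`, `DocumentRetriever`, and `Context` below are simplified stand-ins and not the real SQA classes. The constructor arguments and the `retrieve()` signature are assumptions as well, so check the base classes in `implementations/base` for the actual interface. `KeywordOverlapRetriever` is a hypothetical example name.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical stand-in for one retrieved text passage."""
    text: str
    score: float


class Retriever(ABC):
    """Stand-in for the SQA base class; the real signature may differ."""

    @abstractmethod
    def retrieve(self, query_text: str) -> list[Context]:
        """Return the contexts that are relevant to the query."""


class DocumentRetriever(Retriever):
    """Stand-in for the document-based retriever type."""

    def __init__(self, documents: list[str]) -> None:
        self.documents = documents


class KeywordOverlapRetriever(DocumentRetriever):
    """Hypothetical retriever that ranks documents by shared query words."""

    def __init__(self, documents: list[str], top_k: int = 2) -> None:
        super().__init__(documents)
        self.top_k = top_k

    def retrieve(self, query_text: str) -> list[Context]:
        query_terms = set(query_text.lower().split())
        scored = []
        for doc in self.documents:
            # Score = number of distinct words shared with the query.
            overlap = len(query_terms & set(doc.lower().split()))
            if overlap > 0:
                scored.append(Context(text=doc, score=float(overlap)))
        # Highest score first; keep only the top_k results.
        scored.sort(key=lambda context: context.score, reverse=True)
        return scored[: self.top_k]
```

A pipeline would then only need to call `retrieve()` on whichever retriever the configuration selects, which is why both retriever types share the same base class.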
A retriever usually has its own configuration parameters. To declare them, add them to the `ADDITIONAL_CONFIG_PARAMS` list in the retriever class. Each entry should be of type `AdditionalConfigParameter`, for example:

    AdditionalConfigParameter(
        name="embedding_config",
        description="Configuration for the embeddings model.",
        param_type=EmbeddingConfig,
        available_values=[],
        default_value=_DEFAULT_EMBEDDING_CONFIG
    )

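The following sketch illustrates how such a parameter list might sit on a retriever class and how defaults could be combined with user-supplied values. The `AdditionalConfigParameter` dataclass here is a stand-in that only mirrors the five fields shown above. The parameters `top_k` and `distance_metric`, the class `MyRetriever`, and the helper `resolve_params()` are hypothetical and are not part of the SQA system.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AdditionalConfigParameter:
    """Stand-in with the same field names as in the example above."""
    name: str
    description: str
    param_type: type
    available_values: list = field(default_factory=list)
    default_value: Any = None


class MyRetriever:
    """Hypothetical retriever declaring two extra parameters."""

    ADDITIONAL_CONFIG_PARAMS = [
        AdditionalConfigParameter(
            name="top_k",
            description="Number of contexts to return.",
            param_type=int,
            default_value=5,
        ),
        AdditionalConfigParameter(
            name="distance_metric",
            description="Metric used to compare embeddings.",
            param_type=str,
            available_values=["cosine", "l2"],
            default_value="cosine",
        ),
    ]


def resolve_params(retriever_cls: type, overrides: dict) -> dict:
    """Hypothetical helper: merge overrides with the declared defaults."""
    resolved = {}
    for param in retriever_cls.ADDITIONAL_CONFIG_PARAMS:
        value = overrides.get(param.name, param.default_value)
        if not isinstance(value, param.param_type):
            raise TypeError(f"{param.name} must be of type {param.param_type.__name__}")
        # An empty available_values list means any value of the type is accepted.
        if param.available_values and value not in param.available_values:
            raise ValueError(f"{param.name} must be one of {param.available_values}")
        resolved[param.name] = value
    return resolved
```

Declaring the parameters as data on the class lets other parts of a system inspect them, for example to validate a configuration file before the retriever is instantiated.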
After implementing the retriever, register it in the `DocumentRetrieverFactory` or the `KnowledgeGraphRetrieverFactory` class, depending on the type of retriever you implemented. Both are located in ../blob/experiments/sqa-system/sqa_system/retrieval/factory/. The factory is responsible for creating instances of the retriever based on the configuration.
First, add the retriever to the enum and give it a representative name, which will be used in the configuration:

    from enum import Enum


    class KnowledgeGraphRetrieverType(Enum):
        """
        An enum class that represents the different types of knowledge graph retrievers
        that are available in the system.
        """
        NAME = "name"
        ...


    class DocumentRetrieverType(Enum):
        """
        This is the main enum that maps a string to the retriever type.
        It is used in the configuration files to specify exactly the
        retriever to use.
        If new retrievers are added, they should be added here as well.
        """
        NAME = "name"
        ...

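Because the enum value is the string used in configuration files, a configured name can be turned back into an enum member with the standard `Enum` value lookup. The sketch below shows this with a reduced stand-in enum. The member `KEYWORD_OVERLAP` and the helper `parse_retriever_type()` are hypothetical, and how the SQA factories actually perform the lookup may differ.

```python
from enum import Enum


class DocumentRetrieverType(Enum):
    """Reduced stand-in enum with one hypothetical member."""
    KEYWORD_OVERLAP = "keyword_overlap"


def parse_retriever_type(name: str) -> DocumentRetrieverType:
    """Map a configured string to its enum member, with a readable error."""
    try:
        # Calling the enum class with a value returns the matching member.
        return DocumentRetrieverType(name)
    except ValueError as e:
        supported = ", ".join(member.value for member in DocumentRetrieverType)
        raise ValueError(
            f"Unsupported retriever type '{name}'. Supported types: {supported}"
        ) from e
```

For example, `parse_retriever_type("keyword_overlap")` returns `DocumentRetrieverType.KEYWORD_OVERLAP`, while an unknown name raises a `ValueError` that lists the supported values.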
Next, add the import of the retriever to the `import_retriever()` method of the factory class. This method imports the retriever class for the given type. The import is done dynamically, so the dependencies of a retriever that is not used do not have to be installed.

    @staticmethod
    def import_retriever(retriever_type: str) -> type[DocumentRetriever]:
        """
        Imports the retriever with the specified type.

        This method dynamically imports the retriever class based on the type provided.
        We dynamically import the retriever class so that if a retriever has specific
        requirements but is not used, the user does not need to install the dependencies.

        Args:
            retriever_type: The type of retriever to import

        Returns:
            The retriever class

        Raises:
            ImportError: If required dependencies are not installed
            ValueError: If retriever type is not supported
        """
        if retriever_type == DocumentRetrieverType.NAME.value:
            try:
                from sqa_system.retrieval.implementations.Document.document_retriever\
                    import DocumentRetriever
                return DocumentRetriever
            except ImportError as e:
                raise ImportError(
                    f"Document retriever requires additional dependencies: {e}"
                ) from e
        ...

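The same lazy-import idea can also be written in a table-driven way with `importlib.import_module` from the standard library. This is an alternative sketch and not how the SQA factory is implemented, since the factory uses one `if` branch with an inline import per retriever. To keep the example runnable without the SQA package, the hypothetical `_REGISTRY` maps type names to a standard-library class and to a deliberately missing module.

```python
import importlib

# Hypothetical registry: type name -> (module path, class name).
# Standard-library targets are used so the sketch runs on its own.
_REGISTRY = {
    "fraction": ("fractions", "Fraction"),
    "missing": ("package_that_is_not_installed", "Anything"),
}


def import_retriever(retriever_type: str) -> type:
    """Import the class registered for retriever_type only when requested."""
    if retriever_type not in _REGISTRY:
        raise ValueError(f"Retriever type '{retriever_type}' is not supported")
    module_name, class_name = _REGISTRY[retriever_type]
    try:
        # The module is loaded here, not when this file is imported.
        module = importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"Retriever '{retriever_type}' requires additional dependencies: {e}"
        ) from e
    return getattr(module, class_name)
```

With either style, a missing optional dependency only surfaces when the corresponding retriever type is actually requested.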
🥳 That's it! You have successfully added a new retriever to the SQA system. You can now use it in your experiments and configurations.