Architecture
To demonstrate and evaluate the performance of our HubLink Retriever, we developed a Python framework that enables testing and evaluation of various retrieval approaches. We refer to this framework as the Scholarly Question Answering (SQA) system. The system is designed to easily run different Retrieval Augmented Generation (RAG) pipeline configurations and evaluate the results.
On this wiki page, we are going to detail the overall architecture of the SQA system.
The SQA framework is designed with extensibility in mind. Consequently, it provides well-defined abstract interfaces and base classes to facilitate the straightforward integration of new KGQA approaches, new KGs or vector databases, different LLMs, and custom intermediate pipeline stages, ensuring that the framework can adapt to future research directions and technological advancements. The framework employs a modular software architecture, as depicted above, to promote a clear separation of concerns, where each component encapsulates a specific set of functionalities. In the following, we briefly introduce each component of the SQA framework before explaining them in detail:
Configuration Component: The initialization and experimental setup are fundamentally based on the Configuration component, which parses and validates JSON-based configuration files that define the parameters and selected modules for all other components.
Data Component: Data loading, preprocessing, and distribution are the responsibility of the Data component. It manages metadata and content data from publications, as well as QA pairs from JSON and CSV sources, and makes these accessible to other parts of the system.
Pipeline Component: The core logic of the question-answering process resides within the Pipeline component. It orchestrates the sequence of pipeline operations (pre-retrieval processing, context retrieval, post-retrieval processing, and final answer generation).
Retriever Component: Closely integrated with the pipeline, the Retriever component searches the designated KB (KG or vector index) to find passages or facts relevant to the input question.
Language Model Component: A unified interface for interacting with local or API-based LLMs is provided by the Language Model component. It handles LLM access for other components of the system and parses the responses.
Knowledge Base Component: A KB stores and provides access to the data in the system. Two types of KBs are supported, namely KGs and vector indices; both are unified in the Knowledge Base component.
Experiment Component: The Experiment component provides the functionality to perform systematic evaluations. It iterates over specified configurations, executes the QA pipeline for each test question, collects results, computes relevant RAG metrics, and stores these findings for later analysis.
QA Generator Component: This component helps to create evaluation datasets by implementing semi-automated methods to generate KGQA pairs from a KG using an LLM.
App Component: User interaction is managed through the App component, which offers a CLI application. This tool allows users to manage configurations, execute experiments, ingest data, and perform interactive querying of pipelines.
The SQA system processes several types of data and features the structured LLM-based extraction of text. This section details the primary inputs and outputs handled by the system and describes the process of extracting structured data from paper texts.
Any data that is inserted into the system has to be converted to either a `Publication` or a `QAPair` object, as depicted in the image above. We introduce both in the following:
Publication Data:
An important data input of the SQA framework is the metadata and content texts of scientific publications. These are ingested into the system from raw data, for example, in JSON format. For our experiments, we implemented a dedicated `JsonDataLoader`, which loads the publication data and converts it into the internal `Publication` data model that encapsulates both metadata and the available full text for each scientific publication. To handle multiple publications, the `PublicationDataset` data model is used. This data model stores a collection of publications and provides easy access to them. The dataset is managed by the `DatasetManager` class, which enables other components to access datasets by providing configuration files. The manager then initiates the loading and handles the persistence.
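To illustrate these data models, the following is a minimal sketch of how `Publication` and `PublicationDataset` might look as plain Python dataclasses. The field names are illustrative assumptions; the framework's actual classes may define additional fields and behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    """Minimal sketch of a publication record (field names are hypothetical)."""
    doi: str                        # unique identifier of the publication
    title: str
    authors: list[str] = field(default_factory=list)
    full_text: str | None = None    # available full text, if any


@dataclass
class PublicationDataset:
    """Collection wrapper providing access to publications by identifier."""
    publications: dict[str, Publication] = field(default_factory=dict)

    def add(self, publication: Publication) -> None:
        self.publications[publication.doi] = publication

    def get(self, doi: str) -> Publication | None:
        return self.publications.get(doi)
```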
Question-Answer Pair Data:
Question-answer pairs include the questions and the ground truth needed to conduct the experiments. They are internally stored as `QAPair` objects in the SQA system and loaded by data loader implementations from raw formats such as CSV. For our experiments, we implemented a `CSVDataLoader` that allows the serialized KGQA dataset to be imported and used for the experiments. The `QAPair` data model has the following attributes:
- UID: A unique identifier for the specific question-answer pair.
- Question: The natural language question posed to the QA system.
- Topic Entity: An optional field specifying an entity within the target KG, which potentially serves as an entry point for certain retrieval algorithms. It is optional as not all retrievers require this information.
- Golden Answer: A reference answer to the question, formulated in natural language, considered correct for evaluation purposes.
- Golden Triples: A set of triples present in the target KG that represent the factual basis for the Golden Answer.
- Golden Text Chunks: The text passages from the source publications containing the information that corresponds to the Golden Answer and aligns with the Golden Triples.
- Source IDs: Identifiers of the source publications from which the Golden Answer, Golden Triples, and Golden Text Chunks were derived.
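These attributes map naturally onto a simple data class. The following is a minimal sketch, assuming a plain Python dataclass; the field names mirror the list above but may not match the framework's actual `QAPair` definition.

```python
from dataclasses import dataclass, field


@dataclass
class QAPair:
    """Sketch of a question-answer pair with its ground truth (illustrative fields)."""
    uid: str                                    # unique identifier of the pair
    question: str                               # natural language question
    golden_answer: str                          # reference answer in natural language
    topic_entity: str | None = None             # optional KG entry point for retrievers
    golden_triples: list[tuple[str, str, str]] = field(default_factory=list)
    golden_text_chunks: list[str] = field(default_factory=list)
    source_ids: list[str] = field(default_factory=list)
```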
Similarly to the publication data, a collection of `QAPair` objects is handled by a `QADataset` data model. This model handles the access to the pairs and their persistence.
Depending on the executed workflow (e.g., experimentation, interactive querying, QA generation), the SQA system produces several outputs:
Experiment Results:
When running an experiment, several outputs are produced for each question. These include the generated answer and the retrieved context, both relating to the provided question. These outputs are then evaluated against the golden ground truth data provided by the `QADataset` using dedicated `Evaluator` objects. These produce comprehensive evaluation reports containing various RAG metrics (e.g., context relevance, answer faithfulness) and other measurements, which are then aggregated and exported in CSV format. Subsequently, an `ExperimentVisualizer` can be used to generate plots and tables that visualize the results.
Simple Answer: When executing a RAG pipeline in interactive mode, the primary output returned to the user is the generated natural language answer to the question posed.
QA Generation Data:
Another output comprises the semi-automatically generated `QAPair` objects, which are stored in a `QADataset` structure and serialized into CSV format for subsequent use in evaluations.
The SQA system also provides the ability to extract scientific content directly from the text of scientific publications. This process, orchestrated by the `ContentExtractor` class using a configured LLM, transforms unstructured text into structured data suitable for enriching KGs. The extraction process involves the following steps:
- Text Chunking: The source text of the publication is segmented into smaller, manageable chunks suitable for processing by the context window limits of the LLM.
- Chunk Extraction: Each text chunk is processed by the LLM. To enhance completeness, this extraction is performed multiple times per chunk; in subsequent passes, previously extracted information is aggregated and provided back to the LLM as context to minimize redundancy.
- Tracing: Each extracted data item is mapped back to its location in the original publication source. This trace provides provenance for the information extracted.
As mentioned above, the core extraction mechanism involves prompting the LLM. This is accomplished by populating a predefined JSON schema that defines the target information types, attributes, and relationships to be extracted (e.g., research methods used, research questions, key findings). The task of the LLM is to analyze the provided text chunk and accurately populate this schema with the corresponding information found within the text. The extracted data can subsequently be utilized, for instance, in the construction of a KG.
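As an illustration of this schema-filling approach, the sketch below shows a simplified target schema and how a prompt for one text chunk could be assembled. The schema fields, prompt wording, and function name are assumptions for illustration and do not reproduce the actual `ContentExtractor` prompts.

```python
import json

# Hypothetical, simplified target schema; the real schema defines the
# information types, attributes, and relationships to be extracted.
EXTRACTION_SCHEMA = {
    "research_questions": ["string"],
    "research_methods": ["string"],
    "key_findings": ["string"],
}

PROMPT_TEMPLATE = (
    "You are given a text chunk from a scientific publication.\n"
    "Populate the following JSON schema with information found in the text.\n"
    "Only use information that is explicitly stated in the chunk.\n\n"
    "Schema:\n{schema}\n\n"
    "Previously extracted information (do not repeat it):\n{previous}\n\n"
    "Text chunk:\n{chunk}\n"
)


def build_extraction_prompt(chunk: str, previous: dict) -> str:
    """Assemble the prompt for a single extraction pass over one chunk."""
    return PROMPT_TEMPLATE.format(
        schema=json.dumps(EXTRACTION_SCHEMA, indent=2),
        previous=json.dumps(previous, indent=2),
        chunk=chunk,
    )
```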
A primary goal during the development of the SQA system was to ensure the easy definition and reproducibility of experiments. To achieve this, the system utilizes configuration files stored in JSON format. This storage method serializes configurations, enabling their persistence for future use and allowing experiments to be repeated accurately by using the corresponding files.
Each component of the SQA system is controlled through a dedicated configuration file that defines its required parameters. An important characteristic of a configuration is that it can be hierarchically defined, meaning that one configuration may include another. The image above shows the KG configuration model as an example. It illustrates the parameters required, demonstrating that configurations can contain both primitive data types such as strings and integers and complex structures like nested configuration objects.
The `ConfigurationManager` is responsible for loading the JSON files into the SQA system. This manager interprets the file to confirm that it is correctly structured and provides the filled configuration object to the relevant component that needs to be initialized. Furthermore, for reproducibility and caching, it is essential to determine whether two configurations are identical. This is achieved by calculating a cryptographic hash value for each configuration object. Because the hash is calculated over the content of the configuration, two hashes are identical if and only if the contents of the configurations are identical, which makes it easy to determine whether two configurations are the same. This functionality is used by components in the system to avoid duplicate instantiation in caching scenarios, such as storing language models or KGs.
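To make the hashing idea concrete, the following is a minimal sketch that computes a deterministic hash over a hypothetical nested configuration by serializing it to canonical JSON first. The configuration keys and the choice of SHA-256 are assumptions; the actual `ConfigurationManager` may use a different serialization or hash function.

```python
import hashlib
import json

# Hypothetical nested configuration: a KG configuration embedding an
# embedding-model configuration (parameter names are illustrative).
kg_config = {
    "type": "rdf_knowledge_graph",
    "endpoint": "https://example.org/sparql",
    "embedding_config": {
        "model": "example-embedding-model",
        "dimensions": 768,
    },
}


def config_hash(config: dict) -> str:
    """Hash the content of a configuration deterministically.

    Sorting the keys makes the serialization canonical, so two configurations
    yield the same hash if and only if their contents are identical.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(config_hash(kg_config))
```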
The RAG process is implemented within the SQA system using a pipeline architecture facilitated by the LangChain library [1]. This pipeline accepts a question and optionally a topic entity as input. The pipeline processes this input and returns a `PipeIO` object populated with the results. The `PipeIO` object acts as a data container, progressively accumulating information such as the retrieved contexts and the generated answer as it traverses the pipeline stages. It also collects tracking data like carbon emissions or LLM token usage.
In the following sections, we first describe the retrieval pipeline and then detail the `PipeIO` data model.
A standard pipeline comprises four distinct stages, referred to as pipes. These pipes are executed sequentially in a predefined order: Execution starts with the Pre-Retrieval Pipe, followed by the Retrieval Pipe, then the Post-Retrieval Pipe, and concludes with the Generation Pipe. These pipes are explained in the following:
Pre-Retrieval Pipe: This pipe is responsible for processing the input question before retrieval takes place. For example, the question given as input into the pipeline can be expanded to include synonyms or related terms to improve recall.
Retrieval Pipe: This pipe is responsible for extracting relevant contexts from the underlying KB; all retrievers are applied within this pipeline step. The task of the retriever is to identify and retrieve relevant context passages or structured data from the designated KB. The retrieved contexts are then stored in the `PipeIO` object. In addition, certain retrievers may generate a final answer directly. If the retriever does not generate an answer, a subsequent Generation Pipe is required to do so.
Post-Retrieval Pipe: This optional stage processes the contexts retrieved in the previous stage. Common operations include filtering irrelevant passages, reranking contexts based on relevance or other criteria, or selecting a subset of contexts based on specified criteria.
Generation Pipe: This pipe is responsible for synthesizing the final natural language answer. A configured LLM is invoked and provided with the original question and the retrieved contexts. The task of the LLM is to generate a coherent and factually grounded answer based on this input. The generated answer is then stored in the `PipeIO` object, completing the pipeline execution. This stage is optional if the retriever has already produced a final answer.
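Conceptually, the pipeline can be thought of as a list of pipes applied in order to a shared state object. The following is a minimal sketch of this idea; the class and method names are illustrative and do not reflect the framework's actual API.

```python
from typing import Any, Protocol


class Pipe(Protocol):
    """Common interface shared by all four pipe types (illustrative)."""

    def run(self, state: Any) -> Any: ...


class Pipeline:
    """Runs pre-retrieval, retrieval, post-retrieval, and generation pipes in order."""

    def __init__(self, pipes: list[Pipe]) -> None:
        self.pipes = pipes

    def run(self, state: Any) -> Any:
        for pipe in self.pipes:
            # Each pipe reads from and enriches the shared state object.
            # If a retriever already produced a final answer, the generation
            # pipe may simply pass the state through unchanged.
            state = pipe.run(state)
        return state
```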
The data model of the `PipeIO` object is shown in the image above. It contains several attributes that are populated by the individual pipes in the pipeline. Initially, the `PipeIO` object is created by the pipeline, where the attributes `initial_question`, `retrieval_question`, and `topic_entity` are initialized. The distinction between the two question attributes exists because, when the question is modified by the pre-retrieval pipe, the original question should still be preserved. Consequently, the `initial_question` attribute is read-only, while `retrieval_question` can be modified. Furthermore, the topic entity can optionally be added, which gives retrievers an entry point into the KG if they require one. Moreover, the `PipeIO` object contains the attributes `retrieved_context` and `generated_answer`, which are filled during execution.
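A minimal sketch of such a data container follows, with the read-only `initial_question` realized as a property. The attribute names follow the description above, but the sketch is an assumption and may differ from the framework's actual implementation.

```python
class PipeIO:
    """Shared state object passed through the pipeline (simplified sketch)."""

    def __init__(self, question: str, topic_entity: str | None = None) -> None:
        self._initial_question = question          # preserved original question
        self.retrieval_question = question         # may be rewritten by the pre-retrieval pipe
        self.topic_entity = topic_entity           # optional KG entry point for retrievers
        self.retrieved_context: list[str] = []     # filled by the retrieval pipe
        self.generated_answer: str | None = None   # filled by the retriever or generation pipe

    @property
    def initial_question(self) -> str:
        """Read-only access to the unmodified input question."""
        return self._initial_question
```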
The main class in the retrieval component is the `Retriever`, which is responsible for finding relevant contexts in a KB to answer a question. Two sub-classes of the `Retriever` are available, the `DocumentRetriever` and the `KnowledgeGraphRetriever`, each handling a different type of data:
Document-based Retriever: This retriever works with text documents as input sources, specifically the full texts of scientific publications. Unlike the KG variant, the `DocumentRetriever` is not bound to a specific KB; instead, the corresponding `DocumentRetrievalConfig` specifies the dataset that contains the data to be used for retrieval.
Knowledge Graph-based Retriever: This retriever operates directly on a KG as the underlying data store instead of on documents. The configuration for the KG is provided via the `KGRetrievalConfig` during initialization of the retriever.
Once the retriever is implemented in the system, it is used in a pipeline to allow the retrieval of context for questions and to generate answers.
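The split into the two retriever types can be sketched with abstract base classes as follows. This is a simplified illustration under the assumption of dictionary-based configurations; the framework's actual interfaces and configuration classes carry more parameters.

```python
from abc import ABC, abstractmethod


class Retriever(ABC):
    """Base interface: find relevant contexts for a question in some KB."""

    @abstractmethod
    def retrieve(self, question: str, topic_entity: str | None = None) -> list[str]:
        ...


class DocumentRetriever(Retriever):
    """Works on the full texts of publications rather than a KG."""

    def __init__(self, dataset_config: dict) -> None:
        self.dataset_config = dataset_config  # corresponds to DocumentRetrievalConfig

    def retrieve(self, question: str, topic_entity: str | None = None) -> list[str]:
        raise NotImplementedError("concrete document retrievers implement this")


class KnowledgeGraphRetriever(Retriever):
    """Operates directly on a KG as the underlying data store."""

    def __init__(self, kg_config: dict) -> None:
        self.kg_config = kg_config  # corresponds to KGRetrievalConfig

    def retrieve(self, question: str, topic_entity: str | None = None) -> list[str]:
        raise NotImplementedError("concrete KG retrievers implement this")
```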
Since many components of the SQA system work with LLMs, a dedicated Language Model component is provided. This component offers a unified interface for making requests to an LLM. The requests, in conjunction with their corresponding configuration, are handled by the `LLMProvider`, which is responsible for initializing the model and establishing the connection.
The implementation of the models is achieved by adapting the `BaseLanguageModel` and `Embeddings` classes from the LangChain [2] library. The advantage of using the LangChain framework lies in its support for a wide range of models, a well-maintained interface, and easy integration of new models. Two types of adapters are implemented, which are shown in the image above. The `LLMAdapter` is responsible for sending requests to the LLM and processing the responses. The `EmbeddingAdapter`, on the other hand, handles the transformation of texts into vectors.
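As an illustration of the adapter idea, the sketch below wraps LangChain objects behind two small adapter classes. The sketch uses LangChain's `BaseChatModel` and `Embeddings` interfaces as stand-ins; the framework's actual `LLMAdapter` and `EmbeddingAdapter` build on `BaseLanguageModel` and `Embeddings` and handle additional concerns such as response parsing and token tracking.

```python
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import BaseChatModel


class LLMAdapter:
    """Sends a prompt to a LangChain chat model and returns the plain text response."""

    def __init__(self, model: BaseChatModel) -> None:
        self.model = model

    def generate(self, prompt: str) -> str:
        response = self.model.invoke(prompt)  # LangChain returns a message object
        return response.content


class EmbeddingAdapter:
    """Transforms texts into vectors using a LangChain embedding model."""

    def __init__(self, embeddings: Embeddings) -> None:
        self.embeddings = embeddings

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        return self.embeddings.embed_documents(texts)

    def embed_query(self, text: str) -> list[float]:
        return self.embeddings.embed_query(text)
```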
The SQA system provides two different types of KBs as shown in the image above: KGs and vector stores. These KBs can be used by retrievers to find relevant contexts for answering questions. In the following, we detail their implementations in the SQA system:
Knowledge Graphs are represented as RDF graphs in the SQA system. These graphs consist of triples represented with the `Triple` data model, which in turn consists of `Knowledge` objects, as shown in the image above. The `KnowledgeGraphManager` is responsible for managing and providing `KnowledgeGraph` objects in the SQA system. It initializes the graphs based on the provided configuration and caches their connections so that other components can query the graphs. The preparation or creation of a KG is facilitated by the `KnowledgeGraphFactory`. This factory receives the graph configuration file and initializes or creates the KG based on the given parameters.
Vector Stores are specialized data structures that enable efficient vector storage and querying. They are used to store texts that have been transformed into a low-dimensional vector space using an embedding model. These vectors can then be used to calculate similarities between texts. In the SQA system, the `VectorStore` implementation from the LangChain library is adapted using a `VectorStoreAdapter`. Similarly to the knowledge graphs, a `VectorStoreManager` is responsible for initializing, managing, and providing the vector stores.
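A minimal sketch of such an adapter around LangChain's `VectorStore` interface is shown below. The class and method names are illustrative; the actual `VectorStoreAdapter` also handles configuration and data ingestion.

```python
from langchain_core.vectorstores import VectorStore


class VectorStoreAdapter:
    """Thin wrapper exposing similarity search over a LangChain vector store."""

    def __init__(self, store: VectorStore) -> None:
        self.store = store

    def query(self, text: str, k: int = 5) -> list[str]:
        # similarity_search embeds the query and returns the k most similar documents
        documents = self.store.similarity_search(text, k=k)
        return [doc.page_content for doc in documents]
```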
One of the main features of the SQA system is the execution of experiments. These are defined using configuration files as described in the Configuration section. The UML model of an `ExperimentConfig` is shown in the image above. This configuration contains all parameters necessary for conducting an experiment:
- Base Pipeline Config is the configuration of the pipeline used for answering questions. It contains all necessary parameters to initialize a RAG pipeline.
- Parameter Ranges is a list of parameters that allows varying the pipeline parameters within an experiment, which is useful for studying the effects of parameters on the results. The parameter configurations specified here are used by the `ExperimentRunner` to create multiple pipelines based on the base pipeline configuration, which are then executed in batch.
- Evaluators is a list of configurations for evaluators. These are the classes in the SQA system responsible for computing the RAG metrics for the experiment.
- QA Dataset is the configuration of the `QADataset` used for the experiment. This contains the questions to be answered in the experiment along with the corresponding ground truth data for evaluation.
The `ExperimentRunner` class is responsible for executing the experiment based on a configuration. This class takes the configuration and prepares the pipelines, which are then executed sequentially to fill the `PipeIO` object with data. Subsequently, the results of the pipeline are evaluated using the evaluators, which are also prepared by the runner. After the experiment has been conducted, the `ExperimentVisualizer` class is responsible for visualizing the results in diagrams and tables.
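To illustrate how parameter ranges can turn one base configuration into multiple pipeline configurations, the following is a minimal sketch using a cartesian product over the specified value lists. The parameter names and dictionary-based configuration are assumptions for illustration.

```python
import itertools
from copy import deepcopy


def expand_parameter_ranges(base_config: dict, ranges: dict[str, list]) -> list[dict]:
    """Create one configuration per combination of the varied parameter values."""
    keys = list(ranges)
    configs = []
    for values in itertools.product(*(ranges[k] for k in keys)):
        config = deepcopy(base_config)
        config.update(dict(zip(keys, values)))
        configs.append(config)
    return configs


# Hypothetical example: vary chunk size and number of retrieved contexts.
base = {"retriever": "hublink", "chunk_size": 512, "top_k": 5}
variants = expand_parameter_ranges(base, {"chunk_size": [256, 512], "top_k": [5, 10]})
print(len(variants))  # 4 pipeline configurations
```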
This component is responsible for the semi-automated creation of KGQA pairs, which are stored internally as `QAPair` data models. The SQA framework implements two different strategies for generating these `QAPair` objects: the Clustering Strategy and the Subgraph Strategy.
In the following, we first describe the clustering and subgraph strategies. Then we explain the context and answer validation of the generated questions and answers.
A significant challenge in generating KGQA datasets is ensuring the completeness of the ground truth, particularly when an answer corresponds to multiple triples in the graph [3]. Ideally, the ground truth associated with a question should encompass all valid, relevant triples from the KG. Incomplete ground truth can lead to misleading evaluations where systems are penalized for retrieving valid facts not included in the reference set. The Clustering Strategy aims to address this challenge by identifying and grouping all potentially relevant facts before generating the question. The strategy operates as follows:
- Provide Parameters: The strategy is initialized with various parameters. One essential parameter is the topic entity from the graph. This entity serves as the starting point for triple collection and is stored as the topic entity in the resulting `QAPair` object.
- Build Publication Subgraphs: The first step is to collect the subgraph for each publication containing triples that conform to user-provided restrictions. This requires the predicate type restriction parameter, specifying the predicate identifier used to initially select relevant subgraphs. For each matching triple containing this predicate, the graph is traversed backward to locate the unique "paper type" triple that acts as the root node for the subgraph of the publication. Once the root is found, the strategy builds the subgraph by traversing outgoing edges until leaf nodes are reached. The result is a collection of publication subgraphs, each guaranteed to contain the specified predicate type.
- Extract Values: In the second step, the strategy extracts values of interest from each publication subgraph found previously, using a predicate value restriction parameter defining the target predicate names. The strategy searches each subgraph for triples with matching predicates. Starting from these matching triples, it traverses all outgoing paths to leaf nodes, extracting their values. This produces a mapping from each extracted value of interest back to its source publication subgraph.
- Cluster Values: These extracted values of interest are then embedded into a vector space using a selected embedding model and subsequently clustered using the DBSCAN [4] algorithm (see the sketch after this list). This step forms clusters grouping semantically similar values. The mapping back to the source publication for each value is maintained throughout this process.
- Additional Restrictions: Based on these semantically coherent clusters, additional restrictions which are also provided as parameters can be applied. Each cluster is processed, and relevant triples conforming to these new restrictions are extracted from the associated publication subgraphs and added to the data payload of the cluster. The result is a set of clusters, each representing a semantic group of values and enriched with additional context triples.
- LLM-based Generation: After the clustering is done, the next step is the generation of the question and answer. Here, a template text and additional instructions for the LLM are provided as parameters to the strategy. The template is a question with placeholders to guide the LLM in the generation process and the additional instructions are appended to the base prompt to further fine-tune the generation process. To generate the question and answer, the clusters are processed one by one and forwarded to the LLM with the prepared prompt. The LLM then generates a question and a corresponding answer given the data from the cluster, the template and instructions.
- Prepare Ground Truth: The triples accumulated within each cluster during the preceding steps serve as the ground truth. These are stored alongside the generated question and answer in the `QAPair` object.
- Manual Validation: Due to the potential for LLM hallucination or failure to adhere to instructions or templates, a final manual validation step is essential. The generated question and answer are checked for correctness, coherence, and adherence to the template. In addition, the relevance and correctness of the collected ground truth triples must be manually verified.
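The Cluster Values step above can be sketched as follows, assuming an embedding function is available (for example via the `EmbeddingAdapter` of the Language Model component). The DBSCAN parameters are illustrative and not the values used in our experiments.

```python
from collections import defaultdict
from typing import Callable

from sklearn.cluster import DBSCAN


def cluster_values(
    values: list[str],
    embed: Callable[[list[str]], list[list[float]]],
    eps: float = 0.2,
    min_samples: int = 1,
) -> dict[int, list[str]]:
    """Group semantically similar values by clustering their embeddings."""
    vectors = embed(values)
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(vectors)
    clusters: dict[int, list[str]] = defaultdict(list)
    for value, label in zip(values, labels):
        clusters[label].append(value)  # label -1 would mark noise (none with min_samples=1)
    return dict(clusters)
```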
In the following, we illustrate an example to clarify the functionality of the clustering strategy. For this purpose, we use the following question template: "Which publications investigate the research object [research object name] and evaluate the sub-property [sub-property name]?".
First, we need to ensure that all subgraphs containing the required information are fetched for further processing. In the case of the ORKG, we can directly use the unique identifier of the Research Object predicate:
Next, we define the list of predicate names that should be clustered from the subgraphs. Based on the question, we must specify the corresponding predicate names under which the requested information is stored in the graph; in our case, the predicates are labeled as Research Object and Sub-Property. In addition, further parameters can be defined to influence how the information is added to the clusters. For example, we can decide to split the clusters. In this case, the retrieved values are not added to the current cluster; instead, a separate copy of the cluster is created for each value. This is desired for our question, as we do not want to collect all possible values of research objects and sub-properties in a single cluster but rather focus on specific instances.
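Putting the parameters of this example together, the strategy configuration could look roughly like the following. This is a hypothetical sketch: the parameter keys are not the framework's actual configuration keys, and "P_RESEARCH_OBJECT" is a placeholder rather than the real ORKG predicate identifier.

```python
# Hypothetical parameter sketch for the clustering strategy of this example.
clustering_strategy_params = {
    # Identifier of the predicate used to select relevant publication subgraphs;
    # "P_RESEARCH_OBJECT" is a placeholder, not the actual ORKG predicate ID.
    "predicate_type_restriction": "P_RESEARCH_OBJECT",
    # Predicate labels whose values are extracted and clustered.
    "predicate_value_restrictions": ["Research Object", "Sub-Property"],
    # Create a separate copy of the cluster per extracted value instead of
    # collecting all values in a single cluster.
    "split_clusters": True,
    # Template guiding the LLM-based question generation.
    "question_template": (
        "Which publications investigate the research object [research object name] "
        "and evaluate the sub-property [sub-property name]?"
    ),
}
```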
At this step, each cluster contains all the publications of the dataset that investigate the same research object and evaluate the same sub-property. These clusters are now individually forwarded to the LLM that generates the question and the answer based on the triples in the cluster.
The second available strategy is the Subgraph Strategy, which generates diverse question-answer pairs related to a single publication at a time. Unlike the cluster-based approach, this strategy does not inherently generate questions spanning multiple publications. Furthermore, it is less restrictive, which can enable the generation of a wider variety of question types. The strategy operates as follows:
- Input Definition: The strategy requires a publication entity from the KG as input, which acts as a topic entity identifier.
- Subgraph Extraction: Starting from the publication entity, the graph is traversed until leaf nodes are reached to extract the subgraph of the publication (see the sketch after this list). However, this subgraph is limited to a predefined size to fit within the context window of the LLM.
- LLM-based Generation: This subgraph is then provided to the LLM with the instruction to generate both a relevant question and the corresponding golden answer. The generation process is guided by requiring the LLM to output a JSON structure containing the generated question, the answer, and the specific subgraph triples that were used as the basis for generation.
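The Subgraph Extraction step can be sketched as a bounded breadth-first traversal. The sketch below assumes the graph is available as an adjacency mapping from entities to outgoing (predicate, object) edges; the actual implementation operates on the RDF graph abstractions of the Knowledge Base component.

```python
from collections import deque


def extract_bounded_subgraph(
    graph: dict[str, list[tuple[str, str]]],  # entity -> list of (predicate, object) edges
    publication_entity: str,
    max_triples: int = 100,
) -> list[tuple[str, str, str]]:
    """Collect triples reachable from the publication entity, bounded in size.

    The bound keeps the subgraph small enough to fit into the LLM context window.
    """
    triples: list[tuple[str, str, str]] = []
    queue = deque([publication_entity])
    visited = {publication_entity}
    while queue and len(triples) < max_triples:
        subject = queue.popleft()
        for predicate, obj in graph.get(subject, []):
            triples.append((subject, predicate, obj))
            if obj not in visited:  # follow outgoing edges until leaf nodes are reached
                visited.add(obj)
                queue.append(obj)
            if len(triples) >= max_triples:
                break
    return triples
```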
During the generation of `QAPair` objects, the LLM specifies the golden triples that were used for the generation. We observed that this is not always accurate: either the LLM indicates triples that do not match the generated answer, or the generated answer does not match the triples, indicating that the LLM is hallucinating. Therefore, we ensure through an additional validation process that the generated questions can actually be answered based on the data in the KG and that the created answer truly matches both the question and the data. The validation process works as follows:
- Triple Validation: It may occur that initially a larger set of triples was classified as relevant, but only a subset is actually necessary for answering the question. In this case, an LLM is prompted with a specially prepared validation prompt and the triples as input (see the sketch after this list). The LLM is instructed to ensure that all the information contained in the golden answer is present in the set of specified triples and to remove any triples that are not. This reduces the total set to the actually relevant triples or removes the question if no triples are relevant.
- Answer Validation: Furthermore, another LLM call is made to verify whether the generated answer matches the question.
- Grammar Correction: Since we observed that the generated questions are not always grammatically correct, an additional LLM call is performed to correct the generated question.
- Manual Validation: Finally, the generated data is manually reviewed before being saved to the final `QADataset`.
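A rough sketch of how the Triple Validation prompt could be framed is shown below. The prompt wording and the function name are illustrative assumptions and do not reproduce the exact prompts used in the framework.

```python
def build_triple_validation_prompt(
    question: str, golden_answer: str, triples: list[tuple[str, str, str]]
) -> str:
    """Ask the LLM to keep only the triples actually needed to support the answer."""
    triple_lines = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    return (
        "Given the question and its golden answer, check whether every piece of\n"
        "information in the answer is supported by the triples below. Return only\n"
        "the triples that are required to support the answer; if none are relevant,\n"
        "return an empty list.\n\n"
        f"Question: {question}\n"
        f"Golden answer: {golden_answer}\n"
        f"Triples:\n{triple_lines}\n"
    )
```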
The App component includes a CLI application that enables users to operate the SQA system through the command line. It offers the following functionalities:
- Configuration Management: All configurations of components in the SQA system can be generated using the command line. This ensures that the configurations are well-defined.
- Experiment Execution: Experiments can be executed directly from the command line based on existing configurations.
- Question Answering: Run pipelines in interactive mode for question answering.
First, the CLI application enables configuration management. Since each configuration requires a different structure of the JSON file, it can be difficult for users unfamiliar with the system to create these files manually. Therefore, the CLI application provides guided configuration creation. This allows users to create configurations step by step and select the necessary parameters.
Furthermore, the CLI application facilitates direct execution of experiments based on created configurations. Users can select from a list of experiment configurations or create them directly. After executing the experiment, users receive a summary of results in the console, while detailed information is stored in a designated folder. This includes a CSV file containing all evaluations and experiment results. The results are also visually presented in diagrams that are saved in the folder. In addition, the configuration files are stored in JSON format in the folder to track which specific configuration led to the results.
Another feature of the CLI application is the interactive execution of pipelines. Users can select a pipeline configuration and ask a question. The pipeline is then executed for the question and the answer is displayed in the console.
[1] LangChain, https://www.langchain.com/ [last accessed on 12.05.2025]
[2] LangChain, https://python.langchain.com/docs/integrations/chat/ [last accessed on 28.01.2025]
[3] What is in the KGQA Benchmark Datasets? Survey on Challenges in Datasets for Question Answering on Knowledge Graphs, Steinmetz et al. 2021: 10.1007/s13740-021-00128-9
[4] A density-based algorithm for discovering clusters in large spatial databases with noise, Ester et al. 1996