folder_structure - Sidies/MasterThesis-HubLink GitHub Wiki
This wiki page gives an overview of the folder structure of the Scholarly Question Answering (SQA) system and describes each component within it. The page is structured as follows:
- Data Component
- Experiments Component
- App Component
- Core Component
- Experimentation Component
- Knowledge Base Component
- Pipeline Component
- QA Generation Component
- Retrieval Component
- Tests
The data folder contains all the data used in the SQA system. The data is organized into different subfolders, each serving a specific purpose. Below is a table that describes the contents of each subfolder:
Directory | Reference | Description |
---|---|---|
data | ../blob/experiments/sqa-system/data/ | Contains the data for the Question-Answering System. |
data/cache/ | ../blob/experiments/sqa-system/data/cache/ | The cache folder of the SQA system. Any component or retriever can use this folder to store its cache. For example, the indexes of the HubLink retriever are stored here. |
data/configs/ | ../blob/experiments/sqa-system/data/configs/ | Contains the default configuration files for the Question-Answering System that are used within the CLI. This includes default configurations for LLMs, datasets, pipelines, pipes, QA generation, and experiments. |
data/evaluation_results/ | ../blob/experiments/sqa-system/data/evaluation_results/ | Contains the evaluation results for the Question-Answering System. After an experiment started from the CLI completes, its results are stored here. |
data/external/ | ../blob/experiments/sqa-system/data/external/ | Contains external data, such as scraped data, that is used to generate the knowledge bases of the system. |
data/file_paths/ | ../blob/experiments/sqa-system/data/file_paths/ | Directory for the FilePathManager ../blob/experiments/sqa-system/sqa_system/core/data/file_path_manager.py, which manages each file used in the system to provide easy access for other components. |
data/knowledge_base/ | ../blob/experiments/sqa-system/data/knowledge_base/ | If a knowledge base serializes its data, it is stored here. |
data/paper_extraction/ | ../blob/experiments/sqa-system/data/paper_extraction/ | Cache folder of the PaperContentExtractor ../blob/experiments/sqa-system/sqa_system/core/data/extraction/paper_content_extractor.py, which extracts structured data from the full texts of papers. |
data/prompts/ | ../blob/experiments/sqa-system/data/prompts/ | Contains all the prompts for the system, which are managed by the PromptProvider ../blob/experiments/sqa-system/sqa_system/core/language_model/prompt_provider.py. |
The experiments are documented in the experiments folder. Inside this folder, a README.md file explains the experiments in detail.
Directory | Reference | Description |
---|---|---|
experiments | ../blob/experiments/sqa-system/experiments/ | Documents the experiments that we conducted on the SQA system. |
experiments/1_experiment/ | ../blob/experiments/sqa-system/experiments/1_experiment/ | The first experiment includes the parameter selection process and the final comparison of the retrievers. All questions of the GQM plan except Q5 are answered with the results provided in the final comparison folder. |
experiments/2_experiment/ | ../blob/experiments/sqa-system/experiments/2_experiment/ | The second experiment includes test runs for four different graph variants of the ORKG. It is used solely to answer Q5 of the GQM plan. |
experiments/debugging/ | ../blob/experiments/sqa-system/experiments/debugging/ | Contains the test runs that we conducted to guide the development. |
experiments/qa_datasets/ | ../blob/experiments/sqa-system/experiments/qa_datasets/ | Contains the datasets used for the experiments and their creation scripts. The datasets are divided into two folders, full and reduced. The full folder contains the full datasets used for the experiments, while the reduced folder contains the reduced dataset used for the parameter selection process. |
The app component contains the logic for the CLI application of the SQA system. The most important classes within the app folder are:
- CLIController ../blob/experiments/sqa-system/sqa_system/app/cli/cli_controller.py: This is the main class responsible for managing the CLI application. It orchestrates the entire CLI process, including command handling and user interaction.
The app component is divided into different subfolders, each serving a specific purpose. Below is a table that describes the contents of each subfolder:
Directory | Reference | Description |
---|---|---|
sqa_system/app/ | ../blob/experiments/sqa-system/sqa_system/app/ | Contains the frontend logic of the SQA system, implemented as a CLI application. |
sqa_system/app/cli/ | ../blob/experiments/sqa-system/sqa_system/app/cli/ | Contains the CLI logic of the SQA system. The main entry point is the CLIController ../blob/experiments/sqa-system/sqa_system/app/cli/cli_controller.py. |
sqa_system/app/cli/handler/ | ../blob/experiments/sqa-system/sqa_system/app/cli/handler/ | Handlers that manage the CLI commands: configuration, experiments, pipelines, secrets, annotations, and QA-dataset commands. |
sqa_system/app/cli/menu/ | ../blob/experiments/sqa-system/sqa_system/app/cli/menu/ | Menus for displaying CLI options and configurations, including DatasetConfigMenu, PipeConfigMenu, and QAGeneratorMenu. |
The core component contains the supporting structures of the SQA system and is divided into different subfolders, each serving a specific purpose. Below is a table that describes the contents of each subfolder:
Directory | Reference | Description |
---|---|---|
sqa_system/core/ | ../blob/experiments/sqa-system/sqa_system/core/ | Contains the main logic for configurations, language models, logging, base classes, and data models. |
sqa_system/core/base/ | ../blob/experiments/sqa-system/sqa_system/core/base/ | Contains base classes for the system. |
sqa_system/core/config/ | ../blob/experiments/sqa-system/sqa_system/core/config/ | Contains configuration models, managers, and factories for different components. |
sqa_system/core/config/config_manager/ | ../blob/experiments/sqa-system/sqa_system/core/config/config_manager/ | Contains the ConfigurationManager classes for each unique configuration in the system. These are used to manage the configuration files and provide easy access to them. |
sqa_system/core/config/factory/ | ../blob/experiments/sqa-system/sqa_system/core/config/factory/ | Factory for creating various configurations. |
sqa_system/core/config/models/ | ../blob/experiments/sqa-system/sqa_system/core/config/models/ | Contains all the configuration data models that are used within the system. |
sqa_system/core/data/ | ../blob/experiments/sqa-system/sqa_system/core/data/ | Contains models and logic for file management, secrets, caching, and datasets. |
sqa_system/core/data/data_loader/ | ../blob/experiments/sqa-system/sqa_system/core/data/data_loader/ | Contains logic for loading data from various sources (e.g., JSON, CSV). |
sqa_system/core/data/models/ | ../blob/experiments/sqa-system/sqa_system/core/data/models/ | Contains all the data models that are used within the system. |
sqa_system/core/language_model/ | ../blob/experiments/sqa-system/sqa_system/core/language_model/ | Contains logic for loading language models and embedding models. Access to them is orchestrated by the LLMProvider ../blob/experiments/sqa-system/sqa_system/core/language_model/llm_provider.py class. |
sqa_system/core/logging/ | ../blob/experiments/sqa-system/sqa_system/core/logging/ | Contains logging logic and configurations. The logging settings are located here: ../blob/experiments/sqa-system/data/configs/logging_config/logging_config.yaml. |
The experimentation component contains the logic for running and visualizing experiments. The most important classes within the experimentation folder are:
- ExperimentRunner ../blob/experiments/sqa-system/sqa_system/experimentation/experiment_runner.py: This is the main class responsible for running the experiments. It orchestrates the entire experiment process, including data loading, pipeline preparation, and evaluation.
- ExperimentVisualizer ../blob/experiments/sqa-system/sqa_system/experimentation/utils/visualizer/experiment_visualizer.py: This is the class responsible for generating tables and plots about the experiment results. It provides various visualization methods to analyze the performance of different configurations and strategies.
- ExperimentConfigBuilder ../blob/experiments/sqa-system/sqa_system/experimentation/experiment_config_builder.py: This is a helper class for preparing the experiment configurations needed to set up an experiment.
The experimentation component is divided into different subfolders, each serving a specific purpose. Below is a table that describes the contents of each subfolder:
Directory | Reference | Description |
---|---|---|
sqa_system/experimentation/ | ../blob/experiments/sqa-system/sqa_system/experimentation/ | Contains the experiment runner and visualizer. The ExperimentRunner ../blob/experiments/sqa-system/sqa_system/experimentation/experiment_runner.py is the main class responsible for running the experiments. The ExperimentVisualizer ../blob/experiments/sqa-system/sqa_system/experimentation/utils/visualizer/experiment_visualizer.py is the class responsible for generating tables and plots about the experiment results. |
sqa_system/experimentation/evaluation/ | ../blob/experiments/sqa-system/sqa_system/experimentation/evaluation/ | Contains the implementations of the Evaluator classes. These are the main classes responsible for evaluating the results of the experiments by calculating the metric scores. |
sqa_system/experimentation/file_evaluator/ | ../blob/experiments/sqa-system/sqa_system/experimentation/file_evaluator/ | Contains the FileEvaluator class, which calculates metrics based on Evaluator classes even after the experiment is finished. This is useful, for example, if new metrics should be added after the experiment is finished. |
sqa_system/experimentation/utils/ | ../blob/experiments/sqa-system/sqa_system/experimentation/utils/ | Contains utility classes for the experiments. |
sqa_system/experimentation/utils/visualizer/ | ../blob/experiments/sqa-system/sqa_system/experimentation/utils/visualizer/ | Contains the ExperimentVisualizer class, which is responsible for generating tables and plots about the experiment results. |
sqa_system/experimentation/utils/executor/ | ../blob/experiments/sqa-system/sqa_system/experimentation/utils/executor/ | Contains experiment executors that implement several strategies for conducting the experiments, such as parallel processing, sequential processing, or experimentation without evaluation. |
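
The difference between the sequential and parallel executor strategies named in the table above can be illustrated with a minimal sketch. It does not reproduce the executors of the SQA system: `run_config`, `run_sequential`, and `run_parallel` are hypothetical names introduced only for this example, and the use of a thread pool is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_config(config_name: str) -> str:
    """Hypothetical stand-in for running one experiment configuration."""
    return f"result for {config_name}"


def run_sequential(configs: Iterable[str], run: Callable[[str], str]) -> List[str]:
    """Run each configuration one after another."""
    return [run(config) for config in configs]


def run_parallel(
    configs: Iterable[str], run: Callable[[str], str], max_workers: int = 4
) -> List[str]:
    """Run configurations concurrently; Executor.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, configs))


configs = ["baseline", "hublink"]
sequential_results = run_sequential(configs, run_config)
parallel_results = run_parallel(configs, run_config)
```

Both functions return the results in the order of the input configurations, so the two strategies can be swapped without changing the code that consumes the results.
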
The knowledge base component contains the logic for the knowledge bases of the system. The most important classes within the knowledge base folder are:
- KnowledgeGraphManager ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/knowledge_graph_manager.py: This is the main class responsible for loading and providing access to the knowledge graphs.
- VectorStoreManager ../blob/experiments/sqa-system/sqa_system/knowledge_base/vector_store/storage/vector_store_manager.py: This is the main class responsible for loading and providing access to the vector stores.
The knowledge base component is divided into different subfolders, each serving a specific purpose. Below is a table that describes the contents of each subfolder:
Directory | Reference | Description |
---|---|---|
sqa_system/knowledge_base/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/ | Contains logic for the system's knowledge bases. |
sqa_system/knowledge_base/knowledge_graph/storage/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/ | Contains logic for knowledge graphs, including both storage and retrieval. |
sqa_system/knowledge_base/knowledge_graph/storage/factory/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/factory/ | Contains all factory classes for generating knowledge graphs based on configuration. |
sqa_system/knowledge_base/knowledge_graph/storage/implementations/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/implementations/ | Contains the concrete implementations of all knowledge graphs that are supported in the SQA system. |
sqa_system/knowledge_base/knowledge_graph/storage/utils/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/utils/ | Contains utility classes for knowledge graphs, such as conversion of graphs to text, filtering, and path and subgraph building. |
sqa_system/knowledge_base/vector_store/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/vector_store/ | Contains the logic for the vector stores of the system. |
sqa_system/knowledge_base/vector_store/chunking/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/vector_store/chunking/ | Contains the logic for the chunking strategies of the system. |
sqa_system/knowledge_base/vector_store/storage/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/vector_store/storage/ | Contains the implementation logic of the vector stores. |
sqa_system/knowledge_base/vector_store/storage/factory/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/vector_store/storage/factory/ | Contains the factory classes that allow the system to generate the vector stores based on the VectorStoreConfig. |
sqa_system/knowledge_base/vector_store/storage/implementations/ | ../blob/experiments/sqa-system/sqa_system/knowledge_base/vector_store/storage/implementations/ | Contains the implementations of the vector store adapters, which are wrappers that allow vector stores to integrate with the system. |
The pipeline component contains the logic for the pipelines that can be used for processing data in the system. The most important classes within the pipeline component are:
- RetrievalPipeline ../blob/experiments/sqa-system/sqa_system/pipeline/retrieval_pipeline.py: This is the main class responsible for managing the retrieval pipeline. It orchestrates the entire retrieval process, including data loading, retrieval, and processing.
Directory | Reference | Description |
---|---|---|
sqa_system/pipe/ | ../blob/experiments/sqa-system/sqa_system/pipe/ | Contains the pipe implementations of the system. |
sqa_system/pipe/base/ | ../blob/experiments/sqa-system/sqa_system/pipe/base/ | Contains the base classes for the pipes. |
sqa_system/pipe/factory/ | ../blob/experiments/sqa-system/sqa_system/pipe/factory/ | Contains the factory class PipeFactory, which is responsible for creating pipes based on configurations. |
sqa_system/pipe/generation/ | ../blob/experiments/sqa-system/sqa_system/pipe/generation/ | Contains the generation pipe implementations, which generate the answer to the question based on the context that previous pipes have written into the PipeIOData model. |
sqa_system/pipe/post_retrieval/ | ../blob/experiments/sqa-system/sqa_system/pipe/post_retrieval/ | Contains the post-retrieval pipe implementations, which process the data that previous pipes have written into the PipeIOData model before it is forwarded to the generation pipe. |
sqa_system/pipe/pre_retrieval/ | ../blob/experiments/sqa-system/sqa_system/pipe/pre_retrieval/ | Contains the pre-retrieval pipe implementations, which process the data that previous pipes have written into the PipeIOData model before it is forwarded to the retriever. |
sqa_system/pipe/retrieval/ | ../blob/experiments/sqa-system/sqa_system/pipe/retrieval/ | Contains the retrieval pipe implementations, which retrieve data from the underlying knowledge base. |
sqa_system/pipeline/ | ../blob/experiments/sqa-system/sqa_system/pipeline/ | Contains the RetrievalPipeline ../blob/experiments/sqa-system/sqa_system/pipeline/retrieval_pipeline.py, which is the main class responsible for managing the retrieval pipeline. It orchestrates the entire retrieval process, including data loading, retrieval, and processing. |
sqa_system/pipeline/factory/ | ../blob/experiments/sqa-system/sqa_system/pipeline/factory/ | Contains the factory class that allows the system to create the pipeline based on the PipelineConfig. |
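
The table above describes pipes that pass a shared data model from pre-retrieval through retrieval and post-retrieval to generation. The following sketch illustrates that flow under simplifying assumptions: `PipeData` is a hypothetical stand-in for the PipeIOData model, the pipes are plain functions rather than the classes of the SQA system, and the toy document list replaces a real knowledge base.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipeData:
    """Hypothetical stand-in for the PipeIOData model."""
    question: str
    contexts: List[str] = field(default_factory=list)
    answer: str = ""


Pipe = Callable[[PipeData], PipeData]


def pre_retrieval(data: PipeData) -> PipeData:
    # Normalise the question before it reaches the retriever.
    data.question = data.question.strip()
    return data


def retrieval(data: PipeData) -> PipeData:
    # Toy knowledge base: keep documents that share a word with the question.
    documents = ["HubLink is a retriever", "Cats are mammals"]
    words = set(data.question.lower().rstrip("?").split())
    data.contexts = [d for d in documents if words & set(d.lower().split())]
    return data


def post_retrieval(data: PipeData) -> PipeData:
    # Remove duplicate contexts while keeping their order.
    data.contexts = list(dict.fromkeys(data.contexts))
    return data


def generation(data: PipeData) -> PipeData:
    # A real generation pipe would call an LLM; here the contexts are joined.
    data.answer = " ".join(data.contexts) or "no answer"
    return data


def run_pipeline(question: str, pipes: List[Pipe]) -> PipeData:
    """Pass one data object through every pipe in order."""
    data = PipeData(question=question)
    for pipe in pipes:
        data = pipe(data)
    return data


result = run_pipeline(
    "  What is HubLink?  ",
    [pre_retrieval, retrieval, post_retrieval, generation],
)
```

Because every pipe takes and returns the same data object, pipes can be added, removed, or reordered in the list without changing the others.
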
The QA generation component contains the logic for the QA generation of the system. The most important classes within the QA generation component are:
- FromTopicEntityGenerator ../blob/experiments/sqa-system/sqa_system/qa_generator/strategies/publication_subgraph_strategy/from_topic_entity_generator.py: This is the main class responsible for conducting the semi-automatic QA generation using the subgraph extraction strategy.
- ClusterBasedQuestionGenerator ../blob/experiments/sqa-system/sqa_system/qa_generator/strategies/clustering_strategy/cluster_based_question_generator.py: This is the main class responsible for conducting the semi-automatic QA generation using the clustering strategy.
The QA generation component is divided into different subfolders, each serving a specific purpose. Below is a table that describes the contents of each subfolder:
Directory | Reference | Description |
---|---|---|
sqa_system/qa_generator/ | ../blob/experiments/sqa-system/sqa_system/qa_generator/ | Contains the logic for the QA generation of the system. |
sqa_system/qa_generator/qa_dataset_graph_converter/ | ../blob/experiments/sqa-system/sqa_system/qa_generator/qa_dataset_graph_converter/ | Contains scripts to convert QA datasets based on the ORKG to different graph variants. |
sqa_system/qa_generator/strategies/ | ../blob/experiments/sqa-system/sqa_system/qa_generator/strategies/ | Contains the different strategies for the QA generation. |
sqa_system/qa_generator/strategies/clustering_strategy/ | ../blob/experiments/sqa-system/sqa_system/qa_generator/strategies/clustering_strategy/ | Contains the clustering strategy for the QA generation. |
sqa_system/qa_generator/strategies/publication_subgraph_strategy/ | ../blob/experiments/sqa-system/sqa_system/qa_generator/strategies/publication_subgraph_strategy/ | Contains the publication subgraph strategy for the QA generation. |
sqa_system/qa_generator/utils/ | ../blob/experiments/sqa-system/sqa_system/qa_generator/utils/ | Contains utility classes for the QA generator implementations. |
The retrieval component includes the implementations of all retrievers within the SQA system. The most important classes within the retrieval component are:
- HubLinkRetriever ../blob/experiments/sqa-system/sqa_system/retrieval/implementations/HubLink/hublink_retriever.py: This contains the implementation of our proposed KGQA approach named HubLink.
The retrieval component is divided into different subfolders, each serving a specific purpose. Below is a table that describes the contents of each subfolder:
Directory | Reference | Description |
---|---|---|
sqa_system/retrieval/ | ../blob/experiments/sqa-system/sqa_system/retrieval/ | Contains the implementations of the retrievers. |
sqa_system/retrieval/base/ | ../blob/experiments/sqa-system/sqa_system/retrieval/base/ | Contains the base classes that retrievers must implement to be compatible with the SQA system. |
sqa_system/retrieval/factory/ | ../blob/experiments/sqa-system/sqa_system/retrieval/factory/ | Contains the factory classes that allow the system to create the retrievers based on the RetrieverConfig. |
sqa_system/retrieval/implementations/ | ../blob/experiments/sqa-system/sqa_system/retrieval/implementations/ | Contains the implementations of the retrievers. |
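
The combination of a base class that retrievers implement and a factory that creates retrievers from a configuration, as described in the table above, might look roughly like the sketch below. All names are hypothetical: `ToyRetrieverConfig` stands in for RetrieverConfig, and `BaseRetriever`, `KeywordRetriever`, `REGISTRY`, and `create_retriever` are illustrative assumptions that do not mirror the actual classes in the repository.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Type


@dataclass
class ToyRetrieverConfig:
    """Hypothetical stand-in for RetrieverConfig."""
    retriever_type: str
    top_k: int = 2


class BaseRetriever(ABC):
    """Hypothetical base class that every retriever implements."""

    def __init__(self, config: ToyRetrieverConfig) -> None:
        self.config = config

    @abstractmethod
    def retrieve(self, question: str) -> List[str]:
        """Return the contexts found for the question."""


class KeywordRetriever(BaseRetriever):
    """Toy retriever that matches question words against fixed documents."""

    documents = [
        "HubLink retrieves from graphs",
        "Vector stores hold chunks",
        "HubLink uses hubs",
    ]

    def retrieve(self, question: str) -> List[str]:
        words = set(question.lower().split())
        hits = [d for d in self.documents if words & set(d.lower().split())]
        return hits[: self.config.top_k]


# Maps the type named in the configuration to the class to instantiate.
REGISTRY: Dict[str, Type[BaseRetriever]] = {"keyword": KeywordRetriever}


def create_retriever(config: ToyRetrieverConfig) -> BaseRetriever:
    """Create the retriever that the configuration asks for."""
    retriever_cls = REGISTRY.get(config.retriever_type)
    if retriever_cls is None:
        raise ValueError(f"Unknown retriever type: {config.retriever_type}")
    return retriever_cls(config)


retriever = create_retriever(ToyRetrieverConfig(retriever_type="keyword", top_k=1))
contexts = retriever.retrieve("hublink")
```

In this sketch, adding a retriever means subclassing the base class and registering it under a new type name; the code that calls `create_retriever` stays unchanged.
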
Several tests are provided to test various functionalities within the SQA system. These tests are located in the sqa-system/tests ../blob/experiments/sqa-system/tests/ folder and mirror the folder structure of the SQA system.