adding_a_new_kg - Sidies/MasterThesis-HubLink GitHub Wiki

title: Adding a new Knowledge Graph

This page explains how to add a new Knowledge Graph (KG) to the SQA system. The KG stores the triples that are used for the retrieval process and is part of the RAG pipeline implemented in the SQA system.

The SQA system already provides several KGs including:

ORKG: An implementation for accessing the Open Research Knowledge Graph (ORKG) using the ORKG API.
LocalKnowledgeGraph: An RDFLib implementation for creating a local knowledge graph using RDF files.
RDFFileGraph: An RDFLib implementation for using an existing RDF file as a knowledge graph.

Adding a new Knowledge Graph

The implementations of the KGs are located in the knowledge_base/knowledge_graph/storage/implementations folder: ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/implementations. To add a new KG, create a Python file for it in the implementations folder and name the file after the KG. The Python implementation of the KG then goes inside this file.

1. Implementation of the Knowledge Graph

The KG should be added as a subclass of the KnowledgeGraph class located in the knowledge_base/knowledge_graph/storage/base folder: ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/base. The KnowledgeGraph class is a base class for all knowledge graphs in the SQA system. It provides a common interface for all knowledge graphs.

You need to implement the following properties and methods in the class:

root_type: This property is used to specify what string is used to identify types in the KG. For example, in the ORKG, the type is "entity_type" and in the RDF file, the type is "rdf:type".
paper_type: This property is used to specify what string is used to identify papers in the KG. For example, in the ORKG, the paper type is "Paper".
get_random_publication(): This method is used to get a random paper from the KG.
is_publication_allowed_for_generation(): This method is called during the question-answer pair generation process. It checks whether a given paper may be used for the generation process.
validate_graph_connection(): This method is used to validate whether the KG is ready to be used.
get_main_triple_from_publication(): A main triple is a triple that should be used to represent the paper in the KG. For example, in the ORKG, the main triple is the triple that contains the DOI of the paper.
get_relations_of_tail_entity(): This method queries the KG for all triples that have the provided entity as the tail entity. To illustrate, if the input is "Paper1", the method should return all triples that have "Paper1" as the tail entity. The method should return a list of triples in the form of (head, relation, "Paper1").
get_relations_of_head_entity(): This method queries the KG for all triples that have the provided entity as the head entity. To illustrate, if the input is "Paper1", the method should return all triples that have "Paper1" as the head entity. The method should return a list of triples in the form of ("Paper1", relation, tail).
is_intermediate_id(): This method is used to check whether the provided ID can be used to query more information from the KG in its natural direction. In other words, it distinguishes entities that can be expanded further from leaf entities in the graph.
get_entities_by_predicate_id(): This method is used to get all entities that have a connection to another entity by a specific predicate. For example, if the input is "hasAuthor", the method should return all entities that have a connection to another entity by the predicate "hasAuthor".
get_entity_ids_by_types(): Given a specific type, the method should return all entities that are of that type. For example, if the input is "Paper", the method should return all entities that are of type "Paper".
get_types_of_entity(): This method is used to return the type(s) of an entity.
get_entity_by_id(): This method is used to return the entity associated with the provided ID. For example, if the input is "Paper1", the method should return the entity in the graph that is associated with "Paper1". If the IDs in the graph are unique, there is exactly one such entity.
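The query methods above can be illustrated with a small in-memory triple store. This is only a sketch under stated assumptions: the `Triple` type, the `InMemoryGraphSketch` class, and all parameter names and return types are hypothetical stand-ins, and the signatures of the actual KnowledgeGraph base class may differ. The sketch also assumes that an "intermediate" ID is one with at least one outgoing triple and that get_entities_by_predicate_id() returns the head entities.

```python
from typing import List, NamedTuple, Set


class Triple(NamedTuple):
    """Hypothetical stand-in for the triple type used by the SQA system."""
    head: str
    relation: str
    tail: str


class InMemoryGraphSketch:
    """Toy triple store. A real KG would subclass KnowledgeGraph instead."""

    root_type = "rdf:type"  # string that marks type relations in this toy graph
    paper_type = "Paper"    # string that identifies papers in this toy graph

    def __init__(self, triples: List[Triple]) -> None:
        self._triples = list(triples)

    def get_relations_of_head_entity(self, entity_id: str) -> List[Triple]:
        # All triples of the form (entity_id, relation, tail).
        return [t for t in self._triples if t.head == entity_id]

    def get_relations_of_tail_entity(self, entity_id: str) -> List[Triple]:
        # All triples of the form (head, relation, entity_id).
        return [t for t in self._triples if t.tail == entity_id]

    def is_intermediate_id(self, entity_id: str) -> bool:
        # Assumption: an ID is intermediate if it has outgoing triples,
        # i.e. it is not a leaf of the graph.
        return any(t.head == entity_id for t in self._triples)

    def get_entities_by_predicate_id(self, predicate: str) -> Set[str]:
        # Assumption: return the head entities connected via the predicate.
        return {t.head for t in self._triples if t.relation == predicate}

    def get_entity_ids_by_types(self, types: Set[str]) -> Set[str]:
        return {
            t.head for t in self._triples
            if t.relation == self.root_type and t.tail in types
        }

    def get_types_of_entity(self, entity_id: str) -> Set[str]:
        return {
            t.tail for t in self._triples
            if t.head == entity_id and t.relation == self.root_type
        }


graph = InMemoryGraphSketch([
    Triple("Paper1", "rdf:type", "Paper"),
    Triple("Paper1", "hasAuthor", "Author1"),
    Triple("Paper2", "cites", "Paper1"),
])
```

With this data, querying "Paper1" as head returns its type and author triples, while querying it as tail returns only the citation from "Paper2".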

2. Implementation of the Knowledge Graph Factory

The factory is responsible for creating instances of the KG based on the configuration. The SQA system provides two types of factories, both located in the knowledge_base/knowledge_graph/storage/factory/base folder: ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/factory/base.

KnowledgeGraphLoader: This class is responsible for loading or connecting to an existing KG without the intention of creating the graph from scratch or adding data to it. It is essentially a read-only connection to the KG.
KnowledgeGraphBuilder: This class is responsible for creating a new KG from scratch or adding data to an existing KG.

Your factory implementation should be added to the knowledge_base/knowledge_graph/storage/factory/implementations folder: ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/factory/implementations. To add a new factory, you need to create a Python file for the factory in the implementations folder. When implementing a KnowledgeGraphBuilder, you need to implement the following methods:

get_knowledge_graph_class(): A simple method that returns the class of the KG that the factory creates.
_create_knowledge_graph(): This method is used to create the KG. It also receives a list of publications to add their data to the graph. It should return an instance of the KG class.

When implementing a KnowledgeGraphLoader, you need to implement the following methods:

get_knowledge_graph_class(): A simple method that returns the class of the KG that the factory creates.
_load_knowledge_graph(): This method is used to load the KG. It should return an instance of the KG class.
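The division of work between the two loader methods could look roughly like the following sketch. `LoaderSketch` is a hypothetical stand-in for KnowledgeGraphLoader, and `ToyGraph`, the `create()` helper, and the dictionary-based `config` argument are assumptions made for illustration. The real base class and its configuration objects may look different.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Type


class ToyGraph:
    """Hypothetical KG class; stands in for a KnowledgeGraph subclass."""

    def __init__(self, source: str) -> None:
        self.source = source


class LoaderSketch(ABC):
    """Stand-in for KnowledgeGraphLoader; the real base class may differ."""

    @classmethod
    @abstractmethod
    def get_knowledge_graph_class(cls) -> Type[Any]:
        """Return the class of the KG that this factory creates."""

    @abstractmethod
    def _load_knowledge_graph(self, config: Dict[str, Any]) -> Any:
        """Load or connect to the KG and return an instance of it."""

    def create(self, config: Dict[str, Any]) -> Any:
        # Hypothetical entry point: load the graph, then check that the
        # factory returned the class it advertises.
        graph = self._load_knowledge_graph(config)
        expected = self.get_knowledge_graph_class()
        if not isinstance(graph, expected):
            raise TypeError(f"Expected {expected.__name__}, got {type(graph).__name__}")
        return graph


class ToyGraphLoader(LoaderSketch):
    """Example factory for ToyGraph."""

    @classmethod
    def get_knowledge_graph_class(cls) -> Type[Any]:
        return ToyGraph

    def _load_knowledge_graph(self, config: Dict[str, Any]) -> Any:
        # "file_path" is an assumed configuration key for this sketch.
        return ToyGraph(source=config["file_path"])


loaded_graph = ToyGraphLoader().create({"file_path": "papers.ttl"})
```

A KnowledgeGraphBuilder would follow the same pattern, with _create_knowledge_graph() additionally receiving the list of publications to add to the graph.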

Additionally, your implementation might require some configuration parameters. To add them, append them to the ADDITIONAL_CONFIG_PARAMS list in the factory class. Each parameter should be of type AdditionalConfigParameter, for example:

AdditionalConfigParameter(
    name="orkg_base_url",
    description=("The Base URL of the ORKG instance to connect to."),
    param_type=str,
    default_value="https://sandbox.orkg.org"
)
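To make the role of these fields more concrete, the following sketch shows how such parameter declarations could be combined with user-supplied values. `ConfigParamSketch` is a hypothetical dataclass that mirrors the four fields shown above, and `resolve_params()` is an invented helper. How the SQA system actually resolves and validates these parameters is not covered on this page.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ConfigParamSketch:
    """Hypothetical stand-in mirroring the fields of AdditionalConfigParameter."""
    name: str
    description: str
    param_type: type
    default_value: Any = None


ADDITIONAL_CONFIG_PARAMS: List[ConfigParamSketch] = [
    ConfigParamSketch(
        name="orkg_base_url",
        description="The Base URL of the ORKG instance to connect to.",
        param_type=str,
        default_value="https://sandbox.orkg.org",
    ),
]


def resolve_params(
    params: List[ConfigParamSketch], overrides: Dict[str, Any]
) -> Dict[str, Any]:
    """Fill in defaults for missing values and check the declared types."""
    resolved: Dict[str, Any] = {}
    for param in params:
        value = overrides.get(param.name, param.default_value)
        if not isinstance(value, param.param_type):
            raise TypeError(
                f"{param.name} must be of type {param.param_type.__name__}"
            )
        resolved[param.name] = value
    return resolved


defaults = resolve_params(ADDITIONAL_CONFIG_PARAMS, {})
custom = resolve_params(ADDITIONAL_CONFIG_PARAMS, {"orkg_base_url": "https://orkg.org"})
```

Declaring the type and default next to the name lets the configuration layer reject invalid values before the factory is invoked.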

3. Registration in the Knowledge Graph Factory Registry

After the implementation of the KG and the factory, you need to register it in the KnowledgeGraphFactoryRegistry class located in ../blob/experiments/sqa-system/sqa_system/knowledge_base/knowledge_graph/storage/knowledge_graph_factory_registry.py. The KnowledgeGraphFactoryRegistry class is responsible for registering all the KGs and their factories in the SQA system. Once your KG and factory are registered, they can be easily accessed through the configuration system of the SQA system.

The registration is done in the _register_factories() method of the KnowledgeGraphFactoryRegistry class. Add the factory of your KG to this method:

def _register_factories(self):
    """
    Here we initialize the factories that are available 
    in the QA system. When adding a new graph, it is crucial
    to register it in the registry either at runtime or here.
    After the factory is registered, nothing more needs to be
    done. The Knowledge Graph Manager will now recognize its
    existence.
    """
    self.register_factory("orkg", ORKGKnowledgeGraphFactory)
    self.register_factory("local_rdflib", LocalKnowledgeGraphFactory)
    self.register_factory("rdf_file", RDFFileGraphFactory)
    # Add your new factory here

Note: The first parameter of the register_factory() method is the name of the factory. This name is used in the configuration files to specify which KG to use. The second parameter is the factory class that you implemented.
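The name-to-class mapping behind this registration can be sketched as follows. `RegistrySketch`, `get_factory()`, and the two placeholder factory classes are hypothetical; only register_factory() and _register_factories() are named in the text above, and their real implementations may differ.

```python
from typing import Dict, Type


class ToyGraphFactory:
    """Placeholder for a factory class you implemented."""


class OtherGraphFactory:
    """Second placeholder, registered at runtime below."""


class RegistrySketch:
    """Stand-in for KnowledgeGraphFactoryRegistry; the real class may differ."""

    def __init__(self) -> None:
        self._factories: Dict[str, Type] = {}
        self._register_factories()

    def _register_factories(self) -> None:
        # Built-in registrations; a new factory would be added here.
        self.register_factory("toy_graph", ToyGraphFactory)

    def register_factory(self, name: str, factory_class: Type) -> None:
        # The name is the key that configuration files refer to.
        if name in self._factories:
            raise ValueError(f"A factory named '{name}' is already registered")
        self._factories[name] = factory_class

    def get_factory(self, name: str) -> Type:
        # Hypothetical lookup used when a configuration names a KG.
        try:
            return self._factories[name]
        except KeyError:
            raise ValueError(f"No factory registered under '{name}'") from None


registry = RegistrySketch()
# Registration at runtime, as an alternative to editing _register_factories().
registry.register_factory("other_graph", OtherGraphFactory)
```

Because the registry stores the factory class rather than an instance, the system can decide when to instantiate it based on the configuration.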

🥳 That's it! You have successfully added a new Knowledge Graph to the SQA system. You can now use it in the retrieval process and in the RAG pipeline.