

Creating a New KGQA Dataset

This page explains how a new KGQA dataset can be created with the help of the SQA system. At its core, the SQA system provides two types of generation strategies:

  1. Clustering Strategy: This strategy identifies and groups all potentially relevant facts before generating the question.
  2. Subgraph Strategy: This strategy focuses on generating diverse question-answer pairs relating to a single publication at a time. Unlike the cluster-based approach, it does not inherently generate questions spanning multiple publications, but its less restrictive nature enables the generation of a wider variety of question types based on the local graph structure.

In the following, we will explain how these strategies can be used to create a new KGQA dataset.

Creating a New KGQA Dataset

Before you can create a new KGQA dataset, you need to prepare the Knowledge Graph and the LLM adapter. The Knowledge Graph is the source of the information that will be used to generate the questions and answers, while the LLM adapter is responsible for generating the questions based on the context and answer.

You can import the necessary classes from the SQA system as follows:

from sqa_system.core.config.models import LLMConfig, KnowledgeGraphConfig, EmbeddingConfig
from sqa_system.knowledge_base.knowledge_graph.storage import KnowledgeGraphManager
from sqa_system.core.language_model.llm_provider import LLMProvider

The KnowledgeGraphConfig class is used to configure the Knowledge Graph. With this configuration, you use the KnowledgeGraphManager to instantiate a new Knowledge Graph. For example, you can configure the Knowledge Graph as follows:

kg_config = KnowledgeGraphConfig.from_dict({
    "additional_params": {
        "contribution_building_blocks": {
            "Classifications_2": [
                "paper_class",
                "research_level",
                "all_research_objects",
                "validity",
                "evidence"
            ]
        },
        "force_cache_update": True,
        "force_publication_update": False,
        "subgraph_root_entity_id": "R659055",
        "orkg_base_url": "https://sandbox.orkg.org"
    },
    "graph_type": "orkg",
    "dataset_config": {
        "name": "merged_ecsa.json_jsonpublicationloader_limit-1",
        "additional_params": {},
        "file_name": "merged_ecsa_icsa.json",
        "loader": "JsonPublicationLoader",
        "loader_limit": -1
    },
    "extraction_llm": {
        "name": "openai_gpt-4o-mini_tmp0.0_maxt-1",
        "additional_params": {},
        "endpoint": "OpenAI",
        "name_model": "gpt-4o-mini",
        "temperature": 0.0,
        "max_tokens": -1
    },
    "extraction_context_size": 4000,
    "chunk_repetitions": 2
})

graph = KnowledgeGraphManager().get_item(kg_config)

The LLMConfig class is used to configure the LLM adapter. It is then used with the LLMProvider to instantiate a new LLM adapter. For example, you can configure the LLM adapter as follows:

gpt_4o_mini_config = LLMConfig.from_dict({
    "endpoint": "OpenAI",
    "name_model": "gpt-4o-mini",
    "temperature": 0.0,
    "max_tokens": -1
})

gpt_4o_mini = LLMProvider().get_llm_adapter(gpt_4o_mini_config)

The EmbeddingConfig class is used to configure the embedding model. It is only required for the Clustering Strategy. For example, you can configure the embedding model as follows:

embedding_config = EmbeddingConfig.from_dict({
    "name": "openai_text-embedding-3-small",
    "additional_params": {},
    "endpoint": "OpenAI",
    "name_model": "text-embedding-3-small"
})

Using the Subgraph Strategy

The subgraph strategy is realized by the FromTopicEntityGenerator class, located in the sqa_system/qa_generator/strategies/publication_subgraph_strategy/ directory in the file from_topic_entity_generator.py.

To create question-answer pairs with this class, you first need to instantiate it. You can import it as follows:

from sqa_system.qa_generator.strategies import (
    FromTopicEntityGenerator, 
    FromTopicEntityGeneratorOptions, 
    GenerationOptions)

The class interface looks as follows:

class FromTopicEntityGenerator(SubgraphStrategy):
    def __init__(self,
                 graph: KnowledgeGraph,
                 llm_adapter: LLMAdapter,
                 from_topic_entity_options: FromTopicEntityGeneratorOptions,
                 options: GenerationOptions):

The parameters are:

  • graph: The knowledge graph to be used for the generation.
  • llm_adapter: The LLM adapter which generates the question based on the contexts and the answer.
  • from_topic_entity_options: The options for the subgraph strategy.
  • options: The options for the question generation strategy.

To generate questions using this strategy, the FromTopicEntityGenerator class is instantiated and its generate() method is called. For example:

qa_pairs = []

qa_strategy = FromTopicEntityGenerator(
    graph=graph,
    llm_adapter=gpt_4o_mini,
    from_topic_entity_options=FromTopicEntityGeneratorOptions(
        topic_entity=graph.get_entity_by_id("R870141"),
    ),     
    options=GenerationOptions(
        template_text="What is the replication package link of the paper '[paper_title]'?",
        additional_requirements=[
            "The generated question should include the title of the paper.",
            "The context should only include the triple of replication package link.",
        ],
        validate_contexts=False,
        convert_path_to_text=False,
        classify_questions=False,
    )     
)
qa_pairs.extend(qa_strategy.generate())

In this example, we create a question that asks for the replication package link of a specific paper. The topic_entity is set to the paper of interest from which the question should be generated. The template_text exemplifies the question that should be generated. The additional_requirements provide additional context or instructions to the LLM to guide the generation process. Here, we specify that the generated question should include the title of the paper and that the golden triples should only include the replication package link triple.
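
To inspect the generated pairs, a small helper like the following can be used. Note that the attribute names question and golden_answer are assumptions for illustration; check the question-answer pair model of the SQA system for the actual field names. The same helper is assumed by the print_qa_pairs call in the clustering example below:

def print_qa_pairs(qa_pairs):
    """Prints each generated question-answer pair for manual inspection."""
    for index, pair in enumerate(qa_pairs, start=1):
        # NOTE: 'question' and 'golden_answer' are assumed attribute names.
        print(f"--- Pair {index} ---")
        print(f"Question: {pair.question}")
        print(f"Answer:   {pair.golden_answer}")

print_qa_pairs(qa_pairs)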

Using the Clustering Strategy

The clustering strategy is realized by the ClusterBasedQuestionGenerator class, located in the sqa_system/qa_generator/strategies/clustering_strategy/ directory in the file cluster_based_question_generator.py.

To create question-answer pairs with this class, you first need to instantiate it. You can import it as follows:

from sqa_system.qa_generator.strategies.clustering_strategy.cluster_based_question_generator import (
    ClusterBasedQuestionGenerator, 
    ClusterGeneratorOptions, 
    AdditionalInformationRestriction,
    ClusterStrategyOptions,
    GenerationOptions
)

As the number of imports suggests, the ClusterBasedQuestionGenerator class requires many different options, which we explain below. But first, the class interface looks as follows:

class ClusterBasedQuestionGenerator(ClusteringStrategy):
    def __init__(self,
                 graph: KnowledgeGraph, 
                 llm_adapter: LLMAdapter,
                 cluster_options: ClusterStrategyOptions,
                 generator_options: ClusterGeneratorOptions):

The parameters are:

  • graph: The knowledge graph to be used for the generation.
  • llm_adapter: The LLM adapter which generates the question based on the contexts and the answer.
  • cluster_options: The options for the clustering strategy.
  • generator_options: The options for the question generation strategy.

To generate questions, you need to instantiate the ClusterBasedQuestionGenerator class with the above parameters and then run the generate() method. For example:

qa_pairs = ClusterBasedQuestionGenerator(
    graph=graph,
    llm_adapter=gpt_4o_mini,
    generator_options=ClusterGeneratorOptions(
        generation_options=GenerationOptions(
            template_text="Which publications, ranked in descending order of their publication year, have [author name] as an author have and have evaluation method [evaluation_method_name]?",
            additional_requirements=[
                "The context should only include the triples that contain the evaluation methods of the paper",
                "The answer should be a list of publication titles in chronological order",
                "Ensure that the list is ordered correctly based on the publication year in descending order"
            ],
            convert_path_to_text=False,
            validate_contexts=False,
            classify_questions=False,
        ),
        additional_restrictions=[
            AdditionalInformationRestriction(
                information_predicate="Evaluation method",
                split_clusters=True
            ),            
            AdditionalInformationRestriction(
                information_predicate="publication year",
            )
        ]
    ),
    cluster_options=ClusterStrategyOptions(
        topic_entity=research_field,
        restriction_type="R659055",
        restriction_text="authors",
        cluster_eps=0.1,
        cluster_metric="cosine",
        cluster_emb_config=embedding_config,
        soft_limit_qa_pairs=10,
        golden_triple_limit=10,
        enable_caching=False
    )
).generate()

print_qa_pairs(qa_pairs)

As you can see in this example, our intention is to create a question that asks for the titles of all publications that share a specific author and evaluation method. To realize this, we first filter the publications based on the predicate R659055, which corresponds to the research field in which the papers of interest are stored. Then we further filter by the predicate authors. The strategy collects all author names and then conducts a DBSCAN clustering based on the cosine similarity of the embeddings of the author names. This aggregates the publications of the same author into a single cluster.
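
To make this clustering step more tangible, the following standalone sketch mimics it with scikit-learn's DBSCAN on placeholder embedding vectors. In the SQA system, the embeddings are produced by the model configured via cluster_emb_config; the vectors and author names below are made up for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder embeddings for four author name strings; in the SQA system
# these would come from the configured embedding model.
author_embeddings = np.array([
    [0.90, 0.10, 0.00],  # "J. Doe"
    [0.88, 0.12, 0.00],  # "John Doe"    -> nearly parallel to "J. Doe"
    [0.00, 0.20, 0.95],  # "A. Smith"
    [0.00, 0.22, 0.90],  # "Alice Smith" -> nearly parallel to "A. Smith"
])

# eps and metric play the role of cluster_eps and cluster_metric.
labels = DBSCAN(eps=0.1, metric="cosine", min_samples=1).fit_predict(author_embeddings)
print(labels)  # [0 0 1 1]: each pair of name variants lands in the same cluster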

The question should include two more restrictions, which we add as AdditionalInformationRestriction objects. The first restriction is based on the evaluation method and requires the strategy to go through the publications of each author cluster and collect all evaluation methods. Because splitting is enabled, the publications are split by evaluation method. As a result, we obtain clusters in which all publications correspond to the same author and share the same evaluation method. We then further filter the clusters by the publication year, which is done by adding a second AdditionalInformationRestriction object that checks for the publication year predicate.

In the following, we explain the parameters of the ClusterBasedQuestionGenerator class in more detail.

Parameter options

The strategies require a number of parameters to be set to configure the generation process. In the following, each options class is explained.

Cluster Strategy Options

The clustering strategy requires two types of options: ClusterStrategyOptions and ClusterGeneratorOptions. Let's look at each of the options in detail.

The ClusterStrategyOptions is instantiated with the following parameters (a combined usage sketch follows the list):

  • restriction_type: This is a string which corresponds to a predicate type in the graph. Based on this given type, all publications from the graph are retrieved that include such a predicate. These are the initial publications. This can for example be used to retrieve all publications that correspond to a specific research field or author.
  • restriction_text: This is another string that also corresponds to a predicate type in the graph. It is used to further filter the initial publications. Each subgraph of the initial publications is searched to check whether the predicate type is present. If it is, the publication is kept; otherwise, it is removed.
  • restriction_value: (Optional) This is a string or a list of strings that directly corresponds to the restriction_text. It is used when a predicate type is found to check the value of the triple in which the predicate type is present. If the value is not equal to the restriction_value, the publication is removed. This can for example be used to remove all publications that do not correspond to a specific author or authors.
  • cluster_eps: (Optional) This is a float value used to apply a DBSCAN clustering algorithm to the subgraphs, based on the values of the triples that conform to the restriction_text. The value of the parameter is the epsilon value of the DBSCAN algorithm.
  • cluster_metric: (Optional) This is a string that corresponds to the metric used for the DBSCAN clustering algorithm. The default value is cosine but it can also be set to euclidean or manhattan.
  • cluster_emb_config: (Optional) This is an Embedding Configuration that is used to create the embedding model for the DBSCAN clustering algorithm. It is based on the EmbeddingConfig class for which we have shown an example above.
  • skip_similarity_clustering: This is a boolean value to skip the similarity clustering, essentially disabling the additional clustering using the DBSCAN algorithm.
  • golden_triple_limit: This is an integer value that limits the number of triples that a single question-answer pair can have. Essentially, if the generated pair has more than this number of triples, it is discarded. This can be used to remove question-answer pairs that are too complex.
  • golden_triple_minimum: This is an integer value that sets the minimum number of triples that a single question-answer pair must have. Essentially, if the generated pair has fewer than this number of triples, it is discarded. This can be used to remove question-answer pairs that are too simple.
  • soft_limit_qa_pairs: This is an integer value that limits the number of question-answer pairs that can be generated. If the limit is reached, the generation process stops. It is a soft limit as the LLM might generate more questions in a single call before the limit is checked.
  • topic_entity: (Optional) This is a Knowledge entity from the graph that is used as an entry point into the graph. It is not required by the generation strategy but is later provided with the question-answer pair to help the retriever if needed.
  • topic_entity_description: (Optional) This is a string that describes the topic entity. This description will be incorporated into the question after the generation process.
  • skip_clusters_with_only_one_root: This is a boolean value. When true, all clusters that contain only one publication are skipped. This is useful to ensure that the generated question is not based on a single publication but rather on a cluster of publications.
  • enable_caching: This is a boolean value that enables caching of the subgraphs that are retrieved as part of the generation process. This is useful to speed up the generation process for subsequent calls.
  • use_predicate_as_value: In some cases, the predicate text is more meaningful than the associated object value of the RDF triple. For example, in (Paper Title, hasUsedGuidelines, True), the object only contains a boolean value which is not very meaningful, while the predicate carries the actual meaning. When enabled, the predicate text is used as the value instead.
  • limit_restrictions: (Optional) Each triple in a cluster is referred to as a restriction which is later used as a golden triple. If the number of restrictions is greater than the limit, the cluster is skipped. This is useful to ensure that the generated question does not contain too many golden triples.
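
To illustrate how these parameters interact, here is a hedged sketch that restricts the initial publications to a single author via restriction_value. The author name "Jane Example" is a hypothetical placeholder; R659055 is the research field identifier from the example above:

cluster_options = ClusterStrategyOptions(
    restriction_type="R659055",             # collect initial publications by this predicate type
    restriction_text="authors",             # filter and cluster the publications by author
    restriction_value=["Jane Example"],     # hypothetical: keep only publications by this author
    cluster_eps=0.1,                        # epsilon value for the DBSCAN clustering
    cluster_metric="cosine",                # distance metric for the DBSCAN clustering
    cluster_emb_config=embedding_config,    # embedding model used to embed the author names
    golden_triple_minimum=2,                # discard pairs with fewer than two golden triples
    golden_triple_limit=10,                 # discard pairs with more than ten golden triples
    soft_limit_qa_pairs=5,                  # stop once roughly five pairs have been generated
    skip_clusters_with_only_one_root=True,  # ignore clusters with a single publication
    enable_caching=True                     # cache retrieved subgraphs for subsequent calls
)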

Cluster Generator Options

The ClusterGeneratorOptions is instantiated with the following parameters (a brief sketch follows the list):

  • generation_options: A GenerationOptions object that contains the options for the generation process. This is explained in more detail below.
  • additional_restrictions: A list of AdditionalInformationRestriction objects that are used to further filter the clusters by additional restrictions. The order of the restrictions is important as the first restriction is applied first and so on.
  • only_use_cluster_with_most_triples: This is a boolean value. If enabled, only the cluster with the most triples is used for the generation process.
  • only_use_cluster_with_least_triples: This is a boolean value. If enabled, only the cluster with the least triples is used for the generation process.
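
As a brief sketch, the following configuration generates questions only from the cluster that aggregates the most triples; the template text is a placeholder:

generator_options = ClusterGeneratorOptions(
    generation_options=GenerationOptions(
        template_text="Which publications share the author [author_name]?",
        additional_requirements=[],
        validate_contexts=False,
        convert_path_to_text=False,
        classify_questions=False,
    ),
    additional_restrictions=[],               # no further keyword-based filtering
    only_use_cluster_with_most_triples=True   # keep only the densest cluster
)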

Generation Options

The GenerationOptions is instantiated with the following parameters (a sketch follows the list):

  • additional_requirements: A list of strings that are appended to the generation prompt of the LLM. This can be used to provide additional context or instructions to the LLM to guide the generation process.
  • template_text: (Optional) A string that is passed with the generation prompt to the LLM. The LLM is instructed to generate a question based on the template text. This should be a natural language question with placeholders, or it can also be an instruction.
  • validate_contexts: A boolean value that enables validation of the contexts. If enabled, an additional LLM call is made that checks whether the question and answer are based on the context. If not, the question is discarded.
  • convert_path_to_text: Whether the triples are converted into a textual representation using an LLM. This can be useful if the LLM struggles to understand the triples.
  • classify_questions: A boolean value that enables classification of the generated questions using a taxonomy. This feature is not fully implemented yet.
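
For example, a stricter configuration that validates each generated pair against its contexts could look as follows; the template text and the requirement are placeholders:

generation_options = GenerationOptions(
    template_text="Which evaluation method is used in the paper '[paper_title]'?",
    additional_requirements=[
        "The generated question must mention the paper title verbatim."
    ],
    validate_contexts=True,      # additional LLM call discards ungrounded pairs
    convert_path_to_text=True,   # hand the LLM a textual form of the triples
    classify_questions=False     # taxonomy classification is not fully implemented yet
)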

Additional Information Restriction

The AdditionalInformationRestriction is used to further filter the clusters. This filtering is not similarity-based but relies on keyword matching (a sketch follows the list):

  • information_predicate: A string that corresponds to a predicate type in the graph. The subgraph that is checked is required to include this predicate.
  • information_value_restriction: A string or a list of strings that directly corresponds to the information_predicate. It is used when a predicate type is found to check the value of the triple in which the predicate type is present. If the value is not equal to the information_value_restriction, the publication is removed. This can for example be used to remove all publications that do not correspond to a specific author or authors.
  • information_value_predicate_restriction: A string or a list of strings that checks the triple that contains the predicate directly (as opposed to traversing to the leaf node as it is done with the information_value_restriction). This can be used to check the value of the predicate itself.
  • split_clusters: When applying the restriction filtering, only those publications that conform to the restrictions are kept, and they are aggregated into one cluster. However, if splitting is enabled, the publications are instead distributed across multiple clusters, one for each unique restriction value.
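
For instance, a restriction that keeps only publications evaluated with a case study and aggregates them into a single cluster could be sketched as follows; the value "Case Study" is a placeholder:

evaluation_restriction = AdditionalInformationRestriction(
    information_predicate="Evaluation method",     # the subgraph must contain this predicate
    information_value_restriction=["Case Study"],  # placeholder: keep only matching values
    split_clusters=False                           # aggregate all matches into one cluster
)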

FromTopicEntityGeneratorOptions

The FromTopicEntityGeneratorOptions is instantiated with the following parameters (a sketch follows the list):

  • topic_entity: (Optional) A Knowledge object that represents the topic entity for the question generation.
  • topic_entity_type: (Optional) If the topic entity is not known beforehand, an entity can be selected randomly. This is the type of the entities that are collected from the graph before one is chosen at random.
  • topic_entity_substring: (Optional) This further restricts the list of collected entities to only those that contain the string as a substring.
  • maximum_subgraph_size: The maximum size that the subgraph can have when provided to the LLM. If the subgraph is larger than this size, it is reduced to this size. Be aware that this might remove interesting information from the subgraph as the reduction is based on randomness.
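
If no specific topic entity is known beforehand, the options can instead describe how to pick one at random. The following sketch assumes a hypothetical entity type "Paper" and the substring "Architecture":

from_topic_entity_options = FromTopicEntityGeneratorOptions(
    topic_entity_type="Paper",              # hypothetical: collect all entities of this type
    topic_entity_substring="Architecture",  # keep only entities containing this substring
    maximum_subgraph_size=100               # cap the size of the subgraph handed to the LLM
)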