KGQA Corpus Generation - statnett/Talk2PowerSystem GitHub Wiki

Introduction

This page describes the process for generating a Knowledge Graph Question Answering (KGQA) corpus. The process is intended to be largely automated yet flexible enough to support all user stories, so it is unlikely to be fully automated within the scope of the project. The aim is to make regenerating and extending the KGQA corpus quick and easy, but some human oversight will always be required.

The full KGQA generation process consists of two distinct steps. The first is Template Generation, which uses the ontology and some statistics from the KG to produce a variety of interesting templates based on valid graph patterns. The second is Question Instantiation, which combines the output of the previous step with manually added (more complex) templates and access to the KG to generate question-answer pairs.

We'll look at the details of each step, but it is important to understand that the final result of the KGQA corpus generation process is a KGQA dataset. This dataset contains both the templates and the specific questions based on each template. Each question in it should be answerable by the chatbot. The final dataset can be used either for evaluation or for providing hints with the n-shot tool, and it is large enough to be split and used for both.

Current Limitations

The current generation process has several limitations. Most will be addressed over the course of the project, depending on priorities.

Multiple outputs

A few of the questions identified in user stories require more than one output variable, but most ask for only a single one. Currently, the template generation step only produces templates with a single output. The question instantiation step can support multiple outputs, so for now multiple-output questions are added through the manual templates.

There is a huge variety in possible multiple output questions. They will gradually be added to the template generation step as well once they are clarified.

Time series data

Currently there are no questions related to time series data. Support will be added for the purpose of evaluating chatbot performance after it is capable of querying time series.

Geographical data

A few of the target questions reference working with geographical regions specified at query time. Currently, the process works with all predefined regions but does not include GeoSPARQL.

Dataset format and metadata

Currently the dataset is a YAML file with a list of templates and the questions for each template. Eventually we plan to store the dataset in GraphDB with metadata about its generation time and parameters.

Template Generation

The template generation process is maximally generic. Its primary goal is to cover as much of the ontology with valid question templates as possible. There is a trade-off between coverage and complexity here: generating all very complex graph patterns would lead to an explosion in templates beyond our ability to handle. Reasonable limits are placed on it for the moment, which are discussed in the section on pattern exploration.

Output

A current version with the full output has been added to the Talk2PowerSystem_LLM project.

Pattern Exploration

Pattern exploration starts from a given kind of node, e.g. cim:Line or cim:Switch, and generates list templates for it.

First, it generates the simplest pattern possible: simple lists of objects of that class, e.g. list all lines, list all switches.

Then it identifies all direct parameters and filters of the class from the ontology and dataset statistics. It generates templates using any one of them e.g. "list all lines in {$region}", "list all switches that are normally {$Switch.normalOpen}"

Then it identifies multi-hop parameters and filters by following "important" (heuristic) connecting nodes. It generates templates using any one of them e.g. "list all lines in the same region as {$line}"

Then it generates combinations of any of the above parameters and filters e.g. "list all breakers in {$terminal} that are normally {$Switch.normalOpen}", "Which lines in {$Line.Region/SubGeographicRegion.Region} have part {$ACLineSegment}?"

The latest generated template dataset is uploaded in the Talk2PowerSystem_LLM project and it has 486 templates. It has the following limitations:

Only based on 30 classes out of 100+
Only combines up to two parameters and filters in the same template
Only goes up to two hops
Only hop over the 3 most common properties
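
The combination step described above can be sketched as follows. This is a minimal illustration rather than the project's generator: the function `list_templates`, the `constraints` input format, and the way fragments are joined are all assumptions made for the example.

```python
from itertools import combinations


def list_templates(class_label, constraints, max_combined=2):
    """Yield list-style template strings for one class.

    `constraints` holds one natural-language fragment per parameter or
    filter found for the class (a hypothetical input format).
    """
    # r = 0 gives the plain "list all X" template, r = 1 the templates with
    # a single parameter or filter, r = 2 the pairwise combinations.
    for r in range(max_combined + 1):
        for combo in combinations(constraints, r):
            yield " ".join([f"list all {class_label}", *combo])


templates = list(list_templates(
    "switches",
    ["in {$region}", "that are normally {$Switch.normalOpen}"],
))
for template in templates:
    print(template)
```

With two candidate constraints and `max_combined=2`, this produces four templates: the plain list, one per constraint, and one that combines both.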

Parameter vs Filter

The distinction between parameter and filter is somewhat technical. A parameter is a URI. A filter is a literal. This is important for a few reasons:

Parameters could be followed further
The way parameters and filters are referred to in natural language questions is different
- Parameters can be referred to as name, full mrid, significant mrid, by type + one of the above, by partial name
- Filters need to be a value that exists in the dataset
- In the future, filters might be comparative (less than, in range, etc.), with values that are also selected from the dataset

Multiple hops

The heuristic here is based on counts of connections between classes in the dataset. We look up the N most common properties leading out of the current node and we jump up to M-1 times. So:

1-hop is combinations of all direct parameters and filters i.e. basic
2-hop is combinations of all direct and three most common adjacent parameters and filters
3-hop is combinations of all direct, N adjacent and (N adjacent)*(N adjacent) parameters and filters

It should be pretty clear that increasing the number of hops indiscriminately will quickly lead to an explosion in the number of templates and to some truly bizarre questions, which is why we are limiting generation to 2-hop over only 3 properties for now.
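
To make the growth concrete, here is a rough count of how many classes contribute parameters and filters to one starting class. It is only an upper-bound sketch based on the N and M described above; the function name `classes_considered` is hypothetical, and real counts will be lower whenever different paths reach the same class.

```python
def classes_considered(n_properties: int, max_hops: int) -> int:
    """Upper bound on the classes whose parameters and filters are combined.

    The starting class counts as 1. Each additional hop follows up to
    n_properties properties from every class reached by the previous hop.
    """
    return sum(n_properties ** k for k in range(max_hops))


for hops in (1, 2, 3, 4):
    print(hops, classes_considered(3, hops))
```

For N = 3 this gives 1, 4, 13 and 40 classes for 1 to 4 hops, and the number of templates grows faster still because parameters and filters from all of those classes are combined.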

Question Instantiation

The workflow for question instantiation is entirely separate from the template generation workflow. In fact, in order to support more complex templates than the template generation workflow currently produces, it is assumed that some especially complex, manually created templates are added to the input here. This allows us to support all identified questions/user stories while we work on extending template generation.

It is important to understand the steps of the process, their limitations and steps intended for future improvement.

Input Template

Simple:

- connecting_properties: []
  description: A graph pattern which lists all values for line limited by parameters
    for none and by filter properties none . This pattern uses connecting properties
    none.
  filters: []
  name: List line by parameters none and filters none
  outputs:
  - cim:Line
  params: []
  paraphrases:
  - Which entities are classified as lines?
  - List all objects that are lines.
  - What are the available lines?
  - Provide the list of equipment classified as lines.
  - Identify all lines in the system.
  questions: []
  sparql_template: ?line a cim:Line .
  template_id: template_list_a3bd5ce9356c17e72ecd1f7336da3823

Complex:

- connecting_properties:
  - https://cim.ucaiug.io/ns#GeneratingUnit.RotatingMachine
  description: A graph pattern which lists all values for generatingunit limited by
    parameters for rotatingmachine and by filter properties generatingunit.maxoperatingp
    . This pattern uses connecting properties generatingunit.rotatingmachine.
  filters:
  - !!python/tuple
    - cim:GeneratingUnit
    - https://cim.ucaiug.io/ns#GeneratingUnit.maxOperatingP
    - xsdfloat
  name: List generatingunit by parameters rotatingmachine and filters generatingunit.maxoperatingp
  outputs:
  - cim:GeneratingUnit
  params:
  - https://cim.ucaiug.io/ns#RotatingMachine
  paraphrases:
  - Which generating units with a rotating machine $ObjectIdentity(0, cim:RotatingMachine)
    have a maximum operating power of $ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP,
    xsdfloat) Megawatts?
  - List the generating units connected to the rotating machine $ObjectIdentity(0,
    cim:RotatingMachine) that have a max operating power of $ValueFilter(cim:GeneratingUnit,
    cim:GeneratingUnit.maxOperatingP, xsdfloat).
  - What are the generating units with a max operating power of $ValueFilter(cim:GeneratingUnit,
    cim:GeneratingUnit.maxOperatingP, xsdfloat) MW and associated with the rotating
    machine $ObjectIdentity(0, cim:RotatingMachine)?
  - Retrieve the generating units with $ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP,
    xsdfloat) maximum operating active power and connected to the rotating machine
    $ObjectIdentity(0, cim:RotatingMachine).
  - Which generating units linked to the rotating machine $ObjectIdentity(0, cim:RotatingMachine)
    have a maximum operating power of $ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP,
    xsdfloat) MW?
  questions: []
  sparql_template: '?generatingunit a cim:GeneratingUnit ;
    cim:GeneratingUnit.RotatingMachine {$ObjectIdentity(0, cim:RotatingMachine)} ;
    cim:GeneratingUnit.maxOperatingP {$ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP,
    xsdfloat)} .
    {$ObjectIdentity(0, cim:RotatingMachine)} a cim:RotatingMachine ;
    .'
  template_id: template_list_a312538fdacd1e0e8e3fe1b02c927408
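
The placeholders in the paraphrases and in sparql_template can be located with a regular expression. The sketch below infers the placeholder grammar from the two examples above, so treat it as an assumption rather than the project's parser; `PLACEHOLDER` and `find_placeholders` are hypothetical names.

```python
import re

# Matches $ObjectIdentity(...) and $ValueFilter(...), with or without the
# surrounding braces used in sparql_template.
PLACEHOLDER = re.compile(r"\$(ObjectIdentity|ValueFilter)\(([^)]*)\)")


def find_placeholders(text):
    """Return (kind, [args]) for each distinct placeholder, in order of first use."""
    seen = []
    for kind, raw_args in PLACEHOLDER.findall(text):
        item = (kind, [arg.strip() for arg in raw_args.split(",")])
        if item not in seen:
            seen.append(item)
    return seen


paraphrase = (
    "Which generating units with a rotating machine "
    "$ObjectIdentity(0, cim:RotatingMachine) have a maximum operating power of "
    "$ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP, xsdfloat) "
    "Megawatts?"
)
placeholders = find_placeholders(paraphrase)
for kind, args in placeholders:
    print(kind, args)
```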

Parameter set identification and selection

In this step, we search the KG to find valid combinations of parameter and filter values that will return some results, then we select a subset of them to generate questions over.

In our previous examples, the simple template has no parameters or filters and therefore there is only one version of that question we can generate. The complex template, however, has a parameter and a filter. The system will run the following SPARQL query:

PREFIX cim: <https://cim.ucaiug.io/ns#>
select ?param_0 ?param_1 (count(*) as ?cnt)
where {
    ?generatingunit a cim:GeneratingUnit ;
    cim:GeneratingUnit.RotatingMachine ?param_0 ;
    cim:GeneratingUnit.maxOperatingP ?param_1 .
    ?param_0 a cim:RotatingMachine ;
    .
} group by ?param_0 ?param_1 order by desc(?cnt)

which returns 80 combinations of RotatingMachine URI and maxOperatingP value, each of which identifies one generator. We select as many of these 80 as we need to reach our target number of questions per template and then generate questions that use those values.
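
One way to derive such a query from a sparql_template is to replace every distinct placeholder with a fresh variable. This is a hedged sketch, not the project's code: `parameter_set_query` and `BRACED` are hypothetical names, and the placeholder syntax is inferred from the template examples on this page.

```python
import re

BRACED = re.compile(r"\{\$(?:ObjectIdentity|ValueFilter)\([^)]*\)\}")


def parameter_set_query(sparql_template, prefixes=""):
    """Turn a template body into a query that counts results per value combination."""
    mapping = {}

    def to_variable(match):
        # The same placeholder always maps to the same variable.
        key = match.group(0)
        if key not in mapping:
            mapping[key] = f"?param_{len(mapping)}"
        return mapping[key]

    body = BRACED.sub(to_variable, sparql_template)
    if not mapping:
        # A template without parameters or filters has a single instantiation.
        raise ValueError("template has no parameters or filters")
    variables = " ".join(mapping.values())
    return (
        f"{prefixes}select {variables} (count(*) as ?cnt)\n"
        f"where {{\n{body}\n}} group by {variables} order by desc(?cnt)"
    )


template = (
    "?generatingunit a cim:GeneratingUnit ;\n"
    "    cim:GeneratingUnit.RotatingMachine {$ObjectIdentity(0, cim:RotatingMachine)} ;\n"
    "    cim:GeneratingUnit.maxOperatingP "
    "{$ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP, xsdfloat)} .\n"
    "{$ObjectIdentity(0, cim:RotatingMachine)} a cim:RotatingMachine ."
)
query = parameter_set_query(template, "PREFIX cim: <https://cim.ucaiug.io/ns#>\n")
print(query)
```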

Parameter naming and NL questions

Here we choose one of the available paraphrases and replace the parameters in them with actual values.

In our simple case, there are no parameters, so we just select one of the paraphrases. In our complex case, we have to name both a filter and a parameter. The filter is replaced by an exact value (e.g. 1183) in the natural language question. The parameter is referred to by one of name, mrid, or significant mrid (if available), chosen at random. Possible final natural language questions look like this:

Which generating units with a rotating machine f1769a2e have a maximum operating power of 1183 Megawatts?
What are the generating units with a max operating power of 1230 MW and associated with the rotating machine GRUNDFOR420 M3?
Retrieve the generating units with 900 maximum operating active power and connected to the rotating machine f1769a0a-9aeb-11e5-91da-b8763fd99c5f.
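
The substitution can be sketched as below for the case of one parameter and one filter. The function `instantiate` and the `entity_names` dictionary are hypothetical; the identifiers are taken from the examples on this page, and the name is left as None because the page does not give one for this machine.

```python
import random
import re

PLACEHOLDER = re.compile(r"\$(ObjectIdentity|ValueFilter)\([^)]*\)")


def instantiate(paraphrase, entity_names, filter_value, rng):
    """Fill a paraphrase with one entity reference and one exact filter value."""
    # Choose the naming style once so repeated mentions stay consistent,
    # skipping styles that are not available for this entity.
    reference = rng.choice([name for name in entity_names.values() if name])

    def fill(match):
        if match.group(1) == "ObjectIdentity":
            return reference
        return str(filter_value)

    return PLACEHOLDER.sub(fill, paraphrase)


paraphrase = (
    "Which generating units with a rotating machine "
    "$ObjectIdentity(0, cim:RotatingMachine) have a maximum operating power of "
    "$ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP, xsdfloat) "
    "Megawatts?"
)
entity_names = {
    "name": None,  # not given on this page for this machine
    "mrid": "f1769a2e-9aeb-11e5-91da-b8763fd99c5f",
    "significant_mrid": "f1769a2e",
}
question = instantiate(paraphrase, entity_names, 1183, random.Random(0))
print(question)
```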

Question queries and expected answers

We have two options for comparing SPARQL queries: compare the queries themselves or compare the results they produce. For now evaluation focuses on comparing results, but we record both in the KGQA corpus.

So in the complex case we'd make the query:

PREFIX cim: <https://cim.ucaiug.io/ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
select ?generatingunit
where {
    ?generatingunit a cim:GeneratingUnit ;
    cim:GeneratingUnit.RotatingMachine <urn:uuid:f1769a2e-9aeb-11e5-91da-b8763fd99c5f> ;
    cim:GeneratingUnit.maxOperatingP "1183"^^xsd:float .
    <urn:uuid:f1769a2e-9aeb-11e5-91da-b8763fd99c5f> a cim:RotatingMachine ;
    .
}

and the expected result has one relevant column with one row, which contains the value urn:uuid:f1769a2b-9aeb-11e5-91da-b8763fd99c5f.
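
This page does not specify how results are compared, so the following is only one possible approach: treat each result set as a collection of rows and ignore row order and variable names. The function `same_results` and the row format (one dictionary per row, with string values) are assumptions made for the example.

```python
def same_results(expected_rows, actual_rows):
    """Compare two result sets, ignoring row order and variable names."""
    def normalise(rows):
        # Each row becomes a sorted tuple of its values; sorting the rows
        # makes the comparison independent of the order they came back in.
        return sorted(tuple(sorted(row.values())) for row in rows)

    return normalise(expected_rows) == normalise(actual_rows)


expected = [{"generatingunit": "urn:uuid:f1769a2b-9aeb-11e5-91da-b8763fd99c5f"}]
actual = [{"unit": "urn:uuid:f1769a2b-9aeb-11e5-91da-b8763fd99c5f"}]
print(same_results(expected, actual))
```

Ignoring variable names lets a chatbot answer pass even if it names its output variable differently, at the cost of also ignoring column order within a row.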

Output

This is still being finished. The resulting KGQA dataset will be uploaded to the Talk2PowerSystem_LLM project by the end of the week.

Relevant Issues

[1]https://github.com/statnett/Talk2PowerSystem_PM/issues/64

[2]https://github.com/statnett/Talk2PowerSystem_PM/issues/62

[3]https://github.com/statnett/Talk2PowerSystem_PM/issues/63