Blog Ontology Simplification for LLM - statnett/Talk2PowerSystem GitHub Wiki

An LLM's short-term memory (its context window, measured in tokens) is limited, so a recurring problem is how to simplify an ontology before presenting it to an LLM. Even if the whole ontology fits in the context, long and badly presented ontologies run a high risk of the LLM getting confused or forgetting things in the middle. The shorter and "prettier" a schema is, the better its chance to be understood and used correctly in query answering.

In this blog we share our experience with:

  • An ERAbot for the ERA KG of the European Union Agency for Railways. It uses the standard GraphDB Talk to Your Graph functionality (TTYG)
  • The Talk2PowerSystem project for Statnett, involving CIM/CGMES electrical data. It uses a custom chatbot with custom tools; periodically we take learnings from this project and add them to the GraphDB product

GraphDB TTYG: ERAbot

When setting up an ERAbot with GraphDB Talk to Your Graph (TTYG) at https://rail.sandbox.ontotext.com/sparql, we loaded the ERA Vocabulary 3.0.1 (20240618) to a named graph https://data-interop.era.europa.eu/era-vocabulary/v3-20240618/ontology.ttl and set up the bot as follows:

But we got this error:

The size of all agent instructions must fit into 256,000 characters but was 256,597 characters.

Shorten Descriptions

Let's cut ontology descriptions (rdfs:comment) at some fixed length, but only at the next space, so that no word is split.

Let's first find the distribution of lengths (in buckets of N chars):

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?len (count(*) as ?c) where {
    bind(50 as ?bucket)
    ?s rdfs:comment ?def
    bind(floor(strlen(?def)/?bucket)*?bucket as ?len)
} group by ?len order by ?len
len c
0.0 86
50.0 180
100.0 92
150.0 67
200.0 101
250.0 18
300.0 11
350.0 17
400.0 10
450.0 2
500.0 7
550.0 1
600.0 2
650.0 4
850.0 1
1500.0 1
3000.0 1
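To sanity-check a candidate cut-point against this distribution, here is a small Python sketch (not part of the original workflow). The counts are copied from the result above; the helper names `bucket` and `affected` are made up for illustration.

```python
import math

def bucket(length, size=50):
    """Mirror of the SPARQL expression floor(strlen(?def)/?bucket)*?bucket."""
    return math.floor(length / size) * size

# Bucket start -> number of descriptions, copied from the query result above
histogram = {
    0: 86, 50: 180, 100: 92, 150: 67, 200: 101, 250: 18, 300: 11, 350: 17,
    400: 10, 450: 2, 500: 7, 550: 1, 600: 2, 650: 4, 850: 1, 1500: 1, 3000: 1,
}

def affected(hist, cut):
    """Count descriptions in buckets that start at or above the cut-point."""
    return sum(count for start, count in hist.items() if start >= cut)

total = sum(histogram.values())
long_ones = affected(histogram, 400)
print(f"{long_ones} of {total} descriptions are 400 chars or longer")
```

With these counts, at most 29 of 601 descriptions (under 5%) fall at or above 400 chars.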

A good cut-point is at max 400 chars: this leaves most descriptions intact, yet shortens the excessively long ones.

We need to run an Update query. But before that, it's always prudent to run a Select query to test that the results are as expected:

# WRONG

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select *
where {{graph ?g {?x rdfs:comment ?old
  filter(strlen(?old)>400)
  bind(replace(?old,"(.{400}.*?) .*","$1") as ?new)
}}}

This is a complex query that needs some explanation:

  • I've loaded the ontology into a graph (that's the easiest way to feed it to TTYG), so we need to select from a graph, leading to all these brackets {{ ... }}}
  • The filter considers only descriptions longer than 400 chars: else leave it alone
  • The replace function takes the input string as first parameter, a regex as second, and the replacement as third.
    • Initially I used this one: "(.{400}.*?) .*", which means:
      • Find 400 chars: .{400}
      • Followed by as few chars as possible up to the next space: .*? is a non-greedy quantifier, so it matches as little as it can
      • Capture them in a group (...)
      • Followed by a space and then any number of chars: .*. Don't capture them in a group, effectively discarding them
      • Replace with $1, which means with the first (and only) group
    • But running the above test query showed that the regex is wrong
      • The reason is that . in SPARQL regexes means "any char except newline"
      • So I had to replace the first and the last . with (.|\\n)
      • I also replaced space with ( |\\n), i.e. "whitespace". A more compact way to write this is \\s (which also includes tab, carriage return, etc): SPARQL regexes follow the XPath/XQuery regex syntax, which supports it
      • The double backslash is a SPARQL string escape: it results in \n which indicates newline in a regex
      • I cannot use a character class [.\n] because . is not a wildcard in [...] but a literal .
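The newline problem can be reproduced outside SPARQL. The sketch below uses Python's `re` module as a stand-in (an assumption: Python's regex flavor differs from SPARQL's in details, but `.` also excludes newline by default). The sample text and the limit of 20 instead of 400 are made up to keep the example readable.

```python
import re

# Small stand-in for a long description; the limit is 20 instead of 400
text = "first line\nsecond line has many more words in it"

# First attempt: "." does not match "\n", so the match cannot start
# at the beginning of the text and the cut lands in the wrong place
wrong = re.sub(r"(.{20}.*?) .*", r"\1", text)

# Corrected: (.|\n) matches any char including newline,
# and ( |\n) accepts either a space or a newline as the break point
right = re.sub(r"((.|\n){20}.*?)( |\n)(.|\n)*", r"\1", text)

print(repr(wrong))
print(repr(right))
```

The first pattern yields 'first line\nsecond line has many' (31 chars, because the 20-char count only starts after the newline), while the corrected one yields 'first line\nsecond line' (cut at the first space after 20 chars).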

So the corrected test query is like this:

# RIGHT

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select *
where {{graph ?g {?x rdfs:comment ?old
  filter(strlen(?old)>400)
  bind(replace(?old,"((.|\\n){400}.*?)( |\\n)(.|\\n)*","$1") as ?new)
}}}

Examining the results shows they are correct.

For example, the description:

The index of a vocabulary term in Appendix D2 Elements the infrastructure manager has to provide to the railway undertaking for the Route Book from the document Commission Implementing Regulation (EU) 2019/773 of 16 May 2019 on the technical specification for interoperability relating to the operation and traffic management subsystem of the rail system within the European Union and repealing Decision 2012/757/EU.

is cut just before the trailing text 2012/757/EU. That particular regulation number is of little use to the LLM to improve its understanding of the ontology.

However, some descriptions are left at longer than 400 chars. Eg this one:

Indicates whether a RINF parameter is used in Route Compatibility Check calculations according to Commission Implementing Regulation (EU) 2019/773 of 16 May 2019 on the technical specification for interoperability relating to the operation and traffic management subsystem of the rail system within the European Union and repealing Decision. https://eur-lex.europa.eu/eli/reg_impl/2019/773/oj#:~:text=Commission%20Implementing%20Regulation.

is not shortened at all because the cutoff falls at text= and there's no space in the rest of the description. This is just as well, since it leaves the link intact, so the LLM can give it back to the user if asked about it.
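Both observations can be checked with a short Python sketch. It is only an approximation of the SPARQL query: `shorten` is a hypothetical helper, and Python's `re` stands in for the SPARQL regex engine.

```python
import re

def shorten(text, limit=400):
    """Cut at the first space or newline after `limit` chars, if there is one."""
    if len(text) <= limit:  # mirrors filter(strlen(?old)>400)
        return text
    pattern = r"((.|\n){%d}.*?)( |\n)(.|\n)*" % limit
    return re.sub(pattern, r"\1", text)

# The two descriptions quoted above
d2_index = (
    "The index of a vocabulary term in Appendix D2 Elements the infrastructure "
    "manager has to provide to the railway undertaking for the Route Book from "
    "the document Commission Implementing Regulation (EU) 2019/773 of 16 May 2019 "
    "on the technical specification for interoperability relating to the operation "
    "and traffic management subsystem of the rail system within the European Union "
    "and repealing Decision 2012/757/EU."
)
rcc_flag = (
    "Indicates whether a RINF parameter is used in Route Compatibility Check "
    "calculations according to Commission Implementing Regulation (EU) 2019/773 "
    "of 16 May 2019 on the technical specification for interoperability relating "
    "to the operation and traffic management subsystem of the rail system within "
    "the European Union and repealing Decision. "
    "https://eur-lex.europa.eu/eli/reg_impl/2019/773/oj"
    "#:~:text=Commission%20Implementing%20Regulation."
)

print(shorten(d2_index)[-30:])         # tail of the shortened text
print(shorten(rcc_flag) == rcc_flag)   # no space after the cutoff, so unchanged
```

The first description loses only its last word, and the second keeps its link because the pattern finds no whitespace after the 400th character.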

With all this said, we're now ready to shorten the descriptions with this SPARQL update:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
delete {graph ?g {?x rdfs:comment ?old}}
insert {graph ?g {?x rdfs:comment ?new}}
where {{graph ?g {?x rdfs:comment ?old
  filter(strlen(?old)>400)
  bind(replace(?old,"((.|\\n){400}.*?)( |\\n)(.|\\n)*","$1") as ?new)
}}}

I like to always align the delete, insert and where (select) patterns, so it's immediately obvious what is being replaced. And to call values being deleted ?old, and values being inserted ?new.

Such simple discipline makes Updates easier to understand for the reader.
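One way to keep the test Select and the Update aligned is to generate both from a single shared pattern, so the regex that was tested is exactly the regex that gets executed. This is a minimal sketch, not something used in the project; the function names are hypothetical.

```python
PREFIXES = "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"

def where_clause(limit, regex):
    """Shared WHERE pattern; regex is given exactly as written inside the SPARQL string."""
    return (
        "where {{graph ?g {?x rdfs:comment ?old\n"
        "  filter(strlen(?old)>" + str(limit) + ")\n"
        '  bind(replace(?old,"' + regex + '","$1") as ?new)\n'
        "}}}"
    )

def select_query(limit, regex):
    """Dry run: shows ?old and ?new side by side without changing data."""
    return PREFIXES + "select *\n" + where_clause(limit, regex)

def update_query(limit, regex):
    """The actual update, with delete/insert/where aligned."""
    return (
        PREFIXES
        + "delete {graph ?g {?x rdfs:comment ?old}}\n"
        + "insert {graph ?g {?x rdfs:comment ?new}}\n"
        + where_clause(limit, regex)
    )

# Raw string: the doubled backslashes reach SPARQL unchanged
REGEX = r"((.|\\n){400}.*?)( |\\n)(.|\\n)*"

print(select_query(400, REGEX))
print()
print(update_query(400, REGEX))
```

Because both queries end with the same generated WHERE clause, a regex fix made while testing cannot be forgotten in the update.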

Delete Inessential Bookkeeping Info

The ERA ontology includes annotations stating which term came from which clause of the appendices of the RINF regulation, and what is the corresponding XML element name. These are not needed for querying, so we can get rid of them:

prefix : <http://data.europa.eu/949/>
delete where {graph ?g {?x :XMLName ?y}};
delete where {graph ?g {?x :appendixD2Index ?y}};
delete where {graph ?g {?x :appendixD3Index ?y}};
delete where {graph ?g {?x :rinfIndex       ?y}};

We also delete dct:created and dct:modified, which state when a term was created and last updated. Be careful to do this only for ontology terms, since these props are often used with instance data as well: graph ?g matches every named graph, so if instance data also lives in named graphs, replace ?g with the ontology graph.

prefix dct: <http://purl.org/dc/terms/>
delete where {graph ?g {?x dct:created  ?y}};
delete where {graph ?g {?x dct:modified ?y}};

Unfortunately, the previous step "Shorten Descriptions" did not shorten the ontology sufficiently, so we had to delete all descriptions:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
delete where {graph ?g {?x rdfs:comment ?y}};

Simplify External Terms

The ERA ontology includes external terms (terms reused from other ontologies) with all their accoutrements, which sometimes include long descriptions in 6-7 languages, including Japanese. For example:

<http://www.w3.org/ns/org#identifier> rdfs:comment
  "組織を一意に識別するために使用できる会社登録番号などの識別子を与えます。"@ja,
  "Código o identificador, como por ejemplo el CIF de una empresa, que permite identificar de forma inequívoca a una organización. Existen muchos códigos de identificación tanto nacionales como internacionales. Esta ontología no obliga al uso de ningún esquema en concreto. Los códigos de identificación utilizados en cada caso se deberían indicar mediante el uso"@es,
  "Donne un identifiant, comme par exemple le numéro d'enregistrement d'une entreprise, qui peut être utilisé comme identifiant unique pour l'Organisation. De nombreux schémas nationaux et internationaux sont disponibles. Cette ontologie reste neutre par rapport au schéma utilisé. Le schéma particulier utilisé devrait être indiqué par le `datatype` de la valeur"@fr,

We could delete all descriptions of external terms, or only the non-English descriptions as here:

PREFIX era: <http://data.europa.eu/949/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
delete {graph ?g {?x rdfs:comment ?old}}
where {{graph ?g {?x rdfs:comment ?old
  filter(!strstarts(str(?x),str(era:)))
  filter(lang(?old) not in ("en",""))
}}}

The filter !strstarts(str(?x),str(era:)) says "URLs that are not in the era: namespace".
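The combined effect of the two filters can be illustrated in Python on a few in-memory records. This is only a sketch: `should_delete` is a hypothetical helper, and the French comment on an ERA term is an invented test case.

```python
ERA_NS = "http://data.europa.eu/949/"

def should_delete(subject, lang):
    """Mirror of the two SPARQL filters: external term AND non-English comment."""
    is_external = not subject.startswith(ERA_NS)  # !strstarts(str(?x),str(era:))
    is_foreign = lang not in ("en", "")           # lang(?old) not in ("en","")
    return is_external and is_foreign

# (subject IRI, language tag of the rdfs:comment)
comments = [
    ("http://www.w3.org/ns/org#identifier", "ja"),
    ("http://www.w3.org/ns/org#identifier", "es"),
    ("http://www.w3.org/ns/org#identifier", "en"),
    ("http://www.w3.org/ns/org#identifier", ""),   # comment without language tag
    ("http://data.europa.eu/949/Track", "fr"),     # invented: ERA term with French comment
]

for subject, lang in comments:
    action = "delete" if should_delete(subject, lang) else "keep"
    print(f"{action:6} {subject} @{lang or '-'}")
```

Only the Japanese and Spanish comments on the external term are deleted; English and untagged comments stay, and so does everything in the era: namespace.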

Add Missing Prefixes

The ontology includes a large number of triples like this:

:eddyCurrentBrakingConditionsDocument 
  rdfs:isDefinedBy <http://data.europa.eu/949/>;
  <http://www.w3.org/2003/06/sw-vocab-status/ns#term_status> "stable" .

They don't use prefixes, which lengthens the ontology unnecessarily.

We can add prefixes with an empty insert data update: then they are recorded as repository namespaces, and used on export:

prefix era:  <http://data.europa.eu/949/>
prefix vs:   <http://www.w3.org/2003/06/sw-vocab-status/ns#>
prefix dc:   <http://purl.org/dc/elements/1.1/>
prefix geo:  <http://www.opengis.net/ont/geosparql#>
prefix org:  <http://www.w3.org/ns/org#>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
insert data {}

Notes:

  • I've added a few more prefixes for external terms.
  • The ontology uses the empty prefix : instead of era:. We hoped that by defining the latter prefix, the provenance statements would be shortened
# from 
rdfs:isDefinedBy <http://data.europa.eu/949/>;
# to
rdfs:isDefinedBy era:

But for some reason, this doesn't happen.
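For reference, the shortening we expected from registered prefixes looks like this in a Python sketch. It illustrates the intended effect only and says nothing about how GraphDB's exporter actually works; `compact` is a hypothetical helper that ignores Turtle's restrictions on local names.

```python
PREFIXES = {
    "era":  "http://data.europa.eu/949/",
    "vs":   "http://www.w3.org/2003/06/sw-vocab-status/ns#",
    "dc":   "http://purl.org/dc/elements/1.1/",
    "geo":  "http://www.opengis.net/ont/geosparql#",
    "org":  "http://www.w3.org/ns/org#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

def compact(iri, prefixes=PREFIXES):
    """Return prefix:local for the longest matching namespace, else <iri>."""
    best = None
    for prefix, ns in prefixes.items():
        if iri.startswith(ns) and (best is None or len(ns) > len(best[1])):
            best = (prefix, ns)
    if best is None:
        return "<" + iri + ">"
    return best[0] + ":" + iri[len(best[1]):]

for iri in [
    "http://www.w3.org/2003/06/sw-vocab-status/ns#term_status",
    "http://purl.org/dc/elements/1.1/contributor",
    "http://data.europa.eu/949/",   # the namespace IRI itself
    "http://example.org/unknown",   # no registered prefix
]:
    print(compact(iri))
```

Under this naive rule the namespace IRI itself becomes era:, a prefixed name with an empty local part, which is valid Turtle.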

Custom Chatbot: Talk2PowerSystem

In this section we describe improvements that we have implemented in the Talk2PowerSystem project for Statnett.

Improve Ontology Presentation

If you use "Examine agent instructions" in ERA TTYG, the ontology looks like this:

<http://data.europa.eu/949/> a owl:Ontology .

# Much later
<http://data.europa.eu/949/> owl:imports <http://www.opengis.net/ont/geosparql#>;
  cc:license <https://creativecommons.org/licenses/by/4.0/>;
  <http://purl.org/dc/elements/1.1/contributor> "Dragos Patru, (ERA)", "Ghislain Atemezing, (ERA)",
    "Maarten Duhoux, (ERA)", "Marina Aguado, (ERA)", "Polymnia Vasilopoulou, (ERA)";
  <http://purl.org/dc/elements/1.1/creator> "Edna Ruckhaus, (UPM)", "Oscar Corcho, (UPM)".

# Much later
<http://data.europa.eu/949/> dct:publisher "European Union Agency for Railways";
  dct:title "ERA Ontology"@en;
  rdfs:comment "This is the human and machine readable Ontology governed by the European Union Agency for Railways (https://www.era.europa.eu/). It represents the concepts and relationships linked to the sectorial legal framework and the use cases under the Agency´s remit, as described in the Commission Implementing Regulation (EU) [to be updated after publication] on the common specifications for the register of"@en;
  # notice the cutting in the last sentence
  rdfs:label "ERA Ontology"@en .
  
_:node448 a owl:Class .

:minimumConcaveVerticalRadius a owl:FunctionalProperty, owl:DatatypeProperty .

:hasAutomaticDroppingDevice rdfs:domain :VehicleType;
  rdfs:range xsd:boolean .

:SubsetWithCommonCharacteristics dct:modified "2023-03-14"^^xsd:date .

:Track dct:modified "2022-07-07"^^xsd:date .

This has numerous defects, which make the ontology hard to understand for humans and LLMs alike:

  • The description of an ontology term is often split into several distant blocks
  • Terms are not ordered: neither alphabetically, nor in logical groups (Classes, Properties, Individuals)
  • OWL Restrictions are presented using Blank nodes

For the purpose of electrical ontology (CIM/CGMES and NC) optimizations, we have evaluated a number of Turtle serialization tools, see Inst4CIM-KG#turtle-serialization. The best we found in Java is the turtle-formatter library by Andreas Textor, and the owl-cli tool (owl-cli-snapshot.jar), see usage guide of the write-command. It has a number of useful features and you can see as an example the pretty-printed EQ profile of the electrical CIM: 61970-600-2_Equipment-AP-Voc-RDFS2020_v3-0-0.ttl. We posted a GraphDB enhancement request for TTYG to use the same formatting techniques.
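The first two defects come down to grouping and ordering. As a rough illustration (this is not how turtle-formatter or owl-cli work internally, and it does not handle blank nodes), the following Python sketch gathers all statements of a term into one block and sorts the terms. The triple `:Track a owl:Class` is an assumption added to show the merge.

```python
from collections import defaultdict

def group_by_subject(triples):
    """Collect predicate-object pairs per subject, sorted by subject and predicate."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((p, o))
    return {s: sorted(grouped[s]) for s in sorted(grouped)}

def render(grouped):
    """Print one Turtle-like block per subject."""
    blocks = []
    for s, pairs in grouped.items():
        lines = [f"  {p} {o}" for p, o in pairs]
        blocks.append(s + "\n" + " ;\n".join(lines) + " .")
    return "\n\n".join(blocks)

# Statements in scattered order, as in the agent instructions above
triples = [
    (":hasAutomaticDroppingDevice", "rdfs:domain", ":VehicleType"),
    (":Track", "dct:modified", '"2022-07-07"^^xsd:date'),
    (":hasAutomaticDroppingDevice", "rdfs:range", "xsd:boolean"),
    (":Track", "a", "owl:Class"),  # assumed statement, far from the other :Track block
]

grouped = group_by_subject(triples)
print(render(grouped))
```

A real formatter would additionally order terms by kind (classes, properties, individuals) and inline blank nodes.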

Subset the Ontology

CIM/CGMES+NC is a very large ontology: 900 classes and about 6000 props. So for Talk2PowerSystem, we subset the ontology to the terms (classes, props and enumeration values) actually used in the data. We use an ontology-query.rq that goes something like this:

PREFIX uml:  <http://iec.ch/TC57/NonStandard/UML#>
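PREFIX cims: <http://iec.ch/TC57/1999/rdf-schema-extensions-19990926#>
# The cim: prefix used below must also be declared; its namespace depends on the CIM version of the data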

SELECT DISTINCT ?x 
{
  {         # properties
    [] ?x []
  } UNION { # classes
    [] a ?x
  } UNION { # enumeration values
    ?x a ?enum.
    ?enum cims:stereotype uml:enumeration
    FILTER EXISTS {
      ?thing a cim:IdentifiedObject; ?prop ?x
    }
  }
}

It also uses tricks such as federated querying to the same repo to:

  • Use inference while discovering terms to include
  • Not use inference while describing those terms
  • Shorten descriptions
  • Apply other steps described under task Ontology subsetting and presentation to LLM (this link is not public)

You can see the current result in cim-subset-pretty.ttl
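The idea behind the subsetting query can be sketched in Python over in-memory triples. This simplified version checks only whether an enumeration value occurs as an object, skips the cim:IdentifiedObject condition, and uses made-up instance data.

```python
RDF_TYPE = "rdf:type"

def used_terms(data, enum_values):
    """Collect properties, classes and enumeration values that occur in the data."""
    props = {p for _, p, _ in data}                      # [] ?x []
    classes = {o for _, p, o in data if p == RDF_TYPE}   # [] a ?x
    enums = {o for _, _, o in data if o in enum_values}  # enum values used as objects
    return props | classes | enums

# Made-up instance data
data = [
    ("ex:breaker1", RDF_TYPE, "cim:Breaker"),
    ("ex:breaker1", "cim:IdentifiedObject.name", '"B1"'),
    ("ex:machine1", RDF_TYPE, "cim:SynchronousMachine"),
    ("ex:machine1", "cim:SynchronousMachine.type", "cim:SynchronousMachineKind.generator"),
]

# Enumeration values declared in the ontology (only one of them is used)
enum_values = {
    "cim:SynchronousMachineKind.generator",
    "cim:SynchronousMachineKind.condenser",
}

subset = used_terms(data, enum_values)
print(sorted(subset))
```

The unused value cim:SynchronousMachineKind.condenser is left out, so its definition would not be presented to the LLM.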

Convert to SOML and SOML-Simple

In the Talk2PowerSystem project we'll not only try NLQ using SPARQL generation, but also GraphQL generation. GraphQL in the Ontotext Platform was controlled by the Semantic Object Modeling Language (SOML).

In GraphDB 11, GraphQL is rolled into the database, and is controlled by a variety of schemas (ontologies and shapes) expressed in RDF: see GraphQL Schema Generation and GraphQL Migration. You can generate such a schema from SOML.

The open source repository https://github.com/vladimirAlexiev/soml includes some tools for working with SOML, which can:

  • Generate SOML schema from RDFS/OWL/Schema ontologies (owl2soml)
  • Simplify a SOML entity schema significantly, so it can be communicated more easily to LLM for querying (soml-simplify)

Take for example these 2 classes and 2 properties from the OP profile:

cim:AccumulatorReset a owl:Class ;
  rdfs:label "AccumulatorReset" ;
  rdfs:comment "This command resets the counter value to zero."@en ;
  cims:belongsToCategory op:Package_OperationProfile ;
  cims:stereotype uml:concrete ;
  rdfs:subClassOf cim:Control .

cim:Control a owl:Class ;
  rdfs:label "Control" ;
  rdfs:comment "Control is used for supervisory/device control. It represents control outputs that are used to change the state in a process, e.g. close or open breaker, a set point value or a raise lower command."@en ;
  cims:belongsToCategory op:Package_OperationProfile ;
  rdfs:subClassOf cim:IOPoint .

cim:AccumulatorReset.AccumulatorValue a owl:ObjectProperty, owl:FunctionalProperty, owl:InverseFunctionalProperty ;
  rdfs:label "AccumulatorValue" ;
  rdfs:comment "The accumulator value that is reset by the command."@en ;
  cims:AssociationUsed "Yes" ;
  cims:multiplicity cims:M:1..1 ;
  owl:inverseOf cim:AccumulatorValue.AccumulatorReset ;
  rdfs:domain cim:AccumulatorReset ;
  rdfs:range cim:AccumulatorValue .

cim:Control.PowerSystemResource a owl:ObjectProperty, owl:FunctionalProperty ;
  rdfs:label "PowerSystemResource" ;
  rdfs:comment "Regulating device governed by this control output."@en ;
  cims:AssociationUsed "Yes" ;
  cims:multiplicity cims:M:0..1 ;
  owl:inverseOf cim:PowerSystemResource.Controls ;
  rdfs:domain cim:Control ;
  rdfs:range cim:PowerSystemResource .

They are translated to the following SOML:

objects:
  AccumulatorReset:
    descr: This command resets the counter value to zero
    inherits: ControlInterface
    label: AccumulatorReset
    props:
      accumulatorReset.AccumulatorValue: {}
    type: cim:AccumulatorReset
  ControlInterface:
    descr: Abstract superclass of Control
    inherits: IdentifiedObjectInterface
    kind: abstract
    props:
      control.PowerSystemResource: {}
properties:
  accumulatorReset.AccumulatorValue:
    descr: The accumulator value that is reset by the command
    inverseOf: accumulatorValue.AccumulatorReset
    kind: object
    label: AccumulatorValue
    max: 1
    min: 1
    range: AccumulatorValue
    rdfProp: cim:AccumulatorReset.AccumulatorValue
  control.PowerSystemResource:
    descr: 'The controller outputs used to...'
    inverseOf: powerSystemResource.Controls
    kind: object
    label: PowerSystemResource
    max: inf
    min: 0
    range: PowerSystemResourceInterface
    rdfProp: cim:Control.PowerSystemResource

soml-simplify leaves only the bare essentials needed for querying:

AccumulatorReset:
  ISA: ControlInterface
  accumulatorReset.AccumulatorValue: AccumulatorValue
ControlInterface:
  ISA: IdentifiedObjectInterface
  control.PowerSystemResource: [PowerSystemResourceInterface]
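
The simplification shown above can be approximated in a few lines of Python. This is not the soml-simplify implementation: the rule "a property with max: inf is written as a one-element list" is inferred from the example output, and the schema is given as a plain dict (reduced to the keys that matter) rather than parsed from YAML.

```python
# Reduced version of the SOML above: only the keys used by the simplification
soml = {
    "objects": {
        "AccumulatorReset": {
            "inherits": "ControlInterface",
            "props": {"accumulatorReset.AccumulatorValue": {}},
        },
        "ControlInterface": {
            "inherits": "IdentifiedObjectInterface",
            "props": {"control.PowerSystemResource": {}},
        },
    },
    "properties": {
        "accumulatorReset.AccumulatorValue": {"range": "AccumulatorValue", "max": 1},
        "control.PowerSystemResource": {"range": "PowerSystemResourceInterface", "max": "inf"},
    },
}

def simplify(schema):
    """Keep only the parent (ISA) and each property's range per object."""
    result = {}
    for name, obj in schema["objects"].items():
        entry = {}
        if "inherits" in obj:
            entry["ISA"] = obj["inherits"]
        for prop in obj.get("props", {}):
            spec = schema["properties"][prop]
            rng = spec["range"]
            # Inferred rule: multi-valued props are shown as a list
            entry[prop] = [rng] if spec.get("max") == "inf" else rng
        result[name] = entry
    return result

simple = simplify(soml)
for name, entry in simple.items():
    print(name, entry)
```

Labels, descriptions, inverses and rdfProp mappings are dropped, which is where most of the size reduction comes from.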

Slide 16 of the presentation Talk2PowerSystem: Democratizing power system analytics (Vladimir Alexiev and Svein Harald Olsen, CIM IEC TC57 meeting, EDF R&D, Paris-Saclay, 6 June 2025) shows a size comparison of different schemas of a subset of CIM/CGMES 16 (Equipment*+Geography*+model):

  • turtle 1.56M
  • SOML 260k (6x shorter)
  • simplified 37k (42x shorter)

While we have used SOML-simplified for GraphQL querying, preliminary experiments show that it is also useful for SPARQL querying. But more work is needed to put this in production.

You can see some examples in owl2soml/eg, eg: