Blog Ontology Simplification for LLM
An LLM's short-term memory (its context window, measured in tokens) is limited, so a recurring problem is how to simplify an ontology before presenting it to the LLM. Even if the whole ontology fits, long and badly presented ontologies run a high risk of the LLM getting confused or forgetting things in the middle. The shorter and "prettier" a schema is, the better its chance of being understood and used correctly in query answering.
In this blog we share our experience with:
- An ERAbot for the ERA KG of the European Union Agency for Railways. It uses the standard GraphDB Talk to Your Graph (TTYG) functionality.
- The Talk2PowerSystem project for Statnett, involving CIM/CGMES electrical data. It uses a custom chatbot with custom tools; periodically we take learnings from this project and add them to the GraphDB product.
GraphDB TTYG: ERAbot
When setting up an ERAbot with GraphDB Talk to Your Graph (TTYG) at https://rail.sandbox.ontotext.com/sparql, we loaded the ERA Vocabulary 3.0.1 (20240618) into the named graph https://data-interop.era.europa.eu/era-vocabulary/v3-20240618/ontology.ttl and set up the bot to use it.
But we got this error:
The size of all agent instructions must fit into 256,000 characters but was 256,597 characters.
Shorten Descriptions
Let's cut ontology descriptions (`rdfs:comment`) at some fixed length, but only at a space, so that no word is broken in the middle.
Let's first find the distribution of lengths (in buckets of N chars):
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?len (count(*) as ?c) where {
bind(50 as ?bucket)
?s rdfs:comment ?def
bind(floor(strlen(?def)/?bucket)*?bucket as ?len)
} group by ?len order by ?len
len | c |
---|---|
0.0 | 86 |
50.0 | 180 |
100.0 | 92 |
150.0 | 67 |
200.0 | 101 |
250.0 | 18 |
300.0 | 11 |
350.0 | 17 |
400.0 | 10 |
450.0 | 2 |
500.0 | 7 |
550.0 | 1 |
600.0 | 2 |
650.0 | 4 |
850.0 | 1 |
1500.0 | 1 |
3000.0 | 1 |
A good cut-point is 400 chars: 572 of the 601 descriptions (95%) are shorter than that and stay intact, while the excessively long ones get shortened.
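The bucketing done by the query can also be reproduced outside the database when weighing a candidate cut-point. This is a minimal Python sketch and not part of the original workflow: the names `length_histogram` and `share_below` are made up for illustration, and `ERA_HISTOGRAM` simply copies the result table above.

```python
from collections import Counter

def length_histogram(descriptions, bucket=50):
    """Count descriptions per length bucket, like floor(strlen/bucket)*bucket."""
    counts = Counter((len(d) // bucket) * bucket for d in descriptions)
    return dict(sorted(counts.items()))

def share_below(histogram, cutoff):
    """Fraction of descriptions whose bucket starts below `cutoff`."""
    total = sum(histogram.values())
    kept = sum(count for start, count in histogram.items() if start < cutoff)
    return kept / total

# Result table of the SPARQL query above (bucket start -> count)
ERA_HISTOGRAM = {
    0: 86, 50: 180, 100: 92, 150: 67, 200: 101, 250: 18, 300: 11, 350: 17,
    400: 10, 450: 2, 500: 7, 550: 1, 600: 2, 650: 4, 850: 1, 1500: 1, 3000: 1,
}

for cutoff in (200, 300, 400):
    print(cutoff, f"{share_below(ERA_HISTOGRAM, cutoff):.1%}")
```

Comparing a few cutoffs this way shows how many descriptions each one would leave untouched, before any update is run.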
We need to run an Update query. But before that, it's always prudent to run a Select query to test that the results are as expected:
# WRONG
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select *
where {{graph ?g {?x rdfs:comment ?old
filter(strlen(?old)>400)
bind(replace(?old,"(.{400}.*?) .*","$1") as ?new)
}}}
This is a complex query that needs some explanation:
- I've loaded the ontology into a graph (that's the easiest way to feed it to TTYG), so we need to select from a graph, leading to all these brackets `{{{ ... }}}`
- The `filter` considers only descriptions longer than 400 chars and leaves the rest alone
- The `replace` function takes a regex as its second parameter (the first is the input string).
  - Initially I used this one: `"(.{400}.*?) .*"`, which means:
    - Find 400 chars: `.{400}`
    - Followed by as few chars as possible before the next space: `.*?` uses a non-greedy quantifier
    - Capture them in a group `(...)`
    - Followed by a space and then any number of chars: `.*`. Don't capture them in a group, effectively discarding them
    - Replace with `$1`, which means with the first (and only) group
  - But running the above test query showed that the regex is wrong
    - The reason is that `.` in SPARQL regexes means "any char except newline"
    - So I had to replace the first and the last `.` with `(.|\\n)`
    - I also replaced space with `( |\\n)`, i.e. "whitespace". Perhaps a better way to write this is `\\s` (which also includes tab, linefeed, etc) but I'm not sure whether SPARQL regexes support it
    - The double backslash is a SPARQL string escape: it results in `\n`, which indicates newline in a regex
    - I cannot use a character class `[.\n]` because `.` is not a wildcard in `[...]` but a literal
So the corrected test query is like this:
# RIGHT
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select *
where {{graph ?g {?x rdfs:comment ?old
filter(strlen(?old)>400)
bind(replace(?old,"((.|\\n){400}.*?)( |\\n)(.|\\n)*","$1") as ?new)
}}}
Examining the results shows they are correct.
For example, the description:
The index of a vocabulary term in Appendix D2 Elements the infrastructure manager has to provide to the railway undertaking for the Route Book from the document Commission Implementing Regulation (EU) 2019/773 of 16 May 2019 on the technical specification for interoperability relating to the operation and traffic management subsystem of the rail system within the European Union and repealing Decision
2012/757/EU.
It is cut right after "repealing Decision", so the trailing 2012/757/EU. is dropped.
That particular regulation number is of little use to the LLM to improve its understanding of the ontology.
However, some descriptions remain longer than 400 chars, e.g. this one:
Indicates whether a RINF parameter is used in Route Compatibility Check calculations according to Commission Implementing Regulation (EU) 2019/773 of 16 May 2019 on the technical specification for interoperability relating to the operation and traffic management subsystem of the rail system within the European Union and repealing Decision. https://eur-lex.europa.eu/eli/reg_impl/2019/773/oj#:~:text=Commission%20Implementing%20Regulation.
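It is not shortened at all, because the 400-char cutoff falls inside the URL (at `text=`) and there's no space in the rest of the description. The regex therefore does not match, and `replace` returns the string unchanged.
This is just as well, since it leaves the link intact, so the LLM can give it back to the user if asked about it.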
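The regex behaviour discussed in this section can be reproduced outside SPARQL. The sketch below uses Python's `re` module as a stand-in, on the assumption that it is close enough for illustration: like SPARQL regexes, its `.` excludes newline by default, but it writes the group reference as `\1` instead of `$1`. The function names and the tiny sample strings are made up, and a limit of 12 is used instead of 400 to keep the example readable.

```python
import re

def shorten_naive(text, limit=400):
    """First attempt: '.' never matches a newline, so multi-line text is mishandled."""
    return re.sub(r"(.{%d}.*?) .*" % limit, r"\1", text, count=1)

def shorten(text, limit=400):
    """Corrected pattern: newlines are allowed, and the cut falls at a space or newline."""
    pattern = r"((.|\n){%d}.*?)( |\n)(.|\n)*" % limit
    return re.sub(pattern, r"\1", text, count=1)

multi_line = "aaaa bbbb\ncccc dddd\neeee ffff"
no_later_space = "aaaa bbbb cccc-dddd-eeee"

# The naive pattern finds no run of 12 non-newline chars, so nothing is cut
print(repr(shorten_naive(multi_line, 12)))
# The corrected pattern cuts at the first space after the 12th char
print(repr(shorten(multi_line, 12)))
# No space after the 12th char: no match, so the text is returned unchanged
print(repr(shorten(no_later_space, 12)))
```

The last call mirrors the URL example above: when nothing after the cutoff can serve as a cut-point, the description is left as it is.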
With all this said, we're now ready to shorten the descriptions with this SPARQL update:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
delete {graph ?g {?x rdfs:comment ?old}}
insert {graph ?g {?x rdfs:comment ?new}}
where {{graph ?g {?x rdfs:comment ?old
filter(strlen(?old)>400)
bind(replace(?old,"((.|\\n){400}.*?)( |\\n)(.|\\n)*","$1") as ?new)
}}}
I like to always align the `delete`, `insert` and `where` (select) patterns, so it's immediately obvious what is being replaced.
I also call values being deleted `?old` and values being inserted `?new`.
Such simple discipline makes updates easier for the reader to understand.
Delete Inessential Bookkeeping Info
The ERA ontology includes annotations stating which term came from which clause of the appendices of the RINF regulation, and what is the corresponding XML element name. These are not needed for querying, so we can get rid of them:
prefix : <http://data.europa.eu/949/>
delete where {graph ?g {?x :XMLName ?y}};
delete where {graph ?g {?x :appendixD2Index ?y}};
delete where {graph ?g {?x :appendixD3Index ?y}};
delete where {graph ?g {?x :rinfIndex ?y}};
We also delete `dct:created` and `dct:modified`, which state when a term was created and last updated.
However, we had better be careful to do this only for ontology terms, since these props are often used with instance data as well.
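prefix dct: <http://purl.org/dc/terms/>
# Name the ontology graph explicitly, so that instance data in other graphs is left alone
delete where {graph <https://data-interop.era.europa.eu/era-vocabulary/v3-20240618/ontology.ttl> {?x dct:created ?y}};
delete where {graph <https://data-interop.era.europa.eu/era-vocabulary/v3-20240618/ontology.ttl> {?x dct:modified ?y}};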
Unfortunately, the "Shorten Descriptions" step above did not shorten the ontology sufficiently, so we had to delete all descriptions:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
delete where {graph ?g {?x rdfs:comment ?y}};
Simplify External Terms
The ERA ontology includes external terms (terms reused from other ontologies) with all their accoutrements, which sometimes include long descriptions in 6-7 languages, including Japanese. For example:
<http://www.w3.org/ns/org#identifier> rdfs:comment
"組織を一意に識別するために使用できる会社登録番号などの識別子を与えます。"@ja,
"Código o identificador, como por ejemplo el CIF de una empresa, que permite identificar de forma inequívoca a una organización. Existen muchos códigos de identificación tanto nacionales como internacionales. Esta ontología no obliga al uso de ningún esquema en concreto. Los códigos de identificación utilizados en cada caso se deberían indicar mediante el uso"@es,
"Donne un identifiant, comme par exemple le numéro d'enregistrement d'une entreprise, qui peut être utilisé comme identifiant unique pour l'Organisation. De nombreux schémas nationaux et internationaux sont disponibles. Cette ontologie reste neutre par rapport au schéma utilisé. Le schéma particulier utilisé devrait être indiqué par le `datatype` de la valeur"@fr,
We could delete all descriptions of external terms, or only the non-English descriptions as here:
PREFIX era: <http://data.europa.eu/949/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
delete {graph ?g {?x rdfs:comment ?old}}
where {{graph ?g {?x rdfs:comment ?old
filter(!strstarts(str(?x),str(era:)))
filter(lang(?old) not in ("en",""))
}}}
The filter `!strstarts(str(?x),str(era:))` says "URLs that are not in the `era:` namespace".
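To make the combined effect of the two filters concrete, here is a minimal Python sketch of the same decision. It is only an illustration: `keep_comment` is a hypothetical helper, and the sample rows (subject IRI plus language tag) are made up rather than taken from the ontology.

```python
ERA_NS = "http://data.europa.eu/949/"

def keep_comment(subject_iri, lang):
    """Mirror the two SPARQL filters: a comment is deleted only if its subject
    is outside the era: namespace AND its language tag is neither "en" nor empty."""
    is_external = not subject_iri.startswith(ERA_NS)
    is_non_english = lang not in ("en", "")
    return not (is_external and is_non_english)

# Made-up sample: (subject IRI, language tag of the rdfs:comment)
comments = [
    ("http://www.w3.org/ns/org#identifier", "ja"),
    ("http://www.w3.org/ns/org#identifier", "es"),
    ("http://www.w3.org/ns/org#identifier", "en"),
    ("http://data.europa.eu/949/Track", "fr"),
    ("http://data.europa.eu/949/Track", ""),
]
kept = [row for row in comments if keep_comment(*row)]
print(kept)
```

Note that the comparison with "en" is an exact match, both here and in the SPARQL filter, so a comment tagged with a regional variant such as `en-GB` on an external term would be deleted as well.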
Add Missing Prefixes
The ontology includes a large number of triples like this:
:eddyCurrentBrakingConditionsDocument
rdfs:isDefinedBy <http://data.europa.eu/949/>;
<http://www.w3.org/2003/06/sw-vocab-status/ns#term_status> "stable" .
They don't use prefixes, which lengthens the ontology unnecessarily.
We can add prefixes with an empty `insert data` update: then they are recorded as repository namespaces, and used on export:
prefix era: <http://data.europa.eu/949/>
prefix vs: <http://www.w3.org/2003/06/sw-vocab-status/ns#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix geo: <http://www.opengis.net/ont/geosparql#>
prefix org: <http://www.w3.org/ns/org#>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
insert data {}
Notes:
- I've added a few more prefixes for external terms.
- The ontology uses the empty prefix `:` instead of `era:`. We hoped that by defining the latter prefix, the provenance statements would be shortened:
# from
rdfs:isDefinedBy <http://data.europa.eu/949/>;
# to
rdfs:isDefinedBy era:
But for some reason, this doesn't happen.
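To show how much prefixes can save, here is a minimal Python sketch of IRI compaction against a prefix map. It is an assumption-laden illustration and not how GraphDB serializes on export: `compact` is a hypothetical helper, it picks the longest matching namespace, and it ignores the escaping that Turtle requires for some local names.

```python
PREFIXES = {
    "era": "http://data.europa.eu/949/",
    "vs": "http://www.w3.org/2003/06/sw-vocab-status/ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "org": "http://www.w3.org/ns/org#",
}

def compact(iri, prefixes=PREFIXES):
    """Return prefix:local for the longest matching namespace, else <iri>."""
    best = None
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return f"<{iri}>"
    prefix, namespace = best
    return f"{prefix}:{iri[len(namespace):]}"

full = "http://www.w3.org/2003/06/sw-vocab-status/ns#term_status"
short = compact(full)
print(short, "saves", len(f"<{full}>") - len(short), "chars per occurrence")
```

In this sketch the bare namespace IRI `http://data.europa.eu/949/` compacts to `era:`, which is the shortening we hoped for above; the sketch says nothing about why the GraphDB export does not produce it.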
Custom Chatbot: Talk2PowerSystem
In this section we describe improvements that we have implemented in the Talk2PowerSystem project for Statnett.
Improve Ontology Presentation
If you use "Examine agent instructions" in ERA TTYG, the ontology looks like this:
<http://data.europa.eu/949/> a owl:Ontology .
# Much later
<http://data.europa.eu/949/> owl:imports <http://www.opengis.net/ont/geosparql#>;
cc:license <https://creativecommons.org/licenses/by/4.0/>;
<http://purl.org/dc/elements/1.1/contributor> "Dragos Patru, (ERA)", "Ghislain Atemezing, (ERA)",
"Maarten Duhoux, (ERA)", "Marina Aguado, (ERA)", "Polymnia Vasilopoulou, (ERA)";
<http://purl.org/dc/elements/1.1/creator> "Edna Ruckhaus, (UPM)", "Oscar Corcho, (UPM)".
# Much later
<http://data.europa.eu/949/> dct:publisher "European Union Agency for Railways";
dct:title "ERA Ontology"@en;
rdfs:comment "This is the human and machine readable Ontology governed by the European Union Agency for Railways (https://www.era.europa.eu/). It represents the concepts and relationships linked to the sectorial legal framework and the use cases under the Agency´s remit, as described in the Commission Implementing Regulation (EU) [to be updated after publication] on the common specifications for the register of"@en;
# notice the cutting in the last sentence
rdfs:label "ERA Ontology"@en .
_:node448 a owl:Class .
:minimumConcaveVerticalRadius a owl:FunctionalProperty, owl:DatatypeProperty .
:hasAutomaticDroppingDevice rdfs:domain :VehicleType;
rdfs:range xsd:boolean .
:SubsetWithCommonCharacteristics dct:modified "2023-03-14"^^xsd:date .
:Track dct:modified "2022-07-07"^^xsd:date .
This has numerous defects, which make the ontology hard to understand for humans and LLMs alike:
- The description of an ontology term is often split into several distant blocks
- Terms are not ordered: neither alphabetically, nor in logical groups (Classes, Properties, Individuals)
- OWL Restrictions are presented using Blank nodes
For the purpose of electrical ontology (CIM/CGMES and NC) optimizations, we have evaluated a number of Turtle serialization tools, see Inst4CIM-KG#turtle-serialization. The best we found in Java is the turtle-formatter library by Andreas Textor, and the owl-cli tool (owl-cli-snapshot.jar), see usage guide of the write-command. It has a number of useful features and you can see as an example the pretty-printed EQ profile of the electrical CIM: 61970-600-2_Equipment-AP-Voc-RDFS2020_v3-0-0.ttl. We posted a GraphDB enhancement request for TTYG to use the same formatting techniques.
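The first two defects (scattered statements and missing order) are easy to picture with a small Python sketch that gathers statements by subject and sorts them. This is only an illustration of the idea, not what turtle-formatter or owl-cli do internally: the function names are made up, triples are plain string tuples taken from the excerpt above, and there is no handling of blank nodes, escaping or logical grouping.

```python
from collections import defaultdict

def group_by_subject(triples):
    """Gather all (predicate, object) pairs per subject; sort subjects and pairs."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((p, o))
    return {s: sorted(pairs) for s, pairs in sorted(grouped.items())}

def render(grouped):
    """Render one Turtle-like block per subject (no escaping, no prefix handling)."""
    blocks = []
    for s, pairs in grouped.items():
        body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
        blocks.append(f"{s} {body} .")
    return "\n".join(blocks)

# Statements from the excerpt above, deliberately interleaved
scattered = [
    (":hasAutomaticDroppingDevice", "rdfs:range", "xsd:boolean"),
    (":Track", "dct:modified", '"2022-07-07"^^xsd:date'),
    (":minimumConcaveVerticalRadius", "a", "owl:FunctionalProperty"),
    (":hasAutomaticDroppingDevice", "rdfs:domain", ":VehicleType"),
    (":minimumConcaveVerticalRadius", "a", "owl:DatatypeProperty"),
    (":SubsetWithCommonCharacteristics", "dct:modified", '"2023-03-14"^^xsd:date'),
]
grouped = group_by_subject(scattered)
print(render(grouped))
```

A real formatter goes further, for example by ordering terms into classes, properties and individuals and by inlining blank nodes.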
Subset the Ontology
CIM/CGMES+NC is a very large ontology: 900 classes and about 6000 props. So for Talk2PowerSystem, we subset the ontology to the terms (classes, props and enumeration values) actually used in the data. We use an ontology-query.rq that goes something like this:
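# NOTE: the cim: prefix used below must also be declared (or be defined as a repository namespace); its IRI depends on the CIM version
PREFIX uml: <http://iec.ch/TC57/NonStandard/UML#>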
PREFIX cims: <http://iec.ch/TC57/1999/rdf-schema-extensions-19990926#>
SELECT DISTINCT ?x
{
{ # properties
[] ?x []
} UNION { # classes
[] a ?x
} UNION { # enumeration values
?x a ?enum.
?enum cims:stereotype uml:enumeration
FILTER EXISTS {
?thing a cim:IdentifiedObject; ?prop ?x
}
}
}
It also uses tricks such as federated querying to the same repo to:
- Use inference while discovering terms to include
- Not use inference while describing those terms
- Shorten descriptions
- Etc., as described under the task "Ontology subsetting and presentation to LLM" (this link is not public)
You can see the current result in cim-subset-pretty.ttl
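The core idea of the subsetting query (keep only the properties and classes that actually occur in the data) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: triples are plain string tuples, the instance data (`ex:reset1`, `ex:value1`) is made up, the function names are hypothetical, and the enumeration-value branch, inference and federation are left out.

```python
RDF_TYPE = "rdf:type"

def used_terms(data_triples):
    """Properties are all predicates in the data; classes are objects of rdf:type."""
    properties = {p for _, p, _ in data_triples}
    classes = {o for _, p, o in data_triples if p == RDF_TYPE}
    return properties | classes

def subset_ontology(ontology_triples, data_triples):
    """Keep only the ontology statements whose subject is a used term."""
    keep = used_terms(data_triples)
    return [t for t in ontology_triples if t[0] in keep]

# Made-up instance data
data = [
    ("ex:reset1", RDF_TYPE, "cim:AccumulatorReset"),
    ("ex:reset1", "cim:AccumulatorReset.AccumulatorValue", "ex:value1"),
]
# A few ontology statements
ontology = [
    ("cim:AccumulatorReset", "rdfs:label", '"AccumulatorReset"'),
    ("cim:Control", "rdfs:label", '"Control"'),
    ("cim:AccumulatorReset.AccumulatorValue", "rdfs:range", "cim:AccumulatorValue"),
]
subset = subset_ontology(ontology, data)
print(subset)
```

Without inference the superclass `cim:Control` is dropped here, which is one reason why the real query uses inference while discovering the terms to include.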
Convert to SOML and SOML-Simple
In the Talk2PowerSystem project we'll try NLQ not only through SPARQL generation, but also through GraphQL generation. GraphQL in the Ontotext Platform was controlled by the Semantic Object Modeling Language (SOML).
In GraphDB 11, GraphQL is rolled into the database, and is controlled by a variety of schemas (ontologies and shapes) expressed in RDF: see GraphQL Schema Generation and GraphQL Migration. You can generate such a schema from SOML.
The open source repository https://github.com/vladimirAlexiev/soml includes some tools for working with SOML, including:
- Generate SOML schema from RDFS/OWL/Schema ontologies (owl2soml)
- Simplify a SOML entity schema significantly, so it can be communicated more easily to LLM for querying (soml-simplify)
Take for example these 2 classes and 2 properties from the OP profile:
cim:AccumulatorReset a owl:Class ;
rdfs:label "AccumulatorReset" ;
rdfs:comment "This command resets the counter value to zero."@en ;
cims:belongsToCategory op:Package_OperationProfile ;
cims:stereotype uml:concrete ;
rdfs:subClassOf cim:Control .
cim:Control a owl:Class ;
rdfs:label "Control" ;
rdfs:comment "Control is used for supervisory/device control. It represents control outputs that are used to change the state in a process, e.g. close or open breaker, a set point value or a raise lower command."@en ;
cims:belongsToCategory op:Package_OperationProfile ;
rdfs:subClassOf cim:IOPoint .
cim:AccumulatorReset.AccumulatorValue a owl:ObjectProperty, owl:FunctionalProperty, owl:InverseFunctionalProperty ;
rdfs:label "AccumulatorValue" ;
rdfs:comment "The accumulator value that is reset by the command."@en ;
cims:AssociationUsed "Yes" ;
cims:multiplicity cims:M:1..1 ;
owl:inverseOf cim:AccumulatorValue.AccumulatorReset ;
rdfs:domain cim:AccumulatorReset ;
rdfs:range cim:AccumulatorValue .
cim:Control.PowerSystemResource a owl:ObjectProperty, owl:FunctionalProperty ;
rdfs:label "PowerSystemResource" ;
rdfs:comment "Regulating device governed by this control output."@en ;
cims:AssociationUsed "Yes" ;
cims:multiplicity cims:M:0..1 ;
owl:inverseOf cim:PowerSystemResource.Controls ;
rdfs:domain cim:Control ;
rdfs:range cim:PowerSystemResource .
They are translated to the following SOML:
objects:
AccumulatorReset:
descr: This command resets the counter value to zero
inherits: ControlInterface
label: AccumulatorReset
props:
accumulatorReset.AccumulatorValue: {}
type: cim:AccumulatorReset
ControlInterface:
descr: Abstract superclass of Control
inherits: IdentifiedObjectInterface
kind: abstract
props:
control.PowerSystemResource: {}
properties:
accumulatorReset.AccumulatorValue:
descr: The accumulator value that is reset by the command
inverseOf: accumulatorValue.AccumulatorReset
kind: object
label: AccumulatorValue
max: 1
min: 1
range: AccumulatorValue
rdfProp: cim:AccumulatorReset.AccumulatorValue
control.PowerSystemResource:
descr: 'The controller outputs used to...'
inverseOf: powerSystemResource.Controls
kind: object
label: PowerSystemResource
max: inf
min: 0
range: PowerSystemResourceInterface
rdfProp: cim:Control.PowerSystemResource
`soml-simplify` leaves only the bare essentials needed for querying:
AccumulatorReset:
ISA: ControlInterface
accumulatorReset.AccumulatorValue: AccumulatorValue
ControlInterface:
ISA: IdentifiedObjectInterface
control.PowerSystemResource: [PowerSystemResourceInterface]
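The transformation can be pictured with a minimal Python sketch. It is not the actual soml-simplify tool: the rules below (keep `inherits` as `ISA`, map each property to its range, wrap the range in a list when `max` is `inf`) are only inferred from the example above, and the SOML is written as a Python dict with most fields omitted.

```python
def simplify(soml):
    """Reduce each object to its superclass (ISA) and a map of property -> range."""
    properties = soml.get("properties", {})
    simple = {}
    for name, obj in soml.get("objects", {}).items():
        entry = {}
        if "inherits" in obj:
            entry["ISA"] = obj["inherits"]
        for prop_name in obj.get("props", {}):
            spec = properties.get(prop_name, {})
            rng = spec.get("range")
            # Assumption: a multi-valued property (max: inf) is shown as a list
            entry[prop_name] = [rng] if spec.get("max") == "inf" else rng
        simple[name] = entry
    return simple

soml = {
    "objects": {
        "AccumulatorReset": {
            "inherits": "ControlInterface",
            "props": {"accumulatorReset.AccumulatorValue": {}},
        },
        "ControlInterface": {
            "inherits": "IdentifiedObjectInterface",
            "props": {"control.PowerSystemResource": {}},
        },
    },
    "properties": {
        "accumulatorReset.AccumulatorValue": {"range": "AccumulatorValue", "min": 1, "max": 1},
        "control.PowerSystemResource": {"range": "PowerSystemResourceInterface", "min": 0, "max": "inf"},
    },
}
simplified = simplify(soml)
print(simplified)
```

Labels, descriptions, inverses, cardinalities and RDF mappings are all dropped, which is where most of the size reduction comes from.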
Slide 16 of the presentation "Talk2PowerSystem: Democratizing power system analytics" (Vladimir Alexiev and Svein Harald Olsen, CIM IEC TC57 meeting, EDF R&D, Paris-Saclay, 6 June 2025) shows a size comparison of different schemas of a subset of CIM/CGMES 16 (`Equipment*` + `Geography*` + `model`):
- turtle 1.56M
- SOML 260k (6x shorter)
- simplified 37k (42x shorter)
While we have used SOML-simplified for GraphQL querying, preliminary experiments show that it is also useful for SPARQL querying. But more work is needed to put this in production.
You can see some examples in owl2soml/eg, e.g.:
- dbo.ttl is a version of the DBpedia ontology, which is pretty large (1314k)
- dbo.yaml is the respective SOML (625k, 2x shorter)
- dbo-simplified.yaml is the SOML-simplified (103k, 13x shorter)