# Evaluation Results
We did a few cycles of generating a QA dataset and running the evaluation. The final results can be found in this folder. We randomly shuffle the templates from the QA dataset and split them into three parts: train (80%), dev (10%), and test (10%). The dev split consists of 38 templates and 159 questions; the test split consists of 38 templates and 147 questions.
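A minimal sketch of such a template-level split, assuming the templates are available as a Python list (the variable names, the fixed seed, and the template count are illustrative only; 380 is chosen just so that 10% matches the reported 38 dev/test templates):

```python
import random

# Placeholder for the real QA templates, each carrying its question variants.
templates = [{"id": i, "questions": []} for i in range(380)]

random.seed(42)          # assumed fixed seed so the split is reproducible
random.shuffle(templates)

n = len(templates)
train = templates[: int(0.8 * n)]
dev = templates[int(0.8 * n) : int(0.9 * n)]
test = templates[int(0.9 * n) :]

print(len(train), len(dev), len(test))  # 304 38 38 -> roughly 80% / 10% / 10%
```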
We have 3 experimental setups:
- Using gpt-4.0 as a model
- Using gpt-4.1 as a model
- Using gpt-4.1 as a model together with the N-Shot tool. For the N-Shot tool we index in the vector store all questions from all templates of the train and dev splits. The parameters in the questions are replaced with placeholders: for example, `$ObjectIdentity(0, cim:SubGeographicalRegion)` is replaced with `<SubGeographicalRegion>`, and `$ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP, xsd:float)` with `<float>` (see the sketch after this list).
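The substitution rule in the sketch below is inferred from the two examples above (keep the local name of the `cim:` class for `$ObjectIdentity`, and the `xsd:` datatype for `$ValueFilter`); the regexes are an illustration, not the project's actual implementation, and the real template syntax may cover more parameter types.

```python
import re

def replace_placeholders(question: str) -> str:
    # $ObjectIdentity(<index>, cim:Class) -> <Class>
    question = re.sub(
        r"\$ObjectIdentity\(\s*\d+\s*,\s*cim:(\w+)\s*\)",
        r"<\1>",
        question,
    )
    # $ValueFilter(..., xsd:type) -> <type>
    question = re.sub(
        r"\$ValueFilter\([^)]*xsd:(\w+)\s*\)",
        r"<\1>",
        question,
    )
    return question

# Hypothetical template question, built from the two examples above.
print(replace_placeholders(
    "List generating units in $ObjectIdentity(0, cim:SubGeographicalRegion) with max operating "
    "power above $ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP, xsd:float)"
))
# -> List generating units in <SubGeographicalRegion> with max operating power above <float>
```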
The table below summarizes the results from the three experiments:
| Experiment | Micro Mean Answer Score Dev | Macro Mean Answer Score Dev | Micro Mean Answer Score Test | Macro Mean Answer Score Test |
|---|---|---|---|---|
| gpt-4.0 | 0.4151 | 0.4671 | 0.5170 | 0.5382 |
| gpt-4.1 | 0.6226 | 0.6118 | 0.6395 | 0.6504 |
| gpt-4.1 + n-shot | 0.6415 | 0.6789 | 0.7279 | 0.7434 |
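The micro/macro distinction is assumed here to be the usual one: the micro mean averages the answer score over all questions, while the macro mean first averages within each template and then across templates, so every template counts equally. A small sketch under that assumption:

```python
from collections import defaultdict

def micro_macro(scores):
    """scores: list of (template_id, answer_score) pairs."""
    micro = sum(s for _, s in scores) / len(scores)

    per_template = defaultdict(list)
    for template_id, score in scores:
        per_template[template_id].append(score)
    macro = sum(sum(v) / len(v) for v in per_template.values()) / len(per_template)
    return micro, macro

print(micro_macro([("t1", 1.0), ("t1", 0.0), ("t1", 1.0), ("t2", 0.5)]))
# -> (0.625, 0.5833...): templates with many questions dominate only the micro mean
```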
We see a significant improvement in the results with gpt-4.1 compared to gpt-4.0. However, the improvement from using the N-Shot tool is smaller than expected, especially on the dev split: all of its questions are indexed in the vector store, so the LLM could use the tool to fetch them and generate the expected SPARQL queries.
Additional analysis of the errors is required, as well as exploring and evaluating different ways to index the data for the N-Shot tool.