# Evaluation Results
We did a few cycles of generating a QA dataset and running the evaluation. The final results can be found in this folder. We randomly shuffle the templates from the QA dataset and split them into three parts: train (80%), dev (10%), and test (10%). The dev split consists of 38 templates and 159 questions; the test split consists of 38 templates and 147 questions.
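A minimal sketch of such a template-level split, assuming the templates are available as a Python list (the variable names, the fixed seed, and the template count are illustrative only; 380 is chosen just so that 10% matches the reported 38 dev/test templates):

```python
import random

# Placeholder for the real QA templates, each carrying its question variants.
templates = [{"id": i, "questions": []} for i in range(380)]

random.seed(42)          # assumed fixed seed so the split is reproducible
random.shuffle(templates)

n = len(templates)
train = templates[: int(0.8 * n)]
dev = templates[int(0.8 * n) : int(0.9 * n)]
test = templates[int(0.9 * n) :]

print(len(train), len(dev), len(test))  # 304 38 38 -> roughly 80% / 10% / 10%
```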
We have 3 experimental setups:
- Using gpt-4.0 as a model
- Using gpt-4.1 as a model
- Using gpt-4.1 as a model together with the N-Shot tool. For the N-Shot tool we index in the vector store all questions from all templates of the train and dev splits. The parameters in the questions are replaced with placeholders: for example, `$ObjectIdentity(0, cim:SubGeographicalRegion)` is replaced with `<SubGeographicalRegion>`, and `$ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP, xsd:float)` with `<float>` (see the sketch after this list).
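The substitution rule in the sketch below is inferred from the two examples above (keep the local name of the `cim:` class for `$ObjectIdentity`, and the `xsd:` datatype for `$ValueFilter`); the regexes are an illustration, not the project's actual implementation, and the real template syntax may cover more parameter types.

```python
import re

def replace_placeholders(question: str) -> str:
    # $ObjectIdentity(<index>, cim:Class) -> <Class>
    question = re.sub(
        r"\$ObjectIdentity\(\s*\d+\s*,\s*cim:(\w+)\s*\)",
        r"<\1>",
        question,
    )
    # $ValueFilter(..., xsd:type) -> <type>
    question = re.sub(
        r"\$ValueFilter\([^)]*xsd:(\w+)\s*\)",
        r"<\1>",
        question,
    )
    return question

# Hypothetical template question, built from the two examples above.
print(replace_placeholders(
    "List generating units in $ObjectIdentity(0, cim:SubGeographicalRegion) with max operating "
    "power above $ValueFilter(cim:GeneratingUnit, cim:GeneratingUnit.maxOperatingP, xsd:float)"
))
# -> List generating units in <SubGeographicalRegion> with max operating power above <float>
```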
The table below summarizes the results from the three experiments:
| Experiment | Micro Mean Answer Score Dev | Macro Mean Answer Score Dev | Micro Mean Answer Score Test | Macro Mean Answer Score Test |
|---|---|---|---|---|
| gpt-4.0 | 0.4151 | 0.4671 | 0.5170 | 0.5382 |
| gpt-4.1 | 0.6226 | 0.6118 | 0.6395 | 0.6504 |
| gpt-4.1 + n-shot | 0.6415 | 0.6789 | 0.7279 | 0.7434 |
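The micro/macro distinction is assumed here to be the usual one: the micro mean averages the answer score over all questions, while the macro mean first averages within each template and then across templates, so every template counts equally. A small sketch under that assumption:

```python
from collections import defaultdict

def micro_macro(scores):
    """scores: list of (template_id, answer_score) pairs."""
    micro = sum(s for _, s in scores) / len(scores)

    per_template = defaultdict(list)
    for template_id, score in scores:
        per_template[template_id].append(score)
    macro = sum(sum(v) / len(v) for v in per_template.values()) / len(per_template)
    return micro, macro

print(micro_macro([("t1", 1.0), ("t1", 0.0), ("t1", 1.0), ("t2", 0.5)]))
# -> (0.625, 0.5833...): templates with many questions dominate only the micro mean
```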
We see a significant improvement in the results with gpt-4.1 compared to gpt-4.0. However, the improvement from using the N-Shot tool is smaller than expected, especially on the dev split: all of its questions are indexed in the vector store, so the LLM could use the tool to fetch them and generate the expected SPARQL queries.
Additional analysis of the errors is required, as well as exploring and evaluating different ways to index the data for the N-Shot tool.