# Quality Evaluation
Evaluating the quality of AI software systems is essential for informed decision-making, cost-benefit analysis, and ensuring reliable system behavior. Talk2PowerSystem Chat is a ReAct-based agent, and while various evaluation frameworks exist for assessing such systems, most rely on using a large language model (LLM) as a judge to measure accuracy. A comprehensive review of these frameworks and tools is available here.
Although LLM-based evaluation can be effective in certain contexts, it presents two major issues: (1) it is resource-intensive and time-consuming, and (2) it introduces uncertainty, as LLMs are inherently nondeterministic and not always reliable. Relying on one LLM to evaluate another LLM can become an evaluation problem in itself — what might be called a "paradox of evaluation".
Inspired by the approach in "CrunchQA: A Synthetic Dataset for Question Answering over Crunchbase Knowledge Graph", we chose instead to build a QA dataset with expected tool calls and corresponding outputs to measure the system quality. Our evaluation focuses on the agent’s ability to interpret natural language questions and execute the correct sequence of tool invocations with appropriate arguments. Rather than scoring the quality of the natural language response, we assess the model’s understanding of the query and the data it retrieves to answer it. This approach aligns with the goals of the Talk2PowerSystem project, which emphasizes transparency and explainability alongside accuracy.
To support this evaluation methodology, we developed a Python library called qa-eval that is agnostic to the underlying implementation or LLM model used. It accepts as input a QA dataset and the system’s output, and evaluates not only the accuracy of the responses, but also the execution performance and estimated cost—calculated based on token usage statistics. This makes it suitable for comparing different agent implementations in a consistent and reproducible manner.
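For illustration, a single entry in such a QA dataset might look roughly like the sketch below. The field names and the example question are assumptions made for this sketch, not the actual qa-eval schema.

```python
# Hypothetical shape of one QA dataset entry; the field names and values are
# illustrative only and do not reflect the actual qa-eval schema.
qa_entry = {
    "question": "Which substations are connected to line NO-L123?",
    # Expected tool calls: a list of steps; each step is a list of tool calls
    # that may be issued together.
    "expected_tool_calls": [
        [
            {
                "tool": "sparql_query",
                "arguments": {"query": "SELECT ?substation WHERE { ... }"},
            }
        ],
    ],
    # Expected outputs of the final-step tool call(s), used for matching
    # against the system's actual tool outputs.
    "expected_outputs": [
        [{"substation": "Substation A"}, {"substation": "Substation B"}],
    ],
}
```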
The accuracy for a single question from the QA dataset is calculated under the following assumptions:
- For each question, the QA dataset defines a sequence of expected tool calls. Multiple tools can be invoked at the same step, hence the expected tool calls are represented as a list of lists, e.g., `[[t₁, ..., tₙ], ..., [tₖ, ..., tₘ]]`.
- We assume that the final step in this sequence, `[tₖ, ..., tₘ]`, contains the tool(s) responsible for retrieving the data needed to answer the question.
Accuracy is computed by comparing the outputs of the expected tool calls in the final step, `[tₖ, ..., tₘ]`, against the outputs of the system's actual tool calls. If none of the expected tool calls in the final step has a matching output in the system's response, the accuracy is 0. Otherwise, accuracy is defined as the number of matched tool calls in the final step divided by the total number of expected tool calls in that step.
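The rule above can be summarized in a short sketch. This is not the qa-eval implementation; it assumes the expected and actual final-step outputs are available as plain values and that a tool-specific `match` predicate decides whether two outputs are equivalent.

```python
def final_step_accuracy(expected_final_outputs, actual_outputs, match):
    """Accuracy for a single question, as described above.

    expected_final_outputs: outputs of the expected tool calls in the final step.
    actual_outputs: outputs of the tool calls the system actually made.
    match: tool-specific predicate deciding whether two outputs are equivalent.
    """
    if not expected_final_outputs:
        return 0.0
    # Count expected final-step tool calls whose output is matched by at least
    # one of the system's actual tool-call outputs.
    matched = sum(
        1
        for expected in expected_final_outputs
        if any(match(expected, actual) for actual in actual_outputs)
    )
    # If nothing is matched this yields 0; otherwise it is the matched
    # fraction of the final step.
    return matched / len(expected_final_outputs)
```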
Whether two tool calls match depends on the tool itself:
- SPARQL query tool: deciding whether two SPARQL queries are semantically equivalent is an undecidable problem [1], so instead of comparing the queries themselves, we compare their outputs and check whether those are equivalent (see the sketch below).
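For SPARQL SELECT results, one way to check output equivalence is to treat each result as a multiset of variable bindings and ignore row order. The sketch below assumes both results use the same variable names and that values are already comparable as plain Python objects; it is not the qa-eval implementation.

```python
from collections import Counter

def sparql_results_equivalent(results_a, results_b):
    """Compare two SPARQL SELECT results as multisets of rows, ignoring order.

    Each result is assumed to be a list of dicts mapping variable names to
    plain values (e.g. a SPARQL JSON response with the bindings flattened).
    """
    def normalize(results):
        # Represent each row as a frozenset of (variable, value) pairs so that
        # column order is irrelevant; Counter keeps row multiplicities.
        return Counter(frozenset(row.items()) for row in results)

    return normalize(results_a) == normalize(results_b)
```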
[1]: Melisachew Wudage Chekol, Jérôme Euzenat, Pierre Genevès, Nabil Layaïda. Evaluating and benchmarking SPARQL query containment solvers. In Proceedings of the 12th International Semantic Web Conference (ISWC), Sydney, Australia, October 2013, pp. 408-423. doi:10.1007/978-3-642-41338-4_26. hal-00917911.