# Quality Evaluation
Evaluating the quality of AI software systems is essential for informed decision-making, cost-benefit analysis, and ensuring reliable system behavior. Talk2PowerSystem Chat is a ReAct-based agent, and while various evaluation frameworks exist for assessing such systems, most rely on using a large language model (LLM) as a judge to measure accuracy. A comprehensive review of these frameworks and tools is available here.
Although LLM-based evaluation can be effective in certain contexts, it presents two major issues: (1) it is resource-intensive and time-consuming, and (2) it introduces uncertainty, as LLMs are inherently nondeterministic and not always reliable. Relying on one LLM to evaluate another LLM can become an evaluation problem in itself — what might be called a "paradox of evaluation".
Inspired by the approach in "CrunchQA: A Synthetic Dataset for Question Answering over Crunchbase Knowledge Graph", we chose instead to build a QA dataset with expected tool calls and corresponding outputs to measure the system quality. Our evaluation focuses on the agent’s ability to interpret natural language questions and execute the correct sequence of tool invocations with appropriate arguments. Rather than scoring the quality of the natural language response, we assess the model’s understanding of the query and the data it retrieves to answer it. This approach aligns with the goals of the Talk2PowerSystem project, which emphasizes transparency and explainability alongside accuracy.
To support this evaluation methodology, we developed a Python library called qa-eval that is agnostic to the underlying implementation or LLM model used. It accepts as input a QA dataset and the system’s output, and evaluates not only the accuracy of the responses, but also the execution performance and estimated cost—calculated based on token usage statistics. This makes it suitable for comparing different agent implementations in a consistent and reproducible manner.
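For illustration, a single entry in such a QA dataset might look roughly like the sketch below. The field names and the example question are assumptions made for this sketch, not the actual qa-eval schema.

```python
# Hypothetical shape of one QA dataset entry; the field names and values are
# illustrative only and do not reflect the actual qa-eval schema.
qa_entry = {
    "question": "Which substations are connected to line NO-L123?",
    # Expected tool calls: a list of steps; each step is a list of tool calls
    # that may be issued together.
    "expected_tool_calls": [
        [
            {
                "tool": "sparql_query",
                "arguments": {"query": "SELECT ?substation WHERE { ... }"},
            }
        ],
    ],
    # Expected outputs of the final-step tool call(s), used for matching
    # against the system's actual tool outputs.
    "expected_outputs": [
        [{"substation": "Substation A"}, {"substation": "Substation B"}],
    ],
}
```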
The accuracy for a single question from the QA dataset is calculated under the following assumptions:
- For each question, the QA dataset defines a sequence of expected tool calls. Multiple tools can be invoked at the same step, hence the expected tool calls are represented as a list of lists, e.g., `[[t₁, ..., tₙ], ..., [tₖ, ..., tₘ]]`.
- We assume that the final step in this sequence, `[tₖ, ..., tₘ]`, contains the tool(s) responsible for retrieving the data needed to answer the question.
Accuracy is computed by comparing the outputs of the expected tool calls in the final step, `[tₖ, ..., tₘ]`, against the outputs of the system's actual tool calls. If none of the expected tool calls in the final step has a matching output in the system's response, the accuracy is 0. Otherwise, accuracy is defined as the number of matched tool calls in the final step divided by the total number of expected tool calls in that step.
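The rule above can be summarized in a short sketch. This is not the qa-eval implementation; it assumes the expected and actual final-step outputs are available as plain values and that a tool-specific `match` predicate decides whether two outputs are equivalent.

```python
def final_step_accuracy(expected_final_outputs, actual_outputs, match):
    """Accuracy for a single question, as described above.

    expected_final_outputs: outputs of the expected tool calls in the final step.
    actual_outputs: outputs of the tool calls the system actually made.
    match: tool-specific predicate deciding whether two outputs are equivalent.
    """
    if not expected_final_outputs:
        return 0.0
    # Count expected final-step tool calls whose output is matched by at least
    # one of the system's actual tool-call outputs.
    matched = sum(
        1
        for expected in expected_final_outputs
        if any(match(expected, actual) for actual in actual_outputs)
    )
    # If nothing is matched this yields 0; otherwise it is the matched
    # fraction of the final step.
    return matched / len(expected_final_outputs)
```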
Whether two tool calls match depends on the tool itself:
- SPARQL query tool: deciding whether two SPARQL queries are semantically equivalent is an undecidable problem [1], so instead of comparing the queries themselves, we compare their outputs and check whether those are equivalent (see the sketch below).
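For SPARQL SELECT results, one way to check output equivalence is to treat each result as a multiset of variable bindings and ignore row order. The sketch below assumes both results use the same variable names and that values are already comparable as plain Python objects; it is not the qa-eval implementation.

```python
from collections import Counter

def sparql_results_equivalent(results_a, results_b):
    """Compare two SPARQL SELECT results as multisets of rows, ignoring order.

    Each result is assumed to be a list of dicts mapping variable names to
    plain values (e.g. a SPARQL JSON response with the bindings flattened).
    """
    def normalize(results):
        # Represent each row as a frozenset of (variable, value) pairs so that
        # column order is irrelevant; Counter keeps row multiplicities.
        return Counter(frozenset(row.items()) for row in results)

    return normalize(results_a) == normalize(results_b)
```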
[1]: Melisachew Wudage Chekol, Jérôme Euzenat, Pierre Genevès, Nabil Layaïda. Evaluating and benchmarking SPARQL query containment solvers. In Proceedings of the 12th International Semantic Web Conference (ISWC), Sydney, Australia, October 2013, pp. 408-423. doi:10.1007/978-3-642-41338-4_26. hal-00917911.