Components Healthcheck Corrective Actions

Components

The Talk2PowerSystem chatbot now shows a Components page.

  • Due to a bug to be fixed soon, you may need to first go to the Chatbot, then click on Components

It lists all agent settings, ontologies, datasets, GraphDB, and the chatbot backend and frontend components, with their identification, version, date, dependencies, etc.
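For illustration only, the kind of record shown per component could be modelled as below; the field names are assumptions, not the actual backend schema.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentInfo:
    """Illustrative record for one entry on the Components page (field names are assumed)."""

    name: str          # e.g. "chatbot-backend", "GraphDB", or an ontology/dataset IRI
    kind: str          # e.g. "agent setting", "ontology", "dataset", "service"
    version: str       # e.g. "1.2.0-rc4"
    date: str          # ISO 8601 build or publication timestamp
    dependencies: list[str] = field(default_factory=list)  # components this one depends on
```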

As I write this, the bot:

  • Has version 1.2.0-rc4 of 2025-10-27T15:24:19Z with git SHA 8cb067d72eedfdbf0cf29b14d8be33f613ce827b
  • Runs on Python 3.12.11 (main, Sep 30 2025, 00:38:52) built with GCC 14.2.0
  • Runs on GraphDB version 11.1.1+sha.82602bfa

It shows the 25 or so ontologies in the database, and the 43 datasets.

  • For example, one of the datasets is the Equipment (EQ) part of the Nordic 44-bus synthetic test model of the Nordic region, developed by Statnett SF, dated 2025-02-14
  • Clicking on any ontology or dataset shows all its metadata in GraphDB (see the query sketch below)
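As a minimal sketch, the metadata behind such a click could be fetched straight from GraphDB's SPARQL endpoint with a DESCRIBE query; the repository name and dataset IRI below are placeholders, not the real ones.

```python
import requests

# Assumed repository URL; substitute the real GraphDB host and repository id.
GRAPHDB_SPARQL = "http://localhost:7200/repositories/talk2powersystem"

# DESCRIBE returns every triple about the chosen resource; the IRI is a placeholder.
QUERY = "DESCRIBE <urn:example:nordic44-eq>"

resp = requests.get(
    GRAPHDB_SPARQL,
    params={"query": QUERY},
    headers={"Accept": "text/turtle"},  # Turtle keeps the metadata readable
    timeout=30,
)
resp.raise_for_status()
print(resp.text)
```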

The reason we show all this info, and record it for every evaluation campaign, is so that we can report the precise components that were used for a particular evaluation.

In the near future we'll add more functionality, as described below: monitoring, security, and technical requirements to guarantee a certain level of availability. Task: #193 design Monitoring system (Health Check, Notification, Corrective Action)

Monitoring (Healthchecks)

Currently we have some NGinx checks, plus specific checks in the chatbot backend, which sometimes lead to improper corrective actions, too many restarts, and unclear error information reaching the user. The chatbot code checks that (a standalone sketch of these checks follows the list):

  • The semantic database (GraphDB) is up
  • The Autocomplete index is enabled and up to date (this, together with RDFRank, is used by the Identify Object tool, which resolves "string to thing", e.g. "Arendal" to the URN of that Substation power system resource)
  • RDFRank is enabled and up to date
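A minimal, standalone sketch of those three checks, assuming a repository named talk2powersystem and the control predicates documented for GraphDB's Autocomplete and RDFRank plugins (verify both against the GraphDB version in use):

```python
import requests

# Assumed repository URL; substitute the real GraphDB host and repository id.
SPARQL_ENDPOINT = "http://localhost:7200/repositories/talk2powersystem"


def run_query(query: str) -> dict:
    """Run a SPARQL query and return the JSON results; raises on HTTP errors."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def graphdb_is_up() -> bool:
    """GraphDB counts as up if a trivial ASK query succeeds."""
    try:
        return "boolean" in run_query("ASK {}")
    except requests.RequestException:
        return False


def autocomplete_enabled() -> bool:
    """Read the Autocomplete plugin's 'enabled' flag (predicate assumed from the GraphDB docs)."""
    bindings = run_query(
        "SELECT ?o WHERE { ?s <http://www.ontotext.com/plugins/autocomplete#enabled> ?o }"
    )["results"]["bindings"]
    return bool(bindings) and bindings[0]["o"]["value"] == "true"


def rdf_rank_status() -> str:
    """Read the RDFRank plugin's status, e.g. 'COMPUTED' (predicate assumed from the GraphDB docs)."""
    bindings = run_query(
        "SELECT ?o WHERE { ?s <http://www.ontotext.com/owlim/RDFRank#status> ?o }"
    )["results"]["bindings"]
    return bindings[0]["o"]["value"] if bindings else "UNKNOWN"


if __name__ == "__main__":
    print("GraphDB up:", graphdb_is_up())
    print("Autocomplete enabled:", autocomplete_enabled())
    print("RDFRank status:", rdf_rank_status())
```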

But these checks don't belong in the chatbot code because they couple it tightly to underlying components.

Instead, we will select a monitoring framework (NGinx, which is intended for simpler checks, or Amazon CloudWatch), develop independent checks per component, and declare component dependencies to ensure "tiered availability".
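As an illustration of "tiered availability", the checks and their dependencies could be declared in a small registry like the sketch below; the component names and the placeholder check functions are assumptions.

```python
# Each component declares its own check plus the components it depends on.
# When a dependency is already down, the dependent is reported as DEGRADED
# instead of being blindly restarted, which is the point of tiered availability.
CHECKS: dict[str, dict] = {
    "graphdb":          {"check": lambda: True, "depends_on": []},
    "chatbot-backend":  {"check": lambda: True, "depends_on": ["graphdb"]},
    "chatbot-frontend": {"check": lambda: True, "depends_on": ["chatbot-backend"]},
}


def evaluate() -> dict[str, str]:
    """Return a status per component: UP, DOWN, or DEGRADED (a dependency is not UP)."""
    status: dict[str, str] = {}

    def visit(name: str) -> str:
        if name not in status:
            deps = [visit(d) for d in CHECKS[name]["depends_on"]]
            if any(d != "UP" for d in deps):
                status[name] = "DEGRADED"
            else:
                status[name] = "UP" if CHECKS[name]["check"]() else "DOWN"
        return status[name]

    for component in CHECKS:
        visit(component)
    return status


if __name__ == "__main__":
    print(evaluate())  # e.g. {'graphdb': 'UP', 'chatbot-backend': 'UP', 'chatbot-frontend': 'UP'}
```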

The healthcheck will take care of details such as:

  • Initial startup, which takes longer: the service must be kept out of rotation at the gateway until it is fully up
  • Notification to admins via email or PagerDuty (a sketch follows this list)
  • The Components page will be extended with Healthcheck status, so the sysadmin or end user can see the health of the various components
  • The Chatbot page will show a red banner linking to the detailed Components page whenever there is a problem
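For the notification part, here is a minimal sketch assuming PagerDuty's Events API v2 and a local SMTP relay; the routing key, addresses, and hostnames are placeholders.

```python
import smtplib
from email.message import EmailMessage

import requests


def notify_pagerduty(routing_key: str, summary: str) -> None:
    """Trigger a PagerDuty incident through the Events API v2."""
    resp = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": routing_key,  # integration key of the PagerDuty service (placeholder)
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "talk2powersystem-healthcheck",
                "severity": "error",
            },
        },
        timeout=10,
    )
    resp.raise_for_status()


def notify_email(summary: str) -> None:
    """Send a plain-text alert through a local SMTP relay (host and addresses are placeholders)."""
    msg = EmailMessage()
    msg["Subject"] = f"[Talk2PowerSystem] healthcheck alert: {summary}"
    msg["From"] = "healthcheck@example.org"
    msg["To"] = "admins@example.org"
    msg.set_content(summary)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```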

Corrective Actions

Corrective actions will depend on the component and may include (a sketch follows the list):

  • Restarting a service, with up to N retries
  • Rebuilding required indexes (e.g., Autocomplete or RDFRank)
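A minimal sketch of such corrective actions, assuming the services run as Docker containers and that RDFRank recomputation is triggered through the plugin's control predicate (container name, retry policy, repository URL, and predicate are assumptions):

```python
import subprocess
import time
from typing import Callable

import requests

# Assumed repository update endpoint; substitute the real GraphDB host and repository id.
SPARQL_UPDATE = "http://localhost:7200/repositories/talk2powersystem/statements"


def restart_with_retries(container: str, is_healthy: Callable[[], bool], max_retries: int = 3) -> bool:
    """Restart a Docker container up to max_retries times, waiting for it to report healthy."""
    for attempt in range(1, max_retries + 1):
        subprocess.run(["docker", "restart", container], check=True)
        time.sleep(10 * attempt)  # simple linear backoff before re-checking
        if is_healthy():  # e.g. the graphdb_is_up() check from the monitoring sketch above
            return True
    return False  # give up and leave it to the notification path


def rebuild_rdf_rank() -> None:
    """Ask GraphDB to recompute RDFRank (control predicate assumed from the GraphDB docs)."""
    resp = requests.post(
        SPARQL_UPDATE,
        data={"update": "INSERT DATA { [] <http://www.ontotext.com/owlim/RDFRank#compute> [] }"},
        timeout=30,
    )
    resp.raise_for_status()
```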