Monitoring and alerting

Alerting

We use the following alerting systems:

Monitoring

We run a Grafana instance to visualise our metrics: https://grafana.notify.tools/.

We have various dashboards for monitoring application and system health, individual applications and other services (such as Redis and PostgreSQL).

Our Prometheus and Grafana deployment runs on AWS ECS, using AWS Managed Prometheus in a dedicated monitoring account (notify-monitoring) that all other accounts forward metrics to.

We use a local Prometheus in each ECS cluster for endpoint discovery. These local Prometheus instances forward metrics to the central AWS Managed Prometheus instance in the monitoring account. We do not query the local Prometheus servers directly.

Metrics Data Sources

Data Source: AWS CloudWatch

We also export some metrics to AWS CloudWatch. These can be viewed directly in CloudWatch, or via Grafana.
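As an illustration, here is a minimal sketch of publishing a custom metric to CloudWatch with boto3. The namespace, metric name and dimensions below are made up for the example, not the ones Notify actually exports.

```python
import boto3

# Hypothetical example: the namespace, metric name and dimensions are
# illustrative only, not the ones Notify actually exports.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

cloudwatch.put_metric_data(
    Namespace="NotifyExample",
    MetricData=[
        {
            "MetricName": "notifications_created",
            "Dimensions": [{"Name": "Environment", "Value": "staging"}],
            "Value": 1.0,
            "Unit": "Count",
        }
    ],
)
```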

Data Source: Prometheus

We host an AWS Managed Prometheus instance in the notify-monitoring account. Unfortunately the AWS managed service does not provide the normal Prometheus UI, so we only use Grafana to visualise the data.

Stateful Services Monitoring

Database level metrics provided by AWS CloudWatch

Database metrics are collected by AWS and stored in CloudWatch. You can query these metrics from the AWS console or from Grafana.
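For example, a sketch of querying an RDS metric from CloudWatch with boto3. The DB instance identifier here is a placeholder, not one of our real instances.

```python
from datetime import datetime, timedelta, timezone

import boto3

# Sketch only: "notify-db" is a placeholder DB instance identifier.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "notify-db"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)

# Print the last hour of 5-minute CPU averages in time order.
for datapoint in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
    print(datapoint["Timestamp"], datapoint["Average"])
```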

Redis metrics provided by AWS ElastiCache

We use the Redis backing service provided by AWS ElastiCache. Metrics for this service are provided by AWS CloudWatch.

Application Monitoring

Application metrics are available from Prometheus and CloudWatch.

Metric collection

The diagram below shows how we collect metrics to our central AWS managed Prometheus instance:

[Diagram: metric collection]

Notes:

  • A task (web or worker) is a single instance of an application and runs multiple containers (think Kubernetes pod). For example, api-web could be running 30 tasks, each with 3 containers.
  • We run more than one Prometheus in each environment. The AWS Managed Prometheus deduplicates the metrics and stores them.
  • We have a different setup for workers and web apps because the workers don't have a listening HTTP port. Instead we use a statsd-exporter to collect and expose their metrics.
  • At the time of writing we don't believe any of the apps are pushing OTLP data to the OTel sidecar. This is something we could look into in the future.

ECS Prometheus Task Discovery

We use the prometheus-sdconfig-reloader to dynamically add services and tasks to be scraped by Prometheus.

The prometheus-sdconfig-reloader is deployed as a sidecar to the Prometheus container in the ECS task definition. It uses the AWS SDK to generate a Prometheus scrape config file, which it writes to a shared volume that the Prometheus container reads from.

Unfortunately the prometheus-sdconfig-reloader truncates the ecs-services.json file before writing to it, which can cause Prometheus to reload an empty file. We therefore run another sidecar that moves ecs-services.json to ecs-services-real.json, and Prometheus reads the latter file.
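The real sidecar is not this script, but as a sketch of the idea: a small loop that atomically renames the generated file so Prometheus never sees a partially written config. The directory path and poll interval are illustrative assumptions.

```python
import os
import time

# Sketch of what the rename sidecar does: os.replace() is atomic on the same
# filesystem, so Prometheus never reads a truncated ecs-services-real.json.
# The directory path and poll interval here are illustrative assumptions.
CONFIG_DIR = "/output"
SOURCE = os.path.join(CONFIG_DIR, "ecs-services.json")
TARGET = os.path.join(CONFIG_DIR, "ecs-services-real.json")

while True:
    if os.path.exists(SOURCE):
        os.replace(SOURCE, TARGET)
    time.sleep(30)
```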

Adding new application to be scraped

The prometheus-sdconfig-reloader uses AWS Cloud Map to discover services to scrape. Services that want to be scraped need to register themselves with AWS Cloud Map and add the METRICS_PATH and METRICS_PORT tags to the Cloud Map service: METRICS_PATH is the path to the metrics endpoint and METRICS_PORT is the port that endpoint is served on.
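For illustration only, the equivalent tagging with boto3 (in practice the ecs-service Terraform module handles this for us; the resource ARN below is a placeholder):

```python
import boto3

# Illustrative sketch: in practice the ecs-service Terraform module adds these
# tags for us. The resource ARN below is a placeholder.
servicediscovery = boto3.client("servicediscovery", region_name="eu-west-1")

servicediscovery.tag_resource(
    ResourceARN="arn:aws:servicediscovery:eu-west-1:123456789012:service/srv-example",
    Tags=[
        {"Key": "METRICS_PATH", "Value": "/metrics"},
        {"Key": "METRICS_PORT", "Value": "8080"},
    ],
)
```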

If you are using the ecs-service Terraform module with the OTel sidecar, you just need to set prometheus.enabled.

Application level metrics provided by adding gds-metrics-python to our Flask apps

For some of our Flask applications we have added the gds-metrics-python library. This exposes Flask-level metrics for our Python apps at a /metrics endpoint on the application. It also makes it easy to add our own custom application-level metrics.
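A minimal sketch of the wiring, following the pattern in the gds-metrics-python README; the app and the custom counter here are made up, and the exact import paths should be checked against the library's documentation.

```python
from flask import Flask
from gds_metrics import GDSMetrics
from gds_metrics.metrics import Counter

app = Flask(__name__)

# Registers the /metrics endpoint on the application.
metrics = GDSMetrics()
metrics.init_app(app)

# A hypothetical custom application-level metric.
CONFIRMATION_EMAILS_SENT = Counter(
    "confirmation_emails_sent_total",
    "Number of confirmation emails sent",
)


@app.route("/send-confirmation")
def send_confirmation():
    CONFIRMATION_EMAILS_SENT.inc()
    return "sent", 200
```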

AWS ECS Metadata

AWS exposes metadata about each ECS task, which is scraped by the OTel sidecar. Information on the ECS metadata endpoints can be found in the AWS documentation.
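For example, from inside any container in a task you can read the task metadata yourself via the v4 metadata endpoint; a short sketch:

```python
import json
import os
import urllib.request

# ECS injects this environment variable into every container it runs.
metadata_uri = os.environ["ECS_CONTAINER_METADATA_URI_V4"]

# Appending /task returns metadata about the whole task rather than
# just this container.
with urllib.request.urlopen(f"{metadata_uri}/task") as response:
    task_metadata = json.load(response)

print(task_metadata["TaskARN"], task_metadata["Family"])
```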

Data Source: StatsD

Celery workers can't expose a /metrics endpoint to Prometheus, since they have no web frontend. Instead, we have a StatsD sidecar (statsd-exporter), which acts as an intermediary in AWS ECS.

Using StatsD as a way to get metrics into Prometheus isn't ideal, since StatsD does its own manipulation of the data. We should only use StatsD in code run by Celery workers.

Data Source: OTEL

We currently use the OpenTelemetry (OTel) sidecar in ECS to collect metrics at the container level, instead of the AWS default of the task level.

The OTel sidecar scrapes metrics from either the application's /metrics endpoint or the statsd-exporter's /metrics endpoint.

StatsD client

We have several ways we use StatsD in our worker code:

In order to send stats to our StatsD server, Celery workers are configured with STATSD_HOST and STATSD_PORT environment variables. All data is sent over UDP, to avoid the connection to the server becoming a bottleneck in critical worker code.
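This is not Notify's actual StatsD client, but as a sketch of the mechanism: sending a counter increment over UDP using the StatsD line protocol, reading the host and port from the environment variables mentioned above. The metric name is hypothetical.

```python
import os
import socket

# Sketch only: shows the fire-and-forget UDP mechanism described above,
# not Notify's real StatsD client.
STATSD_HOST = os.environ.get("STATSD_HOST", "localhost")
STATSD_PORT = int(os.environ.get("STATSD_PORT", "8125"))


def increment(metric_name, value=1):
    # StatsD line protocol: "<name>:<value>|c" for a counter.
    payload = f"{metric_name}:{value}|c".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP is connectionless, so a slow or absent server never blocks the worker.
        sock.sendto(payload, (STATSD_HOST, STATSD_PORT))
    finally:
        sock.close()


increment("notifications.tasks.example_task.success")  # hypothetical metric name
```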

Monitoring Concourse

You can view metrics around our Concourse CPU usage, worker count, etc. at https://grafana.monitoring.concourse.notify.tools/. Sign in via your GitHub account.