# Monitoring and alerting
## Alerting
We use the following alerting systems:

- Emails to notify-support@
  - Elastalert (configured as part of the Logit ELK stack and stored as code in notifications-aws).
  - Cronitor (configured in our code).
- CloudWatch Alarms (configured in our Terraform code).
- Slack
  - Functional tests.
  - PagerDuty alerts.
- Zendesk tickets
  - Automated tickets from checks in the API app.
  - Automated tickets generated from the Cyber Security team's Splunk (see Production Access for more information).
## Monitoring
We run a Grafana instance to visualise our metrics: https://grafana.notify.tools/.
We have various dashboards for monitoring application system health, individual applications and other services (such as Redis & PostgreSQL).
Our Prometheus and Grafana deployment runs on AWS ECS, using AWS's managed Prometheus service in a dedicated monitoring account (notify-monitoring), which all other accounts forward metrics to.
We use a local Prometheus in each ECS cluster for endpoint discovery. These local Prometheus servers forward metrics to a central AWS managed Prometheus instance in the monitoring account; we do not query them directly.
### Metrics Data Sources
#### Data Source: AWS CloudWatch
We also export some metrics to AWS CloudWatch. These can be viewed directly in CloudWatch, or via Grafana.
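The same metrics can also be pulled programmatically with the CloudWatch API if you want to check a value outside of Grafana. A minimal sketch, assuming boto3 and a placeholder region, namespace, metric and dimension:

```python
# Minimal sketch of pulling a CloudWatch metric with boto3. The region,
# namespace, metric name and dimensions are placeholders, not necessarily
# ones we publish.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # assumption: region

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ECS",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "ClusterName", "Value": "example-cluster"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
        }
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
)

for result in response["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"], result["Values"])))
```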
#### Data Source: Prometheus
We host our own AWS Managed Prometheus instance in the notify-monitoring account. Unfortunately, the AWS managed service doesn't give you the normal Prometheus UI, so we only use Grafana to visualise the data.
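If you do need to run an ad-hoc PromQL query without Grafana, the workspace's query API can be called directly with SigV4-signed requests. This is a rough sketch rather than part of our tooling; the region and workspace ID are placeholders:

```python
# Rough sketch of querying the AWS Managed Prometheus workspace directly,
# since the managed service has no built-in Prometheus UI.
# The region and workspace ID below are placeholders.
import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

REGION = "eu-west-1"          # assumption: the workspace's region
WORKSPACE_ID = "ws-xxxxxxxx"  # placeholder AMP workspace ID
URL = f"https://aps-workspaces.{REGION}.amazonaws.com/workspaces/{WORKSPACE_ID}/api/v1/query"

params = {"query": "up"}  # any PromQL instant query

# Sign the request with SigV4 for the "aps" (Amazon Managed Prometheus) service.
credentials = boto3.Session().get_credentials()
aws_request = AWSRequest(method="GET", url=URL, params=params)
SigV4Auth(credentials, "aps", REGION).add_auth(aws_request)

response = requests.get(URL, params=params, headers=dict(aws_request.headers.items()))
response.raise_for_status()
print(response.json()["data"]["result"])
```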
### Stateful Services Monitoring
#### Database level metrics provided by AWS CloudWatch
Database metrics are collected by AWS and stored in CloudWatch. You can query these metrics from the AWS console or from Grafana.
#### Redis metrics provided by AWS ElastiCache
We use the Redis backing service provided by AWS ElastiCache. Metrics for this service are provided by AWS CloudWatch.
### Application Monitoring
Application metrics are available from Prometheus and CloudWatch.
#### Metric collection
The diagram below shows how we collect metrics into our central AWS managed Prometheus instance:
Notes:
- A task (web or worker) is a single instance of an application and runs multiple containers (think Kubernetes pod). For example, api-web could be running 30 tasks, each with 3 containers.
- We run more than one Prometheus in each environment. The AWS managed Prometheus deduplicates the metrics and stores them.
- We have a different worker and web setup because the workers don't have a listening HTTP port. As such, we use a statsd-exporter to collect and expose the metrics.
- At the time of writing, we don't believe any of the apps are pushing OTLP data to the OTel sidecar. This is something we could look into in the future.
#### ECS Prometheus Task Discovery
We use the prometheus-sdconfig-reloader to dynamically add services and tasks to be scraped by Prometheus.
The prometheus-sdconfig-reloader is deployed as a sidecar to the Prometheus container in the ECS task definition. It uses the AWS SDK to generate a Prometheus scrape config file. This config file is then written to a shared volume that the Prometheus container reads from.
Unfortunately, the prometheus-sdconfig-reloader truncates the ecs-services.json file before writing to it, which can cause Prometheus to load an empty file. As such, we have another sidecar that moves (mv) ecs-services.json to an ecs-services-real.json file, which is the file Prometheus actually reads.
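Conceptually that sidecar is just a small copy-and-rename loop. The sketch below is an illustrative stand-in rather than the real sidecar (the paths and polling interval are assumptions); the important part is the atomic rename, so Prometheus never sees a half-written file:

```python
# Illustrative stand-in for the "mv" sidecar: copy the reloader's output to the
# file Prometheus reads, via an atomic rename so Prometheus never loads a
# truncated config. Paths and the interval are assumptions.
import shutil
import time
from pathlib import Path

SOURCE = Path("/output/ecs-services.json")        # written (and truncated) by the reloader
TARGET = Path("/output/ecs-services-real.json")   # file_sd file Prometheus actually reads

while True:
    if SOURCE.exists():
        tmp = TARGET.with_suffix(".tmp")
        shutil.copyfile(SOURCE, tmp)
        # rename on the same filesystem is atomic, so Prometheus sees either
        # the old file or the new one, never a partial write
        tmp.replace(TARGET)
    time.sleep(30)
```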
##### Adding a new application to be scraped
The prometheus-sdconfig-reloader uses AWS Cloud Map to discover services to scrape. Services that want to be scraped need to register themselves with AWS Cloud Map and add two tags to the service: METRICS_PATH (the path of the metrics endpoint) and METRICS_PORT (the port the metrics endpoint is listening on).
If you are using the ecs-service Terraform module with the OTel sidecar, you just need to have prometheus.enabled set.
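To make the discovery mechanism concrete, the following sketch lists Cloud Map services with boto3, keeps only those carrying METRICS_PATH and METRICS_PORT tags, and emits file_sd-style targets. It is an illustration of the idea, not the reloader's actual code; the region is an assumption:

```python
# Illustrative sketch (not the real reloader) of discovering scrape targets
# from AWS Cloud Map using the METRICS_PATH / METRICS_PORT tags.
import json

import boto3

sd = boto3.client("servicediscovery", region_name="eu-west-1")  # assumption: region

targets = []
for service in sd.list_services()["Services"]:
    tags = {
        t["Key"]: t["Value"]
        for t in sd.list_tags_for_resource(ResourceARN=service["Arn"])["Tags"]
    }
    if "METRICS_PATH" not in tags or "METRICS_PORT" not in tags:
        continue  # service hasn't opted in to scraping

    for instance in sd.list_instances(ServiceId=service["Id"])["Instances"]:
        ip = instance["Attributes"].get("AWS_INSTANCE_IPV4")
        if ip:
            targets.append(
                {
                    "targets": [f"{ip}:{tags['METRICS_PORT']}"],
                    "labels": {"__metrics_path__": tags["METRICS_PATH"]},
                }
            )

# file_sd_configs-style JSON that Prometheus can read
print(json.dumps(targets, indent=2))
```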
#### Application level metrics provided by adding gds-metrics-python to our Flask apps
For some of our Flask applications we have added the gds-metrics-python library. This exposes Flask-level metrics for our Python apps at a /metrics endpoint on the application. It is also easy to add our own custom application-level metrics using this.
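A minimal sketch of what that wiring looks like, assuming the library follows the standard Flask extension pattern and that custom metrics use prometheus_client-style objects (check the gds-metrics-python README for the exact API; the route and metric names are made up):

```python
# Minimal sketch, assuming gds-metrics-python follows the standard Flask
# extension (init_app) pattern; check the library's README for the exact API.
from flask import Flask
from gds_metrics import GDSMetrics
from prometheus_client import Counter

app = Flask(__name__)

# Exposes request-level metrics at /metrics on the app.
metrics = GDSMetrics()
metrics.init_app(app)

# A custom application-level metric (hypothetical name).
LETTERS_SENT = Counter("letters_sent_total", "Number of letters sent")


@app.route("/send-letter")  # hypothetical route
def send_letter():
    LETTERS_SENT.inc()
    return "ok"
```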
#### AWS ECS Metadata
AWS exposes metadata about the ECS task. This is scraped by the OTel sidecar. Information on the ECS metadata endpoints can be found in the AWS documentation.
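For reference, the same metadata can be read from inside any container via the endpoint ECS injects as an environment variable; a quick sketch:

```python
# Quick sketch: fetch ECS metadata from inside a container using the
# metadata endpoint (v4) that ECS injects as an environment variable.
import os

import requests

metadata_uri = os.environ["ECS_CONTAINER_METADATA_URI_V4"]

container = requests.get(metadata_uri).json()        # this container's metadata
task = requests.get(f"{metadata_uri}/task").json()   # the whole task's metadata

print(container.get("Name"), task.get("TaskARN"))
```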
#### Data Source: StatsD
Celery workers can't expose a /metrics endpoint to Prometheus, since they have no web frontend. Instead, we have a StatsD sidecar (statsd-exporter), which acts as an intermediary in AWS ECS.
Using StatsD as a way to get metrics into Prometheus isn't ideal, since StatsD does its own manipulation of the data. We should only use StatsD in code run by Celery workers.
#### Data Source: OTEL
We currently use the OpenTelemetry (OTel) sidecar in ECS to allow us to get metrics at the container level instead of the AWS default of the task level.
The OTel sidecar scrapes the metrics from either the application's /metrics endpoint or the statsd-exporter's /metrics endpoint.
#### StatsD client
We use StatsD in our worker code in a couple of ways:

- Inside our Celery source code. This means we get timing and counter data about all tasks.
- Ad-hoc uses of the @statsd decorator on lower-level functions. The decorator is provided by notifications-utils.
In order to send stats to our StatsD server, Celery workers are configured with STATSD_HOST and STATSD_PORT environment variables. All data is sent over UDP, to avoid the connection to the server becoming a bottleneck in critical worker code.
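To make the fire-and-forget behaviour concrete, here is a stripped-down sketch of a StatsD timing decorator in the spirit of the one in notifications-utils. The real implementation differs; the namespace, metric names and task function are illustrative:

```python
# Stripped-down sketch of a StatsD timing decorator, in the spirit of the one
# in notifications-utils (the real implementation differs). Metrics go over
# UDP, so a slow or absent StatsD server never blocks worker code.
import functools
import os
import socket
import time

STATSD_HOST = os.environ.get("STATSD_HOST", "localhost")
STATSD_PORT = int(os.environ.get("STATSD_PORT", 8125))

_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def statsd(namespace):
    """Record a timer and a counter for the decorated function."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                stat = f"{namespace}.{func.__name__}"
                # Standard StatsD wire format: <name>:<value>|<type>
                _sock.sendto(f"{stat}:{elapsed_ms:.3f}|ms".encode(), (STATSD_HOST, STATSD_PORT))
                _sock.sendto(f"{stat}:1|c".encode(), (STATSD_HOST, STATSD_PORT))

        return wrapper

    return decorator


@statsd("tasks")  # hypothetical namespace
def deliver_email():
    ...
```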
## Monitoring Concourse
You can view metrics around our Concourse CPU usage, worker count, etc. at https://grafana.monitoring.concourse.notify.tools/. Sign in via your GitHub account.