How to view and query our logs

We have three logging systems:

  • Logit / Kibana - logs for Notify apps and CloudFront. Retention is limited to 14 days.
  • AWS CloudWatch - logs for Notify apps and things in AWS, such as EC2 instances.
  • AWS Athena - logs for CloudFront and AWS load balancers.

Logit / Kibana

This is our preferred way to search logs for Notify apps, since it's easier to search across all logs or a subset of them. You may need to switch to CloudWatch if you need logs over a longer period (more than 14 days), or if you want to do additional stats processing.

How to search for error logs

levelname:ERROR OR status:[500 TO 599]

How to use it more generally

By default, all searches do a partial, case-insensitive match:

# search all fields for a string
"SES bounce"

# search a specific field (case insensitive)
levelname: error

# search a specific field (partial match)
path: "/documents"

You can also force an exact match on a field by adding .keyword:

# all logs for a specific Celery worker
application_name.keyword: notify-delivery-worker-periodic

Spaces are interpreted as "or". Use AND or && to require all terms to match, e.g.

# logs containing 'pdfs' and 'upload' for a worker
application_name: notify-delivery-worker-periodic AND pdfs AND upload
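
If you build these searches in a script (for example to paste into Kibana or share with a colleague), the rules above can be sketched in a few lines of Python. This is only an illustration: the helper names `field_term` and `all_of` are made up for this example, and it does not escape double quotes inside values.

```python
import re


def field_term(field, value, exact=False):
    """Build a `field: value` clause.

    exact=True appends `.keyword` to the field name to force an exact match.
    """
    name = f"{field}.keyword" if exact else field
    # Quote anything that is not a simple word, e.g. phrases or paths
    if not re.fullmatch(r"[\w-]+", value):
        value = f'"{value}"'
    return f"{name}: {value}"


def all_of(*clauses):
    """Join clauses with AND, because bare spaces are interpreted as "or"."""
    return " AND ".join(clauses)


query = all_of(
    field_term("application_name", "notify-delivery-worker-periodic", exact=True),
    "pdfs",
    "upload",
)
print(query)
```

The printed string can be pasted straight into the Kibana search bar.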

AWS CloudWatch

We originally started using CloudWatch due to an issue with the syslog-drain mechanism we use for Kibana. While Kibana logs should be reliable, CloudWatch is still useful for long-term storage (1 year vs. 14 days), and it can be easier to use for advanced processing of data within log messages.

CloudWatch logs can be found in the AWS account the service is running in.

How to use it

  • Go to "Logs > Log groups" to see what's available. Each log group is roughly equivalent to a normal log file.
  • Go to "Logs > Insights" to query the logs. You need to select a log group in order to run a query.
  • On the right-hand side, click the "Queries" folder to view any saved queries for that environment.
  • Running a simple query and then expanding the first result is a good way to see the fields available.
  • The default query refers to a @message field, but many logs don't have this field: use message instead.
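
The console steps above can also be scripted. The following is a minimal sketch, assuming `logs_client` is a `boto3.client("logs")` created with credentials for the right AWS account. The function name `run_insights_query` and its polling defaults are made up for this example.

```python
import time

# Simple query: note `message`, not `@message`
SIMPLE_QUERY = 'fields @timestamp, message | filter message like "upload"'


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return each result row as a dict.

    logs_client is assumed to be boto3.client("logs").
    start and end are epoch seconds.
    """
    started = logs_client.start_query(
        logGroupName=log_group,  # a log group must be selected, as in the console
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=started["queryId"])
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")


# Example usage (needs boto3 and AWS credentials; the log group name is a placeholder):
# import boto3
# now = time.time()
# rows = run_insights_query(boto3.client("logs"), "<log group name>",
#                           SIMPLE_QUERY, now - 3600, now)
```

Printing the keys of the first row is the scripted equivalent of expanding the first result to see which fields are available.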

Example queries

# logs containing 'pdfs' and 'upload' for a worker (selected as the log group)
fields @timestamp, message
| filter message like "upload" and message like "pdfs"

# compute stats for times reported in the logs (alternative to using Grafana)
fields @timestamp, message
| filter message like "tasks call zip_and_send_letter_pdfs took"
| parse message "tasks call zip_and_send_letter_pdfs took *" as timetaken
| stats avg(timetaken) as avgtime, sum(timetaken) as sumtime, count(message) as taskcount by bin(1d) as day
| sort taskcount desc
| limit 500
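
To make it clearer what the stats query above computes, here is a rough Python equivalent that works on log records you already have in memory. It is a sketch under assumptions: the sample messages are invented, and it assumes the message ends with a plain number of seconds after "took".

```python
import re
from collections import defaultdict
from datetime import datetime

# Mirrors: parse message "tasks call zip_and_send_letter_pdfs took *" as timetaken
PATTERN = re.compile(r"tasks call zip_and_send_letter_pdfs took ([0-9.]+)")


def daily_stats(records):
    """records is an iterable of (timestamp, message) pairs with datetime timestamps."""
    per_day = defaultdict(list)
    for timestamp, message in records:
        match = PATTERN.search(message)
        if match:  # mirrors the `filter message like ...` step
            per_day[timestamp.date()].append(float(match.group(1)))
    rows = [
        {
            "day": day,
            "avgtime": sum(times) / len(times),
            "sumtime": sum(times),
            "taskcount": len(times),
        }
        for day, times in per_day.items()
    ]
    # Mirrors: sort taskcount desc | limit 500
    return sorted(rows, key=lambda row: row["taskcount"], reverse=True)[:500]


# Hypothetical sample data, not real log output
sample = [
    (datetime(2021, 3, 15, 9, 0), "tasks call zip_and_send_letter_pdfs took 1.0"),
    (datetime(2021, 3, 15, 10, 0), "tasks call zip_and_send_letter_pdfs took 3.0"),
    (datetime(2021, 3, 16, 9, 0), "tasks call zip_and_send_letter_pdfs took 5.0"),
    (datetime(2021, 3, 16, 9, 5), "some unrelated message"),
]
rows = daily_stats(sample)
```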

AWS Athena

Athena is a query interface on top of structured log files stored in S3 buckets:

  • *-paas-cloudfront-proxy-logs - request logs for the Admin, API or Document Download apps
  • *-paas-forwarder-logs-bucket - legacy request logs (now going to -cloudfront- buckets)
  • notifications.service.gov.uk-elb-access-logs - legacy request logs (pre-CloudFront)
  • production-document-download-logs - S3 logs for Document Download (no table in Athena)

Note that, at the time of writing, API traffic is not yet going through CloudFront, so request logs are still going to the legacy bucket.

How to use it

  • Choose a database in the selector on the left to see which tables it contains.
  • Click the three dots to the right of a table name to preview the data.
  • Always use a LIMIT or WHERE clause. We are charged for the volume of data queried.

Alternatively, go to "Saved queries" to view and run any saved queries for the environment. Several saved queries are created by Terraform, for convenience.

If you want to know which table is associated with which bucket in S3, click on the three dots to the right of the table name, and choose Generate Create Table DDL to see how it's configured.

Example queries

# recent 5xx requests by IP address
SELECT request_ip, count(*) as count FROM "notifications_service_gov_uk"."cloudfront_www"
WHERE (status BETWEEN 500 AND 599) AND (date > CAST('2021-03-15' AS DATE))
GROUP BY request_ip
ORDER BY count DESC
LIMIT 10
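
The same query can be run from a script, which also makes it easy to check how much data a query scanned (the thing we are charged for). This is a minimal sketch, assuming `athena_client` is a `boto3.client("athena")`. The function names and the `output_location` value are placeholders made up for this example; use whatever results bucket is configured for the environment.

```python
import time
from datetime import date


def recent_5xx_sql(since, limit=10):
    """Build the 'recent 5xx requests by IP address' query for a given start date."""
    return (
        "SELECT request_ip, count(*) AS count "
        'FROM "notifications_service_gov_uk"."cloudfront_www" '
        f"WHERE (status BETWEEN 500 AND 599) AND (date > CAST('{since.isoformat()}' AS DATE)) "
        "GROUP BY request_ip ORDER BY count DESC "
        f"LIMIT {int(limit)}"
    )


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=1.0, max_polls=120):
    """Run a query and return (execution id, bytes scanned).

    athena_client is assumed to be boto3.client("athena").
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    execution_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        execution = athena_client.get_query_execution(
            QueryExecutionId=execution_id
        )["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return execution_id, scanned
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query ended in state {state}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not finish in time")


# Example usage (needs boto3 and AWS credentials; the S3 path is a placeholder):
# import boto3
# execution_id, scanned = run_athena_query(
#     boto3.client("athena"),
#     recent_5xx_sql(date(2021, 3, 15)),
#     "notifications_service_gov_uk",
#     "s3://<results bucket>/<prefix>/",
# )
# print(f"scanned {scanned / 1_000_000:.1f} MB")
```

Passing the date as a `date` object rather than a raw string keeps the generated SQL well formed.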