Configuration for Prometheus scraping, Grafana dashboards, and alerting rules - nwanoch/hng_boilerplate_python_fastapi_web GitHub Wiki

Prometheus Configuration

Now that Node Exporter is running, we need to configure Prometheus to scrape its metrics. Add the following job to your prometheus.yml file:

   scrape_configs:
     # The job label must match job="node_exporter" used in the alert queries below
     - job_name: 'node_exporter'
       static_configs:
         - targets: ['localhost:9100']

Restart Prometheus after making these changes:

   sudo systemctl restart prometheus
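
After the restart, it can help to confirm that Prometheus actually sees the Node Exporter target. The sketch below parses a response shaped like the one returned by the Prometheus HTTP API endpoint /api/v1/targets and reports the health of each instance of a job. The sample payload is hand-written for illustration (not real output), and the job name `node_exporter` is the job label the alert queries on this page filter on; use whatever job_name your prometheus.yml defines.

```python
import json


def target_health(targets_response: dict, job: str) -> dict:
    """Map each instance of `job` to its reported health string."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return {
        t["labels"]["instance"]: t["health"]
        for t in active
        if t.get("labels", {}).get("job") == job
    }


# Hand-written sample shaped like a /api/v1/targets response (illustrative only).
sample = json.loads("""
{"status": "success",
 "data": {"activeTargets": [
   {"labels": {"instance": "localhost:9100", "job": "node_exporter"},
    "health": "up"}
 ]}}
""")

print(target_health(sample, "node_exporter"))
```

To run this against a live server, load the JSON from http://localhost:9090/api/v1/targets instead of the sample, assuming Prometheus listens on its default port 9090.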

Grafana Configuration

One of the great advantages of using Node Exporter is the availability of pre-built dashboards. We will import a comprehensive dashboard for our Node Exporter metrics.

  1. Sign in to your Grafana URL, e.g., https://api-python.boilerplate.hng.tech:3001
  2. Click on Dashboards and create a new dashboard. Click "Import dashboard" and enter '1860' as the dashboard ID
  3. Select your Prometheus data source in the dropdown. Click "Import" to finalize.

You should now see a detailed dashboard with various panels showing CPU, memory, disk, and network metrics.

Setting Up Grafana Alerting

Now that we have our metrics visualized, let's set up some alerts to notify us when things go wrong.

Create a contact point and custom notification templates

Contact points in Grafana are the destinations where your alerts are sent. They can be various communication channels such as email, Slack, PagerDuty, or a webhook. Our communication channel is Slack, so we will create a Slack app and a webhook URL.

  1. Follow these steps here to create a Slack app bot and webhook URL
  2. Navigate to contact points, set the integration to Slack, and enter the Slack webhook URL
  3. In the optional Slack settings, set up a custom title and message

Custom title

The custom title template creates a concise, informative title for each alert:

{{ define "alerts.title" -}}
{{ if .Alerts.Firing -}}
{{ range .Alerts.Firing }}
Anchor-Python Alert: {{ .Labels.alertname }} - Severity: {{ index .Labels "severity" | toUpper }} 
{{ end }}
{{- end }}
{{ if .Alerts.Resolved -}}
{{ range .Alerts.Resolved }}
Anchor-Python Alert: {{ .Labels.alertname }} - Severity: {{ index .Labels "severity" | toUpper }} 
{{ end }}
{{- end }}
{{- end }}

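This template creates a separate title for each alert, includes the alert name and severity, and handles firing and resolved alerts in separate blocks. Note that both blocks render the same text, so the title alone does not show whether an alert is firing or resolved.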

Custom message

The message template provides more detailed information about each alert:

{{ define "alerts.message" -}}
{{ if .Alerts.Firing -}}
🚨 {{ len .Alerts.Firing }} Alert(s) Firing 🚨
{{ range .Alerts.Firing }}
---
🔔 *Alert:* {{ .Labels.alertname }}
📊 *Severity:* {{ index .Labels "severity" | toUpper }}
📝 *Summary:* {{ index .Annotations "summary" }}
{{- if index .Annotations "description" }}
🔍 *Description:* {{ index .Annotations "description" }}
{{- end }}
⏰ *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{- end }}
{{- if .Alerts.Resolved -}}
✅ {{ len .Alerts.Resolved }} Alert(s) Resolved
{{ range .Alerts.Resolved }}
---
🔔 *Alert:* {{ .Labels.alertname }}
📝 *Summary:* {{ index .Annotations "summary" }}
{{- if index .Annotations "description" }}
🔍 *Description:* {{ index .Annotations "description" }}
{{- end }}
⏰ *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- if .EndsAt.After .StartsAt }}
🏁 *Ended:* {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{ end }}
{{- end }}
{{- end }}

This custom message shows the total number of firing and resolved alerts and provides detailed information for each alert, including name, severity, summary, and timing. It uses emojis and formatting for readability and separates alerts with horizontal lines for clarity.

Use the "Test" button to send a sample notification.
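
If the test notification does not arrive, it may be worth checking the webhook itself, independent of Grafana. The following is a minimal sketch that builds the JSON POST request a Slack incoming webhook expects (a body with a `text` field). The webhook URL shown is a placeholder, not a real endpoint, and the helper name `build_slack_request` is made up for this example.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: substitute the webhook URL created for your Slack app.
req = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    "Test message from the monitoring setup",
)

# To actually send the message, uncomment the next line:
# urllib.request.urlopen(req, timeout=10)
```

If the message shows up in the channel when sent this way but Grafana's test does not, the problem is more likely in the contact point settings than in the Slack app.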

Creating an Alert Rule

Navigate to Alerting > Alert rules and click "New alert rule".

We set up 8 alert rules for different system metrics:

High CPU Usage

Query:

100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))

This query computes the CPU usage percentage: it takes the per-second rate of idle CPU time over the last 5 minutes, averages it across all cores, subtracts that average from 1, and multiplies by 100 to convert the result into a percentage.
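
As a worked example of the arithmetic in this query, the sketch below applies the same formula to made-up idle fractions for a hypothetical 4-core host. The numbers are illustrative, not measurements.

```python
def cpu_usage_percent(idle_rates):
    """idle_rates: per-core fraction of time spent idle (rate of mode="idle")."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 * (1 - avg_idle)


# Hypothetical per-core idle fractions over the last 5 minutes.
usage = cpu_usage_percent([0.9, 0.7, 0.8, 0.6])
print(f"{usage:.1f}%")  # average idle is 0.75, so usage is about 25%
```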

The reduce function captures the last value of the computed CPU usage. Threshold triggers an alert when the CPU usage exceeds 80%. Math triggers an additional alert when the CPU usage exceeds 90%.

We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.

{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}

The alert message contains a Severity: Critical when the CPU usage is above 90%, and Severity: Warning when it is above 80%
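
The branching in the severity label template can be mirrored in a few lines of Python, which makes the boundary behaviour easier to see: `gt` is a strict comparison, so a value of exactly 90 is still a Warning. This is a sketch of the logic only, not how Grafana evaluates the template; the function name and default thresholds are chosen for this example.

```python
def severity_label(value: float, warning: float = 80, critical: float = 90) -> str:
    """Return the severity the label template would select for `value`."""
    if value > critical:
        return "Critical"
    if value > warning:
        return "Warning"
    return "low"


for v in (95, 90, 85, 80, 10):
    print(v, severity_label(v))
```

The same function covers rules with other thresholds, for example `severity_label(3, warning=2, critical=5)` for the network error rule.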

High Memory Usage

Query:

(1 - (node_memory_MemAvailable_bytes{instance="localhost:9100", job="node_exporter"} / node_memory_MemTotal_bytes{instance="localhost:9100", job="node_exporter"})) * 100

This query calculates the percentage of used memory by determining the proportion of memory that is not available (i.e., used), and then converting this fraction into a percentage
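
A quick numeric check of this formula, using made-up values for a hypothetical host with 8 GiB of RAM of which 2 GiB is available:

```python
GIB = 1024 ** 3


def memory_used_percent(mem_available_bytes: int, mem_total_bytes: int) -> float:
    """Same arithmetic as the query: (1 - available / total) * 100."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


print(memory_used_percent(2 * GIB, 8 * GIB))  # 75.0
```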

The reduce function captures the last value of the computed memory usage. Threshold triggers an alert when the memory usage exceeds 80%. Math triggers an additional alert when the memory usage exceeds 90%.

We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.

{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}

The alert message contains a Severity: Critical when the memory usage is above 90%, and Severity: Warning when it is above 80%.

Disk Usage

Query:

100 - ((node_filesystem_avail_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"})

This query calculates the percentage of used space on the filesystem mounted at "/", excluding filesystems of type "rootfs". It determines the proportion of the filesystem that is occupied by subtracting the available space percentage from 100.
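
The sketch below runs the same arithmetic on made-up numbers for a hypothetical 120 GB root filesystem with 18 GB available. Keep in mind that node_filesystem_avail_bytes reports the space available to non-root users, so the result can be slightly higher than a figure based on free blocks alone.

```python
def disk_used_percent(avail_bytes: int, size_bytes: int) -> float:
    """Same arithmetic as the query: 100 - (avail * 100 / size)."""
    return 100 - (avail_bytes * 100) / size_bytes


# Hypothetical values: 18 GB available on a 120 GB filesystem.
print(disk_used_percent(18 * 10**9, 120 * 10**9))  # 85.0
```

With the thresholds used on this page, 85% would raise the Warning alert but not the Critical one.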

The reduce function captures the last value of the computed disk usage. Threshold triggers an alert when the disk usage exceeds 80%. Math triggers an additional alert when the disk usage exceeds 90%.

We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.

{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}

The alert message contains a Severity: Critical when the disk usage is above 90%, and Severity: Warning when it is above 80%.

Network traffic

Query:

irate(node_network_transmit_bytes_total{instance="localhost:9100",job="node_exporter"}[5m])*8

This query calculates the near-instantaneous rate of network data transmission in bits per second. irate uses only the last two samples inside the 5-minute window, and multiplying by 8 converts bytes to bits. The result shows the network throughput of the specified instance, which helps with monitoring network performance and detecting issues such as network congestion.
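
To illustrate how irate differs from an average over the whole window, here is a small sketch that computes the rate from only the last two samples of a byte counter and converts it to bits per second. The samples are invented, and the sketch ignores counter resets, which Prometheus handles separately.

```python
def irate_bits_per_second(samples):
    """samples: (timestamp_seconds, counter_bytes) pairs, oldest first."""
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    # Counter resets are not handled in this simplified sketch.
    return (v2 - v1) / (t2 - t1) * 8


# Hypothetical scrapes 15 seconds apart.
samples = [(0, 1_000_000), (15, 1_150_000), (30, 1_900_000)]
print(irate_bits_per_second(samples))  # 400000.0
```

Only the last two scrapes contribute: 750,000 bytes in 15 seconds is 50,000 bytes per second, or 400,000 bits per second. The slower first interval has no effect on the result, which is why irate reacts quickly to spikes.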

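The reduce function captures the last value of the computed network throughput. Threshold triggers an alert when the transmitted data is over 100MBi/s. Because the query returns bits per second, the threshold value must be entered in bits per second as well.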

Network error

Query:

increase(node_network_transmit_errs_total[1h]) + increase(node_network_receive_errs_total[1h])

This query calculates the total number of network transmission and reception errors over the past hour. It provides a comprehensive view of network reliability and helps in identifying and diagnosing network issues.

The reduce function captures the last value of the computed network error count. Threshold triggers an alert when the number of network errors is above 2. Math triggers an additional alert when the number of network errors is above 5.

We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.

{{ if (gt $values.B.Value 5) -}}
Critical
{{ else if (gt $values.B.Value 2) -}}
Warning
{{ else -}}
low
{{- end }}

The alert message contains a Severity: Critical when the network error count is above 5, and Severity: Warning when it is above 2.

System load

Query:

scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 /count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))

This query expresses the 1-minute load average as a percentage of the number of CPU cores. It indicates how heavily the CPUs are loaded relative to their total capacity; a value above 100 means more work is queued than the cores can run at once.
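
The calculation is simple enough to check by hand. The sketch below uses made-up values for a hypothetical 4-core host.

```python
def load_percent(load1: float, cpu_count: int) -> float:
    """Same arithmetic as the query: load1 * 100 / number of cores."""
    return load1 * 100 / cpu_count


print(load_percent(3.0, 4))  # 75.0
print(load_percent(6.0, 4))  # 150.0, i.e. more runnable work than cores
```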

The reduce function captures the last value of the computed system load. Threshold triggers an alert when the system load is above 80. Math triggers an additional alert when the system load is above 90.

We also set up custom labels for severity that are sent out when the Threshold alert condition is triggered and when the Math alert condition is triggered.

{{ if (gt $values.B.Value 90) -}}
Critical
{{ else if (gt $values.B.Value 80) -}}
Warning
{{ else -}}
low
{{- end }}

The alert message contains a Severity: Critical when the system load is above 90, and Severity: Warning when it is above 80