Ticket 270 ‐ Implement Prometheus and Grafana Monitoring System - SupaHotBall/OE2-Group-D GitHub Wiki

Task

  • Prometheus Setup
    • Download and install Prometheus v2.53.4
    • Configure prometheus.yml with proper scrape intervals
    • Create systemd service for Prometheus
    • Verify web interface at http://:9090
  • Node Exporter Configuration
    • Install node_exporter v1.9.1 on all monitored servers
    • Set up systemd service for node_exporter
    • Add node_exporter targets to Prometheus config
  • Alert Rules Implementation
    • Create two rule files (rule1.yml, rule2.yml) with:
      • Instance down detection
      • High CPU/Memory usage alerts
      • Disk space warnings
    • Validate rules using promtool
    • Configure in prometheus.yml
  • Grafana Installation
    • Install and configure Grafana
    • Add Prometheus as data source
    • Import Node Exporter dashboard (ID: 1860)
  • Verification
    • Confirm all targets are UP in Prometheus
    • Validate alert rules are evaluating correctly
    • Check Grafana dashboards display metrics properly

Steps Taken

Download Prometheus onto the backup server

wget https://github.com/prometheus/prometheus/releases/download/v2.53.4/prometheus-2.53.4.linux-amd64.tar.gz

image

Unpack the tar file and move into the extracted directory

tar -xvf prometheus-2.53.4.linux-amd64.tar.gz

image

cd prometheus-2.53.4.linux-amd64

image

Start Prometheus

sudo ./prometheus --config.file=prometheus.yml

Access the Prometheus UI at http://13.75.179.26:9090/

image

Create a systemd service for Prometheus. First, copy the Prometheus files to /usr/local/bin/prometheus

sudo cp -r . /usr/local/bin/prometheus

image

Then create a service file at /etc/systemd/system/prometheus.service

sudo nano /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus/prometheus \
    --config.file=/usr/local/bin/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/data \
    --web.console.templates=/usr/local/bin/prometheus/consoles
# RestartSec only takes effect when a Restart= policy is set
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

image

sudo mkdir -p /var/lib/prometheus/data

image

Create a user and group for Prometheus

sudo useradd --no-create-home --shell /bin/false prometheus

Ensure that appropriate ownerships are set

sudo chown -R prometheus:prometheus /var/lib/prometheus

image

Reload systemd, start Prometheus and check its status

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl status prometheus

image

Visit http://13.75.179.26:9090/targets to view the Prometheus targets page. At this point it should show one target, which is Prometheus scraping itself

image
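The target state shown on that page can also be read from the Prometheus HTTP API, which is handy for a quick check over SSH. The following is a minimal sketch, assuming Prometheus is reachable on localhost:9090 from where the script runs; the PROM_URL variable and the count_up_targets helper are names made up for this example, and the helper relies on the API returning compact JSON.

```shell
# Count targets reporting "health":"up" in a /api/v1/targets response
# read from stdin. Hypothetical helper, not part of Prometheus.
count_up_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# PROM_URL is an assumption; point it at the server running Prometheus.
PROM_URL="${PROM_URL:-http://localhost:9090}"

if response=$(curl -fsS --max-time 5 "$PROM_URL/api/v1/targets" 2>/dev/null); then
  echo "targets up: $(printf '%s' "$response" | count_up_targets)"
else
  echo "Prometheus not reachable at $PROM_URL"
fi
```

Counting with grep keeps the sketch free of extra dependencies; a JSON tool would be a more robust choice if one is installed.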

Enter a metric name in the text box to view a summary of the collected data for the selected metric

promhttp_metric_handler_requests_total

image

Set up Node Exporter by first downloading it onto the backup server

wget https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz

image

Unpack the tarball

tar -xzf node_exporter-1.9.1.linux-amd64.tar.gz

image

Change into the unpacked directory and run ./node_exporter to confirm that the exporter starts

image

Copy the node_exporter executable into /usr/local/bin/

sudo cp node_exporter /usr/local/bin/

Create a systemd service file to manage the exporter service

sudo nano /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=nodeuser
Group=nodeuser
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

Create the nodeuser account without a login shell

sudo useradd --no-create-home --shell /usr/sbin/nologin nodeuser

Set the ownership of the binary

sudo chown nodeuser:nodeuser /usr/local/bin/node_exporter

Reload systemd and start node_exporter

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl status node_exporter

image

Confirm that the node_exporter service is running by visiting http://13.75.179.26:9100/

image
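The exporter can also be checked from the command line by fetching its /metrics endpoint and looking for node_* samples. This is a small sketch, assuming it runs on the same server as node_exporter; NODE_URL and has_node_metrics are illustrative names, not part of node_exporter.

```shell
# Succeeds if the metrics text on stdin contains at least one line
# that starts with "node_". Hypothetical helper for this example.
has_node_metrics() {
  grep -q '^node_'
}

# NODE_URL is an assumption; 9100 is the port used in this wiki page.
NODE_URL="${NODE_URL:-http://localhost:9100}"

if curl -fsS --max-time 5 "$NODE_URL/metrics" 2>/dev/null | has_node_metrics; then
  echo "node_exporter is serving node_* metrics at $NODE_URL"
else
  echo "no node_* metrics received from $NODE_URL"
fi
```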

Configure the new target in the prometheus.yml file, which can be found at /usr/local/bin/prometheus/prometheus.yml. Add the following code under the scrape_configs section

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

image

Restart Prometheus after saving the configuration

sudo systemctl restart prometheus

image

Check that the node-exporter target appears in the targets list on the Prometheus page

image

Add rule groups to Prometheus by creating two new files (rule1.yml, rule2.yml) in the same directory as prometheus.yml:

image

rule1.yml:

groups:
  - name: record-rules
    interval: 30s
    rules:
      - record: node_memory_MemFree_in_percent
        expr: 100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes
        labels:
          team: backend

  - name: alert-rules
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Instance [{{ $labels.instance }}] down"
          description: "[{{ $labels.instance }}] of [{{ $labels.job }}] has been down for more than 1 minute."

rule2.yml:

groups:
  - name: example-rules
    interval: 30s
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
        labels:
          team: backend

      - alert: HighCPULoad
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 10 minutes."

Check that both rule files are correctly defined using promtool

sudo ./promtool check rules rule1.yml rule2.yml

image
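The task list also calls for memory and disk space alerts, which rule1.yml and rule2.yml do not define. The sketch below shows one way such rules could look, written to a separate file and validated with promtool when it is on the PATH. The file name rule3.yml, the RULES_DIR variable, the group name, the 90% and 10% thresholds, and the fstype filter are all assumptions for illustration; the metric names are standard node_exporter metrics.

```shell
# Hypothetical location and file name; this wiki keeps its rule files
# next to /usr/local/bin/prometheus/prometheus.yml.
RULES_DIR="${RULES_DIR:-$(mktemp -d)}"
RULE_FILE="$RULES_DIR/rule3.yml"

# The quoted 'EOF' stops the shell from expanding {{ $labels... }}.
cat > "$RULE_FILE" <<'EOF'
groups:
  - name: resource-alerts
    interval: 30s
    rules:
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for more than 10 minutes."

      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"
          description: "Less than 10% of the filesystem is available."
EOF

echo "wrote $RULE_FILE"

# Validate only when promtool is installed on this machine.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULE_FILE"
fi
```

A file like this would still need to be listed under rule_files in prometheus.yml before Prometheus evaluates it.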

Add the rules to the prometheus.yml file:

image
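Since the resulting configuration is only visible in the screenshot, here is a rough sketch of what a prometheus.yml with both rule files and both scrape jobs might look like. It is written to a scratch directory rather than over the live file. The CONF_DIR variable is hypothetical, the 15s intervals are the values shipped in the default prometheus.yml, and the exact layout of the real file on the server is an assumption.

```shell
# Hypothetical scratch directory; the live file in this wiki is
# /usr/local/bin/prometheus/prometheus.yml.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/prometheus.yml" <<'EOF'
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated

# Relative paths are resolved against the directory of this file.
rule_files:
  - "rule1.yml"
  - "rule2.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["localhost:9100"]
EOF

cat "$CONF_DIR/prometheus.yml"
```

Once the rule files sit beside it, running promtool check config against the file validates the configuration together with the rules it references.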

Reload the daemon and restart Prometheus

sudo systemctl daemon-reload
sudo systemctl restart prometheus

image

Install Grafana, starting with its dependencies

# Install dependencies
sudo apt-get install -y apt-transport-https software-properties-common
# Add Grafana GPG key (apt-key is deprecated on newer Debian/Ubuntu releases)
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
# Add Grafana APT repository (only tee needs root, not echo)
echo "deb https://packages.grafana.com/oss/deb stable main" \
| sudo tee -a /etc/apt/sources.list.d/grafana.list
# Update package list
sudo apt-get update
# Install Grafana
sudo apt-get install -y grafana
sudo systemctl start grafana-server

image

Start and enable Grafana

sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server

Ensure that the firewall allows incoming TCP traffic on port 3000, which is Grafana's default port

sudo ufw allow 3000/tcp

image

Grafana is available at http://13.75.179.26:3000/login. The default username and password are both admin. I changed the password to the one our servers use.

image

image

Next, go to Connections > Data sources > Add data source and select Prometheus. Enter the URL of the Prometheus server, including port 9090. Click Save & test to verify the connection.

image

image
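As an alternative to clicking through the UI, Grafana can load data sources from provisioning files at startup. The sketch below writes such a file to a scratch directory. It assumes Grafana and Prometheus run on the same server so that http://localhost:9090 is reachable; PROV_DIR and the file name prometheus.yaml are hypothetical. This is not the method used in this ticket.

```shell
# Hypothetical scratch directory; on a package install Grafana reads
# provisioning files from /etc/grafana/provisioning/datasources/.
PROV_DIR="${PROV_DIR:-$(mktemp -d)}"

cat > "$PROV_DIR/prometheus.yaml" <<'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Assumes Prometheus runs on the same host as Grafana.
    url: http://localhost:9090
    isDefault: true
EOF

echo "wrote $PROV_DIR/prometheus.yaml"
```

After copying the file into the provisioning directory, restarting grafana-server should make the data source appear without any manual steps.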

Import the Node Exporter dashboard by clicking the + icon and choosing to import a dashboard. Then enter the dashboard ID 1860 and click "Load"

image

Select Prometheus as the data source and click on Import

image

Verify that the dashboard displays metrics and that the node exporter targets are up

image

image


Troubleshooting

image

The "did not find expected key" error is usually caused by incorrect YAML formatting. In this case the yml file was not indented correctly. After fixing the indentation and re-running the check command, the rules were found.

image
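Indentation slips like this can be hard to spot by eye. The sketch below is a rough heuristic, not a YAML parser: it flags lines indented with a tab or with an odd number of leading spaces. It assumes the file uses two-space indentation throughout, as the rule files on this page do. The function name find_suspect_indent and the sample file are made up for illustration.

```shell
# Print (with line numbers) lines whose indentation contains a tab or
# an odd number of leading spaces. Heuristic only; assumes 2-space YAML.
find_suspect_indent() {
  grep -nE "^ *$(printf '\t')|^(  )* [^ ]" "$1"
}

# Sample rule file with a deliberately misaligned "expr" key (9 spaces).
sample=$(mktemp)
cat > "$sample" <<'EOF'
groups:
  - name: alert-rules
    rules:
      - alert: InstanceDown
         expr: up == 0
EOF

# Flags line 5 of the sample.
find_suspect_indent "$sample"
```

A clean result from this check does not prove the file is valid, so promtool check rules remains the authoritative test.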


External Resources

N/A


Ticket Reference

https://rt.dataraster.com/Ticket/Display.html?id=270

