Monitoring and Health - ericfitz/tmi GitHub Wiki
Monitoring and Health
This guide covers health checks, metrics collection, log aggregation, alerting, and performance and security monitoring for TMI deployments.
Overview
Effective monitoring is critical for maintaining TMI's availability, performance, and security. This guide provides practical procedures for:
- Health checks and availability monitoring
- Metrics collection and visualization
- Log aggregation and analysis
- Alerting configuration
- Performance monitoring
- Security event monitoring
Quick Health Checks
Server Health
# Basic health check (the root endpoint returns API info with health status)
curl https://tmi.example.com/
# Expected response structure:
{
"status": {
"code": "ok", # "ok", "degraded", or "error"
"time": "2025-01-24T..."
},
"service": {
"name": "TMI",
"build": "1.3.2-abc1234" # format: version[-prerelease][+commit]
},
"api": {
"version": "1.4.0", # from OpenAPI spec, follows semver
"specification": "https://github.com/ericfitz/tmi/blob/main/api-schema/tmi-openapi.json"
},
"operator": { # optional, present only if configured
"name": "Acme Corp",
"contact": "[email protected]"
}
}
# When status is "degraded", the response includes health details:
# "health": {
# "database": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 3, "message": "..." },
# "redis": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 1, "message": "..." }
# }
# Check OAuth providers
curl https://tmi.example.com/oauth2/providers
Database Health
# PostgreSQL connection test
psql -h postgres-host -U tmi_user -d tmi -c "SELECT 1"
# Check database size
psql -h postgres-host -U tmi_user -d tmi -c "
SELECT pg_size_pretty(pg_database_size('tmi'))"
# Check table row counts
psql -h postgres-host -U tmi_user -d tmi -c "
SELECT schemaname, tablename, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC"
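A single failed probe does not always mean the database is down; transient network errors make one-shot checks flap. A small retry wrapper (a sketch; the psql connection test above can stand in for the probe command) declares failure only after repeated attempts:

```shell
# retry_check: run a probe command with retries before declaring failure.
# Usage: retry_check <attempts> <delay_seconds> <command...>
retry_check() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0          # probe succeeded
    fi
    sleep "$delay"      # wait before the next attempt
    i=$((i + 1))
  done
  return 1              # every attempt failed
}

# Example: retry the PostgreSQL connectivity test three times, 5s apart
# retry_check 3 5 psql -h postgres-host -U tmi_user -d tmi -c "SELECT 1"
```

The same wrapper works for the Redis ping and HTTP health checks in this guide.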
Redis Health
# Connection test
redis-cli -h redis-host -p 6379 -a password ping
# Expected: PONG
# Check memory usage
redis-cli -h redis-host -a password info memory | grep used_memory_human
# Check key count
redis-cli -h redis-host -a password DBSIZE
# Check cache hit rate
redis-cli -h redis-host -a password info stats | grep keyspace_hits
Monitoring Architecture
Observability Stack
[TMI Application] --> [Metrics Collection] --> [Time Series DB]
                  --> [Log Aggregation]    --> [Log Storage]
                  --> [Health Checks]      --> [Alerting System]
Key Components
| Component | Purpose | Recommended Tool |
|---|---|---|
| Metrics Collection | Application and system metrics | Prometheus |
| Log Aggregation | Centralized logging | Grafana Alloy/Loki (recommended), ELK Stack (alternative) |
| Health Monitoring | Service availability and performance | Built-in health endpoint |
| Alerting | Proactive issue notification | Prometheus AlertManager |
| Dashboards | Visualization | Grafana |
Note: Promtail reaches End-of-Life on March 2, 2026. Use Grafana Alloy for new deployments. See TMI-Promtail-Logger for the legacy Promtail setup reference.
Metrics Collection
Application Metrics
TMI tracks performance metrics internally through api/performance_monitor.go.
HTTP Metrics
Key metrics to track:
- Request rates (requests per second)
- Response times (P50, P95, P99 percentiles)
- Error rates (4xx and 5xx responses)
- Request and response sizes
- Concurrent request count
WebSocket Metrics
Track real-time collaboration health:
- Active WebSocket connections
- Connection establishment rate
- Message throughput (messages per second)
- Connection duration
- WebSocket errors and disconnections
For details on the WebSocket protocol, see WebSocket-API-Reference.
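Absent a metrics exporter, connection churn can be approximated from the structured logs. The sketch below uses hypothetical event names (websocket_connected, websocket_disconnected) and inline sample lines; substitute the real log file and whatever messages TMI emits at your log level:

```shell
# Approximate current active WebSocket connections from structured logs:
# connects minus disconnects. Event names are illustrative, not TMI's
# actual log messages; the heredoc stands in for /var/log/tmi/tmi.log.
active=$(awk '
/websocket_connected/    { opened++ }
/websocket_disconnected/ { closed++ }
END { print opened - closed }
' <<'EOF'
{"level":"info","msg":"websocket_connected","session":"a"}
{"level":"info","msg":"websocket_connected","session":"b"}
{"level":"info","msg":"websocket_disconnected","session":"a"}
EOF
)
echo "active websocket connections: $active"
```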
Business Metrics
Monitor feature usage:
- User activity (daily and monthly active users)
- Threat model creation rate
- Diagram creation and editing activity
- Collaboration session counts
- API client integration health
Database Metrics
PostgreSQL Monitoring
-- Active connections
SELECT count(*) FROM pg_stat_activity;
-- Long-running queries (over 5 minutes)
SELECT
pid,
now() - query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 minutes';
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Index usage statistics
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Database size and growth
SELECT
pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;
For more on database administration, see Database-Operations.
Redis Monitoring
# Memory usage and stats
redis-cli -h redis-host -a password info memory
# Key distribution by pattern
redis-cli -h redis-host -a password --scan --pattern "cache:*" | wc -l
redis-cli -h redis-host -a password --scan --pattern "session:*" | wc -l
# Cache hit rate
redis-cli -h redis-host -a password info stats | grep -E "keyspace_hits|keyspace_misses"
# Slow queries
redis-cli -h redis-host -a password slowlog get 10
# Client connections
redis-cli -h redis-host -a password client list
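One INFO field worth watching specifically is mem_fragmentation_ratio: values well above 1.5 indicate memory fragmentation, while values below 1.0 suggest Redis memory is being swapped. A parsing sketch (the inline INFO lines are a sample; in practice, pipe in real output from redis-cli info memory):

```shell
# Extract the memory fragmentation ratio from Redis INFO output.
# The heredoc stands in for: redis-cli -h redis-host -a password info memory
ratio=$(awk -F: '/^mem_fragmentation_ratio/ { print $2+0 }' <<'EOF'
used_memory:1048576
used_memory_human:1.00M
mem_fragmentation_ratio:1.08
EOF
)
echo "fragmentation ratio: $ratio"
```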
System Metrics
Monitor the following infrastructure resources:
| Metric | What to Watch |
|---|---|
| CPU Utilization | Overall and per-core usage |
| Memory Usage | Application memory, available memory, swap usage |
| Disk I/O | Read/write operations, disk latency |
| Network | Bandwidth utilization, connection counts |
| File Descriptors | Open file descriptor count |
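File descriptor exhaustion is easy to miss until connections start failing. On Linux, a process's open descriptors can be compared against its limit via /proc (demonstrated here against the current shell; substitute the TMI server's PID):

```shell
# Compare open file descriptors against the process limit (Linux only).
# Uses the current shell's PID for demonstration; in production, use
# the TMI server process ID instead.
pid=$$
open_fds=$(ls /proc/$pid/fd | wc -l)
limit=$(ulimit -n)
echo "open fds: $open_fds / $limit"
```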
Log Aggregation
Structured Logging
TMI uses structured JSON logging through the internal/slogging package.
Log Configuration
# Configuration options from internal/config/config.go (LoggingConfig struct)
logging:
level: "info" # debug, info, warn, error
is_dev: true # Development mode flag (default: true)
is_test: false # Test mode flag (default: false)
log_dir: "logs" # Default: "logs"
max_age_days: 7 # Log retention (default: 7)
max_size_mb: 100 # Max file size (default: 100)
max_backups: 10 # Number of rotated files (default: 10)
also_log_to_console: true # Dual logging (default: true)
log_api_requests: false # Request logging
log_api_responses: false # Response logging
log_websocket_messages: false # WebSocket message logging
redact_auth_tokens: false # Security redaction of auth tokens
suppress_unauthenticated_logs: true # Suppress logs for unauthenticated requests (default: true)
For the complete set of configuration options, see Configuration-Reference.
Log Categories
| Category | Description |
|---|---|
| Application Logs | Business logic events |
| Access Logs | HTTP request and response records |
| Security Logs | Authentication and authorization events |
| Error Logs | Exceptions and error conditions |
| Performance Logs | Request timing and resource usage |
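For quick triage before reaching for a log aggregation UI, per-level counts can be tallied straight from the JSON log files. A sketch (the inline lines are samples standing in for /var/log/tmi/tmi.log):

```shell
# Tally structured log lines by level for quick triage.
# The heredoc stands in for the real log file.
tally=$(awk -F'"level":"' 'NF > 1 { split($2, a, "\""); count[a[1]]++ }
END { for (lvl in count) print lvl, count[lvl] }' <<'EOF'
{"level":"info","msg":"request completed"}
{"level":"error","msg":"db timeout"}
{"level":"info","msg":"request completed"}
EOF
)
echo "$tally"
```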
Promtail Log Collection (Legacy)
TMI documentation includes a containerized Promtail setup for shipping logs to Grafana Cloud or Loki.
Important: Promtail reaches End-of-Life on March 2, 2026. Grafana Alloy is the recommended replacement for new deployments. The Promtail container setup documentation is retained for reference at
docs/migrated/developer/setup/promtail-container.md. See also TMI-Promtail-Logger.
Starting Promtail
The Promtail Make targets (build-promtail, start-promtail) are no longer included in the project Makefile. To run Promtail manually with Docker:
# Build and run the Promtail container manually
# (see docs/migrated/developer/setup/promtail-container.md)
# Or run with explicit credentials:
LOKI_URL="https://user:[email protected]/api/prom/push" docker run ...
# Check Promtail status
docker logs promtail
Promtail Configuration
Promtail monitors these log locations:
- Development: ./logs/tmi.log, ./logs/server.log
- Production: /var/log/tmi/tmi.log
Configuration details are documented in docs/migrated/developer/setup/promtail-container.md.
Verifying Log Collection
# Confirm Promtail is collecting logs
docker logs promtail 2>&1 | grep "Adding target"
# Expected output:
# level=info msg="Adding target" key="/logs/tmi.log:..."
# level=info msg="Adding target" key="/var/log/tmi/tmi.log:..."
# Check for errors
docker logs promtail 2>&1 | grep -i error
ELK Stack Integration
You can use Elasticsearch, Logstash, and Kibana as an alternative log aggregation stack.
Logstash Configuration
# logstash.conf
input {
file {
path => "/var/log/tmi/*.log"
start_position => "beginning"
codec => json
}
}
filter {
if [level] == "error" {
mutate {
add_tag => ["error"]
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "tmi-logs-%{+YYYY.MM.dd}"
}
}
Querying Logs in Kibana
Common queries:
# All errors
level: "error"
# Authentication failures
level: "error" AND message: "authentication"
# Slow requests (over 2 seconds)
duration_ms > 2000
# Specific user activity
user_email: "[email protected]"
# WebSocket events
message: "websocket"
Health Checks
Service Health Endpoints
API Health Check
The root endpoint (/) provides comprehensive health information:
# The root endpoint returns API info with health status
curl https://tmi.example.com/
# Response structure:
{
"status": {
"code": "ok", # "ok" when healthy, "degraded" when issues detected, "error" on critical failure
"time": "2025-01-24T12:00:00Z"
},
"service": {
"name": "TMI",
"build": "1.3.2-abc1234" # format: version[-prerelease][+commit]
},
"api": {
"version": "1.4.0", # from OpenAPI spec, follows semver
"specification": "https://github.com/ericfitz/tmi/blob/main/api-schema/tmi-openapi.json"
},
"operator": { # optional, present only if configured by operator
"name": "...",
"contact": "..."
}
}
# When "degraded", health details are included:
# "health": {
# "database": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 3, "message": "..." },
# "redis": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 1, "message": "..." }
# }
# OAuth provider check
curl https://tmi.example.com/oauth2/providers
# Response lists the enabled providers
Automated Health Monitoring
Create a health check script:
#!/bin/bash
# health-check.sh
HEALTH_URL="https://tmi.example.com/"
LOG_FILE="/var/log/tmi/health-check.log"
RESPONSE=$(curl -s "$HEALTH_URL")
STATUS=$(echo "$RESPONSE" | jq -r '.status.code')
if [ "$STATUS" = "ok" ]; then
echo "$(date): TMI server is healthy" >> "$LOG_FILE"
exit 0
else
echo "$(date): TMI server status: $STATUS" >> "$LOG_FILE"
# Send alert
curl -X POST https://alerts.example.com/webhook \
-d '{"service": "tmi", "status": "'"$STATUS"'"}'
exit 1
fi
Schedule the script with cron:
# Check every 5 minutes
*/5 * * * * /usr/local/bin/health-check.sh
Kubernetes Probes
If you are running TMI on Kubernetes, configure liveness and readiness probes:
livenessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 2
For container deployment details, see OCI-Container-Deployment.
Alerting Configuration
Alert Categories
Critical Alerts (Immediate Response)
| Alert | Trigger Condition |
|---|---|
| Service Down | TMI server unavailable |
| Database Failure | PostgreSQL connection failures |
| Authentication Outage | OAuth provider failures |
| High Error Rate | >5% error rate sustained for 5+ minutes |
| Resource Exhaustion | >90% CPU or memory usage |
Warning Alerts (Monitored Response)
| Alert | Trigger Condition |
|---|---|
| Performance Degradation | Response times >2x baseline |
| Cache Issues | Redis connection problems or high miss rate |
| Storage Issues | Disk usage >80% |
| Backup Failures | Database backup failures |
| Integration Issues | Client integration problems |
Info Alerts (Awareness Only)
- Capacity planning: resource usage trends
- Performance trends: gradual performance changes
- Usage patterns: user activity changes
- Security events: unusual authentication patterns
- Maintenance reminders: certificate renewal, updates
Alert Examples
Service Unavailable
# Prometheus AlertManager example
- alert: TMIServiceDown
expr: up{job="tmi-server"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "TMI service is down"
description: "TMI server has been down for more than 2 minutes"
High Error Rate
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
Database Connection Failure
- alert: DatabaseConnectionFailure
expr: postgresql_up{job="postgres"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection failure"
description: "Cannot connect to PostgreSQL database"
Notification Channels
Configure multiple notification channels for comprehensive coverage:
| Channel | Use Case |
|---|---|
| Email | Critical alerts |
| Slack / Teams | Team notifications |
| PagerDuty / OpsGenie | On-call escalation |
| Webhooks | Custom integrations (see Webhook-Integration) |
Performance Monitoring
Application Performance
Response Time Monitoring
Track response time percentiles against the following targets:
| Percentile | Target |
|---|---|
| P50 (Median) | <100ms |
| P95 | <500ms |
| P99 | <1000ms |
| P99.9 | Track for outlier detection |
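TMI does not expose a Prometheus /metrics endpoint natively, so latency histograms must come from a reverse proxy or sidecar exporter. Assuming such an exporter publishes a histogram named http_request_duration_seconds_bucket (an assumption, not a metric TMI emits), the P95 target above could be enforced with a rule like this sketch:

```yaml
# Sketch only: http_request_duration_seconds_bucket is assumed to be
# exported by a proxy or sidecar, not by TMI itself
- alert: HighP95Latency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency above the 500ms target"
    description: "P95 latency is {{ $value }}s over the last 5 minutes"
```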
Throughput Monitoring
Monitor requests per second:
- Baseline throughput under normal load
- Peak throughput capacity
- Sustained throughput over time
For additional performance tuning guidance, see Performance-and-Scaling.
Resource Usage
Track application resource consumption:
# For a systemd service
systemctl status tmi
# For a containerized deployment
docker stats tmi-server
# For Kubernetes
kubectl top pod -n tmi
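For a threshold check that works anywhere Linux's /proc is available, available system memory can be computed directly from /proc/meminfo and compared against the >90% usage alert threshold:

```shell
# Percentage of system memory still available (Linux /proc/meminfo).
mem_pct=$(awk '
/^MemTotal:/     { total = $2 }
/^MemAvailable:/ { avail = $2 }
END { printf "%.0f", (avail / total) * 100 }
' /proc/meminfo)
echo "memory available: ${mem_pct}%"
```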
Database Performance
Query Performance Analysis
-- Enable the pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slow queries (over 100ms average)
-- Note: column names shown are for PostgreSQL 13+;
-- on PostgreSQL 12 and earlier use mean_time / total_time instead
SELECT
query,
mean_exec_time,
calls,
total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 20;
-- Query performance by table
SELECT
schemaname,
tablename,
seq_scan,
idx_scan,
n_tup_ins,
n_tup_upd,
n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
Connection Pool Monitoring
Monitor database connection usage:
-- Active connections by state
SELECT
state,
count(*)
FROM pg_stat_activity
GROUP BY state;
-- Long-running transactions (over 1 minute)
SELECT
pid,
now() - xact_start AS duration,
state,
query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 minute'
ORDER BY duration DESC;
Cache Performance
Redis Performance Metrics
# Cache hit rate calculation
redis-cli -h redis-host -a password info stats | \
awk -F: '/^keyspace_hits|^keyspace_misses/ {
if ($1 ~ /hits/) hits=$2+0;
else misses=$2+0
}
END {
total=hits+misses;
if (total > 0)
printf "Hit Rate: %.2f%%\n", (hits/total)*100;
else
print "Hit Rate: n/a (no keyspace activity)"
}'
# Monitor cache latency
redis-cli -h redis-host -a password --latency-history
# Check slow commands
redis-cli -h redis-host -a password slowlog get 10
Security Monitoring
Authentication Events
Monitor authentication and authorization activity:
# View authentication logs
tail -f /var/log/tmi/tmi.log | grep -E "authentication|authorization"
# Count failed login attempts
grep "authentication failed" /var/log/tmi/tmi.log | wc -l
# Identify suspicious activity (group failures by source)
grep "authentication failed" /var/log/tmi/tmi.log | \
awk '{print $NF}' | sort | uniq -c | sort -rn
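The grouping command above can be extended into a simple burst detector that flags sources exceeding the "more than 5 failed attempts" threshold used in the security alerts below. The log line layout (source as the last whitespace-delimited field) is an assumption carried over from the grouping command; the inline lines are samples:

```shell
# Flag sources with more than 5 failed authentication attempts.
# Assumes the source is the last field on each line; the heredoc
# stands in for the real log file.
offenders=$(awk '/authentication failed/ { count[$NF]++ }
END { for (src in count) if (count[src] > 5) print src, count[src] }' <<'EOF'
2025-01-24T10:00:01Z authentication failed 203.0.113.9
2025-01-24T10:00:02Z authentication failed 203.0.113.9
2025-01-24T10:00:03Z authentication failed 203.0.113.9
2025-01-24T10:00:04Z authentication failed 203.0.113.9
2025-01-24T10:00:05Z authentication failed 203.0.113.9
2025-01-24T10:00:06Z authentication failed 203.0.113.9
2025-01-24T10:01:00Z authentication failed 198.51.100.4
EOF
)
echo "offenders: $offenders"
```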
For complete security monitoring procedures, see Security-Operations.
Security Alerts
Set up alerts for the following security events:
- Failed authentication attempts (more than 5 in 5 minutes)
- Unauthorized access attempts
- Suspicious API usage patterns
- Certificate expiration warnings
- Unusual data access patterns
See also Security-Best-Practices for hardening recommendations.
Dashboards
Grafana Dashboard Examples
System Overview Dashboard
Include the following panels:
- Service uptime percentage
- Request rate (requests per second)
- Response time percentiles
- Error rate percentage
- Active users
- Database connection count
- Redis memory usage
- CPU and memory utilization
Database Dashboard
Include the following panels:
- Connection count over time
- Query performance metrics
- Table sizes
- Index usage
- Replication lag (if applicable)
- Database size growth
Application Dashboard
Include the following panels:
- HTTP request rate by endpoint
- WebSocket connection count
- User activity (threat models and diagrams created)
- API error rates by endpoint
- OAuth authentication success rate
Troubleshooting Monitoring Issues
Metrics Not Appearing
Symptom: Metrics are not showing in your monitoring system.
Note: TMI does not expose a /metrics endpoint natively. You need to configure external metrics collection.
Steps to check:
# Verify that the TMI root endpoint is responding
curl http://localhost:8080/
# Check Prometheus scrape configuration and targets
curl http://prometheus:9090/api/v1/targets
Log Collection Failing
Symptom: Logs are not appearing in your log aggregation system.
For Promtail:
# Check the Promtail container status
docker logs promtail
# Verify that log files exist and are readable
ls -la /var/log/tmi/
# Check the Promtail configuration
docker exec promtail cat /tmp/promtail-config.yaml
For ELK:
# Check Logstash status
systemctl status logstash
# Test Elasticsearch connectivity
curl http://elasticsearch:9200/_cluster/health
# Check the Logstash pipeline
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'
Alerts Not Firing
Steps to check:
# Verify AlertManager configuration
curl http://alertmanager:9093/api/v2/status
# Check alert rules
curl http://prometheus:9090/api/v1/rules
# Test notification channels by sending a test alert through your webhook or email
Best Practices
Monitoring Checklist
- Health checks configured and running
- Metrics collection enabled
- Log aggregation configured
- Critical alerts defined and tested
- Dashboards created and shared with the team
- Alert notification channels tested
- Runbooks created for common issues
- On-call rotation established
- Regular review of monitoring data scheduled
- Capacity planning based on trends
Retention Policies
Configure appropriate retention periods:
| Data Type | Retention |
|---|---|
| Metrics | 30-90 days (high-resolution), 1 year (aggregated) |
| Logs | 30-90 days (compliance dependent) |
| Alerts | 90 days of alert history |
| Dashboards | Version-controlled in Git |
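Log retention can be enforced with a cron-able find sweep. The sketch below demonstrates against a temporary directory and relies on GNU touch/find options (-d, -mtime, -delete); point it at /var/log/tmi in production:

```shell
# Delete rotated log files older than the retention window.
# Demonstrated in a temp dir; use /var/log/tmi in production.
RETENTION_DAYS=30
LOG_DIR=$(mktemp -d)
touch "$LOG_DIR/tmi.log"                              # current log - kept
touch -d "40 days ago" "$LOG_DIR/tmi-2024-12-15.log"  # stale - removed
find "$LOG_DIR" -name "*.log" -type f -mtime +"$RETENTION_DAYS" -delete
ls "$LOG_DIR"
```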
Security Considerations
- Protect monitoring endpoints with authentication
- Encrypt metrics and log data in transit
- Sanitize logs to remove sensitive data (see the redact_auth_tokens configuration option)
- Restrict access to monitoring dashboards
- Audit monitoring system access
Related Documentation
- Database-Operations -- Database management and monitoring
- Security-Operations -- Security monitoring and auditing
- Performance-and-Scaling -- Performance tuning guidance
- Post-Deployment -- Initial deployment verification
- Common-Issues -- Troubleshooting common problems
- Debugging-Guide -- Diagnostic procedures
Additional Resources
- Promtail Container Setup -- Detailed Promtail configuration (note: Promtail is EOL; use Grafana Alloy for new deployments)
- Prometheus Documentation
- Grafana Documentation
- PostgreSQL Monitoring