Monitoring and Health - ericfitz/tmi GitHub Wiki
Monitoring and Health
This guide covers health checks, metrics collection, log aggregation, alerting, and performance and security monitoring for TMI deployments.
Overview
Effective monitoring is critical for maintaining TMI's availability, performance, and security. This guide provides practical procedures for:
- Health checks and availability monitoring
- Metrics collection and visualization
- Log aggregation and analysis
- Alerting configuration
- Performance monitoring
- Security event monitoring
Quick Health Checks
Server Health
# Basic health check (the root endpoint returns API info with health status)
curl https://tmi.example.com/
# Expected response structure:
{
"status": {
"code": "ok", # "ok", "degraded", or "error"
"time": "2025-01-24T..."
},
"service": {
"name": "TMI",
"build": "1.3.2-abc1234" # format: version[-prerelease][+commit]
},
"api": {
"version": "1.4.0", # from OpenAPI spec, follows semver
"specification": "https://github.com/ericfitz/tmi/blob/main/api-schema/tmi-openapi.json"
},
"operator": { # optional, present only if configured
"name": "Acme Corp",
"contact": "[email protected]"
}
}
# When status is "degraded", the response includes health details:
# "health": {
# "database": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 3, "message": "..." },
# "redis": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 1, "message": "..." }
# }
# Check OAuth providers
curl https://tmi.example.com/oauth2/providers
Database Health
# PostgreSQL connection test
psql -h postgres-host -U tmi_user -d tmi -c "SELECT 1"
# Check database size
psql -h postgres-host -U tmi_user -d tmi -c "
SELECT pg_size_pretty(pg_database_size('tmi'))"
# Check table row counts
psql -h postgres-host -U tmi_user -d tmi -c "
SELECT schemaname, tablename, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC"
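A single failed probe does not always mean the database is down; transient network errors make one-shot checks flap. A small retry wrapper (a sketch; the psql connection test above can stand in for the probe command) declares failure only after repeated attempts:

```shell
# retry_check: run a probe command with retries before declaring failure.
# Usage: retry_check <attempts> <delay_seconds> <command...>
retry_check() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0          # probe succeeded
    fi
    sleep "$delay"      # wait before the next attempt
    i=$((i + 1))
  done
  return 1              # every attempt failed
}

# Example: retry the PostgreSQL connectivity test three times, 5s apart
# retry_check 3 5 psql -h postgres-host -U tmi_user -d tmi -c "SELECT 1"
```

The same wrapper works for the Redis ping and HTTP health checks in this guide.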
Redis Health
# Connection test
redis-cli -h redis-host -p 6379 -a password ping
# Expected: PONG
# Check memory usage
redis-cli -h redis-host -a password info memory | grep used_memory_human
# Check key count
redis-cli -h redis-host -a password DBSIZE
# Check cache hit rate
redis-cli -h redis-host -a password info stats | grep keyspace_hits
Monitoring Architecture
Observability Stack
[TMI Application] --> [Metrics Collection] --> [Time Series DB]
                  --> [Log Aggregation]    --> [Log Storage]
                  --> [Health Checks]      --> [Alerting System]
Key Components
| Component | Purpose | Recommended Tool |
|---|---|---|
| Metrics Collection | Application and system metrics | Prometheus |
| Log Aggregation | Centralized logging | Grafana Alloy/Loki (recommended), ELK Stack (alternative) |
| Health Monitoring | Service availability and performance | Built-in health endpoint |
| Alerting | Proactive issue notification | Prometheus AlertManager |
| Dashboards | Visualization | Grafana |
Note: Promtail reaches End-of-Life on March 2, 2026. Use Grafana Alloy for new deployments. See TMI-Promtail-Logger for the legacy Promtail setup reference.
Metrics Collection
Application Metrics
TMI tracks performance metrics internally through api/performance_monitor.go.
HTTP Metrics
Key metrics to track:
- Request rates (requests per second)
- Response times (P50, P95, P99 percentiles)
- Error rates (4xx and 5xx responses)
- Request and response sizes
- Concurrent request count
WebSocket Metrics
Track real-time collaboration health:
- Active WebSocket connections
- Connection establishment rate
- Message throughput (messages per second)
- Connection duration
- WebSocket errors and disconnections
For details on the WebSocket protocol, see WebSocket-API-Reference.
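Absent a metrics exporter, connection churn can be approximated from the structured logs. The sketch below uses hypothetical event names (websocket_connected, websocket_disconnected) and inline sample lines; substitute the real log file and whatever messages TMI emits at your log level:

```shell
# Approximate current active WebSocket connections from structured logs:
# connects minus disconnects. Event names are illustrative, not TMI's
# actual log messages; the heredoc stands in for /var/log/tmi/tmi.log.
active=$(awk '
/websocket_connected/    { opened++ }
/websocket_disconnected/ { closed++ }
END { print opened - closed }
' <<'EOF'
{"level":"info","msg":"websocket_connected","session":"a"}
{"level":"info","msg":"websocket_connected","session":"b"}
{"level":"info","msg":"websocket_disconnected","session":"a"}
EOF
)
echo "active websocket connections: $active"
```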
Business Metrics
Monitor feature usage:
- User activity (daily and monthly active users)
- Threat model creation rate
- Diagram creation and editing activity
- Collaboration session counts
- API client integration health
Database Metrics
PostgreSQL Monitoring
-- Active connections
SELECT count(*) FROM pg_stat_activity;
-- Long-running queries (over 5 minutes)
SELECT
pid,
now() - query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 minutes';
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Index usage statistics
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Database size and growth
SELECT
pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;
For more on database administration, see Database-Operations.
Redis Monitoring
# Memory usage and stats
redis-cli -h redis-host -a password info memory
# Key distribution by pattern
redis-cli -h redis-host -a password --scan --pattern "cache:*" | wc -l
redis-cli -h redis-host -a password --scan --pattern "session:*" | wc -l
# Cache hit rate
redis-cli -h redis-host -a password info stats | grep -E "keyspace_hits|keyspace_misses"
# Slow queries
redis-cli -h redis-host -a password slowlog get 10
# Client connections
redis-cli -h redis-host -a password client list
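One INFO field worth watching specifically is mem_fragmentation_ratio: values well above 1.5 indicate memory fragmentation, while values below 1.0 suggest Redis memory is being swapped. A parsing sketch (the inline INFO lines are a sample; in practice, pipe in real output from redis-cli info memory):

```shell
# Extract the memory fragmentation ratio from Redis INFO output.
# The heredoc stands in for: redis-cli -h redis-host -a password info memory
ratio=$(awk -F: '/^mem_fragmentation_ratio/ { print $2+0 }' <<'EOF'
used_memory:1048576
used_memory_human:1.00M
mem_fragmentation_ratio:1.08
EOF
)
echo "fragmentation ratio: $ratio"
```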
System Metrics
Monitor the following infrastructure resources:
| Metric | What to Watch |
|---|---|
| CPU Utilization | Overall and per-core usage |
| Memory Usage | Application memory, available memory, swap usage |
| Disk I/O | Read/write operations, disk latency |
| Network | Bandwidth utilization, connection counts |
| File Descriptors | Open file descriptor count |
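File descriptor exhaustion is easy to miss until connections start failing. On Linux, a process's open descriptors can be compared against its limit via /proc (demonstrated here against the current shell; substitute the TMI server's PID):

```shell
# Compare open file descriptors against the process limit (Linux only).
# Uses the current shell's PID for demonstration; in production, use
# the TMI server process ID instead.
pid=$$
open_fds=$(ls /proc/$pid/fd | wc -l)
limit=$(ulimit -n)
echo "open fds: $open_fds / $limit"
```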
Log Aggregation
Structured Logging
TMI uses structured JSON logging through the internal/slogging package.
Log Configuration
# Configuration options from internal/config/config.go (LoggingConfig struct)
logging:
level: "info" # debug, info, warn, error
is_dev: true # Development mode flag (default: true)
is_test: false # Test mode flag (default: false)
log_dir: "logs" # Default: "logs"
max_age_days: 7 # Log retention (default: 7)
max_size_mb: 100 # Max file size (default: 100)
max_backups: 10 # Number of rotated files (default: 10)
also_log_to_console: true # Dual logging (default: true)
log_api_requests: false # Request logging
log_api_responses: false # Response logging
log_websocket_messages: false # WebSocket message logging
redact_auth_tokens: false # Security redaction of auth tokens
suppress_unauthenticated_logs: true # Suppress logs for unauthenticated requests (default: true)
For the complete set of configuration options, see Configuration-Reference.
Log Categories
| Category | Description |
|---|---|
| Application Logs | Business logic events |
| Access Logs | HTTP request and response records |
| Security Logs | Authentication and authorization events |
| Error Logs | Exceptions and error conditions |
| Performance Logs | Request timing and resource usage |
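For quick triage before reaching for a log aggregation UI, per-level counts can be tallied straight from the JSON log files. A sketch (the inline lines are samples standing in for /var/log/tmi/tmi.log):

```shell
# Tally structured log lines by level for quick triage.
# The heredoc stands in for the real log file.
tally=$(awk -F'"level":"' 'NF > 1 { split($2, a, "\""); count[a[1]]++ }
END { for (lvl in count) print lvl, count[lvl] }' <<'EOF'
{"level":"info","msg":"request completed"}
{"level":"error","msg":"db timeout"}
{"level":"info","msg":"request completed"}
EOF
)
echo "$tally"
```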
Promtail Log Collection (Legacy)
TMI documentation includes a containerized Promtail setup for shipping logs to Grafana Cloud or Loki.
Important: Promtail reaches End-of-Life on March 2, 2026. Grafana Alloy is the recommended replacement for new deployments. The Promtail container setup documentation is retained for reference at
docs/migrated/developer/setup/promtail-container.md. See also TMI-Promtail-Logger.
Starting Promtail
The Promtail Make targets (build-promtail, start-promtail) are no longer included in the project Makefile. To run Promtail manually with Docker:
# Build and run the Promtail container manually
# (see docs/migrated/developer/setup/promtail-container.md)
# Or run with explicit credentials:
LOKI_URL="https://user:[email protected]/api/prom/push" docker run ...
# Check Promtail status
docker logs promtail
Promtail Configuration
Promtail monitors these log locations:
- Development: ./logs/tmi.log, ./logs/server.log
- Production: /var/log/tmi/tmi.log
Configuration details are documented in docs/migrated/developer/setup/promtail-container.md.
Verifying Log Collection
# Confirm Promtail is collecting logs
docker logs promtail 2>&1 | grep "Adding target"
# Expected output:
# level=info msg="Adding target" key="/logs/tmi.log:..."
# level=info msg="Adding target" key="/var/log/tmi/tmi.log:..."
# Check for errors
docker logs promtail 2>&1 | grep -i error
ELK Stack Integration
You can use Elasticsearch, Logstash, and Kibana as an alternative log aggregation stack.
Logstash Configuration
# logstash.conf
input {
file {
path => "/var/log/tmi/*.log"
start_position => "beginning"
codec => json
}
}
filter {
if [level] == "error" {
mutate {
add_tag => ["error"]
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "tmi-logs-%{+YYYY.MM.dd}"
}
}
Querying Logs in Kibana
Common queries:
# All errors
level: "error"
# Authentication failures
level: "error" AND message: "authentication"
# Slow requests (over 2 seconds)
duration_ms > 2000
# Specific user activity
user_email: "[email protected]"
# WebSocket events
message: "websocket"
Health Checks
Service Health Endpoints
API Health Check
The root endpoint (/) provides comprehensive health information:
# The root endpoint returns API info with health status
curl https://tmi.example.com/
# Response structure:
{
"status": {
"code": "ok", # "ok" when healthy, "degraded" when issues detected, "error" on critical failure
"time": "2025-01-24T12:00:00Z"
},
"service": {
"name": "TMI",
"build": "1.3.2-abc1234" # format: version[-prerelease][+commit]
},
"api": {
"version": "1.4.0", # from OpenAPI spec, follows semver
"specification": "https://github.com/ericfitz/tmi/blob/main/api-schema/tmi-openapi.json"
},
"operator": { # optional, present only if configured by operator
"name": "...",
"contact": "..."
}
}
# When "degraded", health details are included:
# "health": {
# "database": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 3, "message": "..." },
# "redis": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 1, "message": "..." }
# }
# OAuth provider check
curl https://tmi.example.com/oauth2/providers
# Response lists the enabled providers
Automated Health Monitoring
Create a health check script:
#!/bin/bash
# health-check.sh
HEALTH_URL="https://tmi.example.com/"
LOG_FILE="/var/log/tmi/health-check.log"
RESPONSE=$(curl -s "$HEALTH_URL")
STATUS=$(echo "$RESPONSE" | jq -r '.status.code')
if [ "$STATUS" = "ok" ]; then
echo "$(date): TMI server is healthy" >> "$LOG_FILE"
exit 0
else
echo "$(date): TMI server status: $STATUS" >> "$LOG_FILE"
# Send alert
curl -X POST https://alerts.example.com/webhook \
-d '{"service": "tmi", "status": "'"$STATUS"'"}'
exit 1
fi
Schedule the script with cron:
# Check every 5 minutes
*/5 * * * * /usr/local/bin/health-check.sh
Kubernetes Probes
If you are running TMI on Kubernetes, configure liveness and readiness probes:
livenessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 2
For container deployment details, see OCI-Container-Deployment.
Alerting Configuration
Alert Categories
Critical Alerts (Immediate Response)
| Alert | Trigger Condition |
|---|---|
| Service Down | TMI server unavailable |
| Database Failure | PostgreSQL connection failures |
| Authentication Outage | OAuth provider failures |
| High Error Rate | >5% error rate sustained for 5+ minutes |
| Resource Exhaustion | >90% CPU or memory usage |
Warning Alerts (Monitored Response)
| Alert | Trigger Condition |
|---|---|
| Performance Degradation | Response times >2x baseline |
| Cache Issues | Redis connection problems or high miss rate |
| Storage Issues | Disk usage >80% |
| Backup Failures | Database backup failures |
| Integration Issues | Client integration problems |
Info Alerts (Awareness Only)
- Capacity planning: resource usage trends
- Performance trends: gradual performance changes
- Usage patterns: user activity changes
- Security events: unusual authentication patterns
- Maintenance reminders: certificate renewal, updates
Alert Examples
Service Unavailable
# Prometheus AlertManager example
- alert: TMIServiceDown
expr: up{job="tmi-server"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "TMI service is down"
description: "TMI server has been down for more than 2 minutes"
High Error Rate
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
Database Connection Failure
- alert: DatabaseConnectionFailure
expr: postgresql_up{job="postgres"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection failure"
description: "Cannot connect to PostgreSQL database"
Notification Channels
Configure multiple notification channels for comprehensive coverage:
| Channel | Use Case |
|---|---|
| Email | Critical alerts |
| Slack / Teams | Team notifications |
| PagerDuty / OpsGenie | On-call escalation |
| Webhooks | Custom integrations (see Webhook-Integration) |
Performance Monitoring
Application Performance
Response Time Monitoring
Track response time percentiles against the following targets:
| Percentile | Target |
|---|---|
| P50 (Median) | <100ms |
| P95 | <500ms |
| P99 | <1000ms |
| P99.9 | Track for outlier detection |
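TMI does not expose a Prometheus /metrics endpoint natively, so latency histograms must come from a reverse proxy or sidecar exporter. Assuming such an exporter publishes a histogram named http_request_duration_seconds_bucket (an assumption, not a metric TMI emits), the P95 target above could be enforced with a rule like this sketch:

```yaml
# Sketch only: http_request_duration_seconds_bucket is assumed to be
# exported by a proxy or sidecar, not by TMI itself
- alert: HighP95Latency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency above the 500ms target"
    description: "P95 latency is {{ $value }}s over the last 5 minutes"
```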
Throughput Monitoring
Monitor requests per second:
- Baseline throughput under normal load
- Peak throughput capacity
- Sustained throughput over time
For additional performance tuning guidance, see Performance-and-Scaling.
Resource Usage
Track application resource consumption:
# For a systemd service
systemctl status tmi
# For a containerized deployment
docker stats tmi-server
# For Kubernetes
kubectl top pod -n tmi
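For a threshold check that works anywhere Linux's /proc is available, available system memory can be computed directly from /proc/meminfo and compared against the >90% usage alert threshold:

```shell
# Percentage of system memory still available (Linux /proc/meminfo).
mem_pct=$(awk '
/^MemTotal:/     { total = $2 }
/^MemAvailable:/ { avail = $2 }
END { printf "%.0f", (avail / total) * 100 }
' /proc/meminfo)
echo "memory available: ${mem_pct}%"
```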
Database Performance
Query Performance Analysis
-- Enable the pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slow queries (over 100ms average)
-- Note: column names shown are for PostgreSQL 13+;
-- on PostgreSQL 12 and earlier use mean_time / total_time instead
SELECT
query,
mean_exec_time,
calls,
total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 20;
-- Query performance by table
SELECT
schemaname,
tablename,
seq_scan,
idx_scan,
n_tup_ins,
n_tup_upd,
n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
Connection Pool Monitoring
Monitor database connection usage:
-- Active connections by state
SELECT
state,
count(*)
FROM pg_stat_activity
GROUP BY state;
-- Long-running transactions (over 1 minute)
SELECT
pid,
now() - xact_start AS duration,
state,
query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 minute'
ORDER BY duration DESC;
Cache Performance
Redis Performance Metrics
# Cache hit rate calculation
redis-cli -h redis-host -a password info stats | \
awk -F: '/^keyspace_hits|^keyspace_misses/ {
if ($1 ~ /hits/) hits=$2+0;
else misses=$2+0
}
END {
total=hits+misses;
if (total > 0)
printf "Hit Rate: %.2f%%\n", (hits/total)*100;
else
print "Hit Rate: n/a (no keyspace activity)"
}'
# Monitor cache latency
redis-cli -h redis-host -a password --latency-history
# Check slow commands
redis-cli -h redis-host -a password slowlog get 10
Security Monitoring
Authentication Events
Monitor authentication and authorization activity:
# View authentication logs
tail -f /var/log/tmi/tmi.log | grep -E "authentication|authorization"
# Count failed login attempts
grep "authentication failed" /var/log/tmi/tmi.log | wc -l
# Identify suspicious activity (group failures by source)
grep "authentication failed" /var/log/tmi/tmi.log | \
awk '{print $NF}' | sort | uniq -c | sort -rn
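The grouping command above can be extended into a simple burst detector that flags sources exceeding the "more than 5 failed attempts" threshold used in the security alerts below. The log line layout (source as the last whitespace-delimited field) is an assumption carried over from the grouping command; the inline lines are samples:

```shell
# Flag sources with more than 5 failed authentication attempts.
# Assumes the source is the last field on each line; the heredoc
# stands in for the real log file.
offenders=$(awk '/authentication failed/ { count[$NF]++ }
END { for (src in count) if (count[src] > 5) print src, count[src] }' <<'EOF'
2025-01-24T10:00:01Z authentication failed 203.0.113.9
2025-01-24T10:00:02Z authentication failed 203.0.113.9
2025-01-24T10:00:03Z authentication failed 203.0.113.9
2025-01-24T10:00:04Z authentication failed 203.0.113.9
2025-01-24T10:00:05Z authentication failed 203.0.113.9
2025-01-24T10:00:06Z authentication failed 203.0.113.9
2025-01-24T10:01:00Z authentication failed 198.51.100.4
EOF
)
echo "offenders: $offenders"
```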
For complete security monitoring procedures, see Security-Operations.
Security Alerts
Set up alerts for the following security events:
- Failed authentication attempts (more than 5 in 5 minutes)
- Unauthorized access attempts
- Suspicious API usage patterns
- Certificate expiration warnings
- Unusual data access patterns
See also Security-Best-Practices for hardening recommendations.
Dashboards
Grafana Dashboard Examples
System Overview Dashboard
Include the following panels:
- Service uptime percentage
- Request rate (requests per second)
- Response time percentiles
- Error rate percentage
- Active users
- Database connection count
- Redis memory usage
- CPU and memory utilization
Database Dashboard
Include the following panels:
- Connection count over time
- Query performance metrics
- Table sizes
- Index usage
- Replication lag (if applicable)
- Database size growth
Application Dashboard
Include the following panels:
- HTTP request rate by endpoint
- WebSocket connection count
- User activity (threat models and diagrams created)
- API error rates by endpoint
- OAuth authentication success rate
Troubleshooting Monitoring Issues
Metrics Not Appearing
Symptom: Metrics are not showing in your monitoring system.
Note: TMI does not expose a /metrics endpoint natively. You need to configure external metrics collection.
Steps to check:
# Verify that the TMI root endpoint is responding
curl http://localhost:8080/
# Check Prometheus scrape configuration and targets
curl http://prometheus:9090/api/v1/targets
Log Collection Failing
Symptom: Logs are not appearing in your log aggregation system.
For Promtail:
# Check the Promtail container status
docker logs promtail
# Verify that log files exist and are readable
ls -la /var/log/tmi/
# Check the Promtail configuration
docker exec promtail cat /tmp/promtail-config.yaml
For ELK:
# Check Logstash status
systemctl status logstash
# Test Elasticsearch connectivity
curl http://elasticsearch:9200/_cluster/health
# Check the Logstash pipeline
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'
Alerts Not Firing
Steps to check:
# Verify AlertManager configuration
curl http://alertmanager:9093/api/v2/status
# Check alert rules
curl http://prometheus:9090/api/v1/rules
# Test notification channels by sending a test alert through your webhook or email
Best Practices
Monitoring Checklist
- Health checks configured and running
- Metrics collection enabled
- Log aggregation configured
- Critical alerts defined and tested
- Dashboards created and shared with the team
- Alert notification channels tested
- Runbooks created for common issues
- On-call rotation established
- Regular review of monitoring data scheduled
- Capacity planning based on trends
Retention Policies
Configure appropriate retention periods:
| Data Type | Retention |
|---|---|
| Metrics | 30-90 days (high-resolution), 1 year (aggregated) |
| Logs | 30-90 days (compliance dependent) |
| Alerts | 90 days of alert history |
| Dashboards | Version-controlled in Git |
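Log retention can be enforced with a cron-able find sweep. The sketch below demonstrates against a temporary directory and relies on GNU touch/find options (-d, -mtime, -delete); point it at /var/log/tmi in production:

```shell
# Delete rotated log files older than the retention window.
# Demonstrated in a temp dir; use /var/log/tmi in production.
RETENTION_DAYS=30
LOG_DIR=$(mktemp -d)
touch "$LOG_DIR/tmi.log"                              # current log - kept
touch -d "40 days ago" "$LOG_DIR/tmi-2024-12-15.log"  # stale - removed
find "$LOG_DIR" -name "*.log" -type f -mtime +"$RETENTION_DAYS" -delete
ls "$LOG_DIR"
```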
Security Considerations
- Protect monitoring endpoints with authentication
- Encrypt metrics and log data in transit
- Sanitize logs to remove sensitive data (see the redact_auth_tokens configuration option)
- Restrict access to monitoring dashboards
- Audit monitoring system access
Related Documentation
- Database-Operations -- Database management and monitoring
- Security-Operations -- Security monitoring and auditing
- Performance-and-Scaling -- Performance tuning guidance
- Post-Deployment -- Initial deployment verification
- Common-Issues -- Troubleshooting common problems
- Debugging-Guide -- Diagnostic procedures
Additional Resources
- Promtail Container Setup -- Detailed Promtail configuration (note: Promtail is EOL; use Grafana Alloy for new deployments)
- Prometheus Documentation
- Grafana Documentation
- PostgreSQL Monitoring