Maintenance Tasks - ericfitz/tmi GitHub Wiki

Maintenance Tasks

This guide covers routine maintenance tasks for TMI operations, including software updates, certificate renewal, log rotation, and scheduled maintenance windows.

Overview

Regular maintenance keeps TMI reliable and secure. The sections below cover:

Daily, weekly, and monthly maintenance tasks
Automated maintenance procedures
Certificate renewal
Log rotation and cleanup
Software updates
Backup verification
Database maintenance

Maintenance Schedule

Daily Tasks

Automated:

Health check monitoring (continuous)
Log collection and aggregation
Backup execution (2 AM daily)
Metric collection and alerting

Manual (if needed):

Review critical alerts
Check service status
Monitor error rates

Weekly Tasks

Review backup integrity
Check certificate expiration dates
Review security alerts and logs
Check disk space usage
Review application logs for errors
Monitor performance trends

Monthly Tasks

Apply security updates
Review and optimize database performance
Clean up old logs and backups
Review user access and permissions
Test disaster recovery procedures
Review capacity and scaling needs
Update documentation

Quarterly Tasks

Full security audit
Disaster recovery test
Performance benchmarking
Dependency updates
Review and update runbooks
Team training updates

Automated Maintenance

Systemd Timers

Create automated maintenance tasks with systemd timers.

Daily Backup Timer

Create /etc/systemd/system/tmi-backup.service:

[Unit]
Description=TMI Database Backup
After=network.target postgresql.service

[Service]
Type=oneshot
User=tmi
ExecStart=/usr/local/bin/backup-tmi.sh

Create /etc/systemd/system/tmi-backup.timer:

[Unit]
Description=TMI Daily Backup Timer

[Timer]
# Run every day at 02:00; the timer triggers tmi-backup.service by name
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

Enable and start:

sudo systemctl enable tmi-backup.timer
sudo systemctl start tmi-backup.timer

# Check timer status
systemctl list-timers tmi-backup.timer

Weekly Maintenance Timer

Create /etc/systemd/system/tmi-maintenance.service:

[Unit]
Description=TMI Weekly Maintenance
After=network.target

[Service]
Type=oneshot
User=tmi
ExecStart=/usr/local/bin/tmi-weekly-maintenance.sh

Create /etc/systemd/system/tmi-maintenance.timer:

[Unit]
Description=TMI Weekly Maintenance Timer

[Timer]
# Run every Sunday at 03:00; the timer triggers tmi-maintenance.service by name
OnCalendar=Sun 03:00
Persistent=true

[Install]
WantedBy=timers.target

Enable and start:

sudo systemctl enable --now tmi-maintenance.timer

Cron Jobs

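As an alternative to systemd timers, schedule the jobs with cron. Use one mechanism or the other for a given job so it does not run twice: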

# Edit crontab
crontab -e

# Daily backup at 2 AM
0 2 * * * /usr/local/bin/backup-tmi.sh

# Weekly maintenance on Sunday at 3 AM
0 3 * * 0 /usr/local/bin/tmi-weekly-maintenance.sh

# Daily log rotation at midnight
0 0 * * * /usr/local/bin/rotate-tmi-logs.sh

# Certificate check daily
0 6 * * * /usr/local/bin/check-tmi-certs.sh

# Monthly cleanup on first day at 4 AM
0 4 1 * * /usr/local/bin/tmi-monthly-cleanup.sh
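
Cron starts each job on schedule whether or not the previous run has finished, so a slow backup could overlap with the next one. The following is a minimal sketch of one way to guard against that, assuming `flock` from util-linux is installed. The `run_locked` helper name and the lock-file path are illustrative and not part of TMI.

```shell
#!/bin/bash
# Hypothetical helper: run a command only if nobody else holds the lock.
# Assumes flock(1) from util-linux is available.
run_locked() {
    local lock_file=$1
    shift
    # -n: fail immediately instead of waiting when the lock is already held
    flock -n "$lock_file" "$@"
}

# Example crontab entry calling flock directly (lock path is an assumption):
# 0 2 * * * flock -n /var/lock/tmi-backup.lock /usr/local/bin/backup-tmi.sh
```

With this pattern a second invocation exits with a non-zero status instead of running alongside the first.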

Certificate Management

Certificate Renewal

Let's Encrypt (Automatic)

When certbot is installed with its systemd timer, Let's Encrypt certificates renew automatically:

# Check certbot timer status
systemctl status certbot.timer

# Test renewal
sudo certbot renew --dry-run

# Force renewal (if needed)
sudo certbot renew --force-renewal

# Check renewal logs
sudo tail -f /var/log/letsencrypt/letsencrypt.log

Manual Certificate Renewal

Create renewal script /usr/local/bin/renew-tmi-cert.sh:

#!/bin/bash
# TMI Certificate Renewal Script

CERT_DIR="/etc/tmi/certs"
BACKUP_DIR="/var/backups/tmi/certs"
LOG_FILE="/var/log/tmi/cert-renewal.log"
DAYS_BEFORE_EXPIRY=30

# Function to check certificate expiration (uses GNU date for -d)
check_expiry() {
    local cert_file=$1
    local expiry_date expiry_epoch now_epoch
    expiry_date=$(openssl x509 -enddate -noout -in "$cert_file" | cut -d= -f2)
    expiry_epoch=$(date -d "$expiry_date" +%s)
    now_epoch=$(date +%s)
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Check if renewal is needed
days_left=$(check_expiry "$CERT_DIR/server.crt")

if [ $days_left -le $DAYS_BEFORE_EXPIRY ]; then
    echo "$(date): Certificate expires in $days_left days, renewing..." >> $LOG_FILE

    # Backup old certificates
    mkdir -p $BACKUP_DIR
    cp $CERT_DIR/server.crt $BACKUP_DIR/server.crt.$(date +%Y%m%d)
    cp $CERT_DIR/server.key $BACKUP_DIR/server.key.$(date +%Y%m%d)

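    # Generate new certificate (modify for your CA/provider).
    # This example issues a self-signed certificate. Abort if generation
    # fails so the existing certificate stays in place.
    if ! openssl req -x509 -newkey rsa:4096 -nodes \
        -keyout "$CERT_DIR/server.key.new" \
        -out "$CERT_DIR/server.crt.new" \
        -days 365 \
        -subj "/CN=tmi.example.com"; then
        echo "$(date): Certificate generation failed" >> "$LOG_FILE"
        exit 1
    fi

    # Install new certificates
    mv "$CERT_DIR/server.key.new" "$CERT_DIR/server.key"
    mv "$CERT_DIR/server.crt.new" "$CERT_DIR/server.crt"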

    # Set permissions
    chmod 600 $CERT_DIR/server.key
    chmod 644 $CERT_DIR/server.crt
    chown tmi:tmi $CERT_DIR/*

    # Restart TMI server
    systemctl restart tmi

    echo "$(date): Certificate renewal completed" >> $LOG_FILE

    # Send notification
    echo "TMI certificate renewed successfully" | \
        mail -s "Certificate Renewal Completed" [email protected]
else
    echo "$(date): Certificate valid for $days_left days, no renewal needed" >> $LOG_FILE
fi

Make executable and schedule:

chmod +x /usr/local/bin/renew-tmi-cert.sh

# Run daily
crontab -e
0 6 * * * /usr/local/bin/renew-tmi-cert.sh
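
Before a new certificate is moved into place and the service restarted, it can help to confirm that the certificate and private key actually belong together. The sketch below compares their public keys with the `openssl` CLI. The `cert_matches_key` name is hypothetical, and where to call it in the renewal script is left to the operator.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the certificate and the private key
# share the same public key. Assumes the openssl CLI is installed.
cert_matches_key() {
    local cert_file=$1
    local key_file=$2
    local cert_pub key_pub
    # Extract the public key from each file in PEM form
    cert_pub=$(openssl x509 -noout -pubkey -in "$cert_file") || return 1
    key_pub=$(openssl pkey -pubout -in "$key_file") || return 1
    [ "$cert_pub" = "$key_pub" ]
}

# Example usage (paths follow the renewal script):
# cert_matches_key /etc/tmi/certs/server.crt.new /etc/tmi/certs/server.key.new \
#     || { echo "certificate and key do not match"; exit 1; }
```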

Certificate Monitoring

Create monitoring script /usr/local/bin/check-tmi-certs.sh:

#!/bin/bash
# TMI Certificate Monitoring Script

CERT_FILE="/etc/tmi/certs/server.crt"
ALERT_DAYS=30

# Get expiration date
expiry_date=$(openssl x509 -enddate -noout -in $CERT_FILE | cut -d= -f2)
expiry_epoch=$(date -d "$expiry_date" +%s)
now_epoch=$(date +%s)
days_remaining=$(( ($expiry_epoch - $now_epoch) / 86400 ))

# Alert if expiring soon
if [ $days_remaining -le $ALERT_DAYS ]; then
    echo "WARNING: TMI certificate expires in $days_remaining days" | \
        mail -s "Certificate Expiration Warning" [email protected]
fi

# Log status
echo "$(date): Certificate expires in $days_remaining days" >> /var/log/tmi/cert-check.log
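
The monitoring script computes the remaining days with `date -d`, which is a GNU extension. As a hedged alternative, `openssl x509 -checkend` can answer the yes/no question "does this certificate expire within N seconds?" without any date parsing. The `cert_expires_within` helper below is an illustrative name, not part of TMI.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if the certificate expires within
# the given number of days. Assumes the openssl CLI is installed.
cert_expires_within() {
    local cert_file=$1
    local days=$2
    # -checkend exits 0 when the certificate is still valid after that many
    # seconds, so the result is inverted to mean "expires within".
    # An unreadable certificate also counts as expiring, which errs on the
    # side of raising an alert.
    ! openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$cert_file" >/dev/null 2>&1
}

# Example usage (path taken from the monitoring script):
# if cert_expires_within /etc/tmi/certs/server.crt 30; then
#     echo "WARNING: TMI certificate expires within 30 days"
# fi
```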

Log Management

Log Rotation

TMI includes automatic log rotation, but you can configure additional rotation with logrotate.

Configure Logrotate

Create /etc/logrotate.d/tmi:

/var/log/tmi/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 tmi tmi
    sharedscripts
    postrotate
        # Restart or reload if needed
        systemctl reload tmi > /dev/null 2>&1 || true
    endscript
}

Manual Log Rotation

# Force log rotation
logrotate -f /etc/logrotate.d/tmi

# Test configuration
logrotate -d /etc/logrotate.d/tmi

Log Cleanup Script

Create /usr/local/bin/cleanup-tmi-logs.sh:

#!/bin/bash
# TMI Log Cleanup Script

LOG_DIR="/var/log/tmi"
RETENTION_DAYS=90

# Delete rotated logs older than the retention period
find "$LOG_DIR" -name "*.log.*" -mtime +"$RETENTION_DAYS" -delete

# Delete compressed logs older than the retention period
find "$LOG_DIR" -name "*.gz" -mtime +"$RETENTION_DAYS" -delete

# Log cleanup action
echo "$(date): Cleaned up logs older than $RETENTION_DAYS days" >> $LOG_DIR/cleanup.log

# Check disk space
disk_usage=$(df -h $LOG_DIR | awk 'NR==2 {print $5}' | sed 's/%//')
if [ $disk_usage -gt 80 ]; then
    echo "WARNING: Log directory is ${disk_usage}% full" | \
        mail -s "TMI Log Disk Space Warning" [email protected]
fi

Schedule:

# Monthly on first day
0 4 1 * * /usr/local/bin/cleanup-tmi-logs.sh
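
Because `find ... -delete` is irreversible, a dry run that only lists the matching files can be useful before scheduling the cleanup. The sketch below uses the same name patterns and age test as the cleanup script but prints instead of deleting. The `list_expired_logs` name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list rotated or compressed logs older than N days
# without deleting anything.
list_expired_logs() {
    local log_dir=$1
    local retention_days=$2
    # Same patterns as the cleanup script, with -print in place of -delete
    find "$log_dir" -type f \( -name "*.log.*" -o -name "*.gz" \) \
        -mtime +"$retention_days" -print
}

# Example usage:
# list_expired_logs /var/log/tmi 90
```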

Archive Logs

Archive old logs to long-term storage:

#!/bin/bash
# Archive logs to S3/cloud storage

ARCHIVE_DIR="/var/log/tmi/archive"
S3_BUCKET="s3://my-tmi-logs"
DATE=$(date -d "last month" +%Y-%m)

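# Create archive (--remove-files deletes each rotated log once it is archived)
mkdir -p "$ARCHIVE_DIR"
tar -czf "$ARCHIVE_DIR/tmi-logs-$DATE.tar.gz" \
    --remove-files \
    /var/log/tmi/*.log.*

# Upload to S3. Remove the local archive only if the upload succeeded,
# because the original logs are already gone at this point.
if aws s3 cp "$ARCHIVE_DIR/tmi-logs-$DATE.tar.gz" "$S3_BUCKET/"; then
    rm "$ARCHIVE_DIR/tmi-logs-$DATE.tar.gz"
    echo "$(date): Archived and uploaded logs for $DATE" >> /var/log/tmi/archive.log
else
    echo "$(date): Upload failed, keeping $ARCHIVE_DIR/tmi-logs-$DATE.tar.gz" >> /var/log/tmi/archive.log
    exit 1
fi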

Database Maintenance

Vacuum and Analyze

Create /usr/local/bin/tmi-db-maintenance.sh:

#!/bin/bash
# TMI Database Maintenance Script

POSTGRES_HOST="postgres-host"
POSTGRES_USER="tmi_user"
POSTGRES_DB="tmi"
LOG_FILE="/var/log/tmi/db-maintenance.log"

echo "$(date): Starting database maintenance" >> $LOG_FILE

# Vacuum and analyze all tables
# (ON_ERROR_STOP makes psql exit non-zero when the SQL fails)
psql -v ON_ERROR_STOP=1 -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" << EOF
VACUUM ANALYZE;
EOF

if [ $? -eq 0 ]; then
    echo "$(date): Database maintenance completed successfully" >> $LOG_FILE
else
    echo "$(date): Database maintenance failed" >> $LOG_FILE
    echo "Database maintenance failed" | \
        mail -s "TMI Database Maintenance Failed" [email protected]
fi

# Check database size
db_size=$(psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -t -c \
    "SELECT pg_size_pretty(pg_database_size('$POSTGRES_DB'))")
echo "$(date): Database size: $db_size" >> $LOG_FILE

# Check table bloat
psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -c "
    SELECT
        schemaname,
        relname,
        n_dead_tup,
        last_vacuum,
        last_autovacuum
    FROM pg_stat_user_tables
    WHERE n_dead_tup > 1000
    ORDER BY n_dead_tup DESC
    LIMIT 10" >> $LOG_FILE

Schedule weekly:

# Sunday at 3 AM
0 3 * * 0 /usr/local/bin/tmi-db-maintenance.sh
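
Scripts run from cron cannot answer a password prompt, so `psql` needs credentials from a non-interactive source. One common option is a `~/.pgpass` file owned by the user that runs the job; whether a given TMI deployment uses this is an assumption. The sketch below formats a single entry, escaping `:` and `\` as the file format requires. Both helper names are hypothetical.

```shell
#!/bin/bash
# Escape ':' and '\' as required by the .pgpass file format.
pgpass_escape() {
    local value=$1
    value=${value//\\/\\\\}   # backslash -> two backslashes (must run first)
    value=${value//:/\\:}     # colon -> backslash-colon
    printf '%s' "$value"
}

# Hypothetical helper: print one .pgpass line
# in the form host:port:database:user:password
pgpass_entry() {
    printf '%s:%s:%s:%s:%s\n' \
        "$(pgpass_escape "$1")" "$(pgpass_escape "$2")" "$(pgpass_escape "$3")" \
        "$(pgpass_escape "$4")" "$(pgpass_escape "$5")"
}

# Example usage (values are placeholders; libpq ignores the file unless
# its permissions are 0600):
# pgpass_entry postgres-host 5432 tmi tmi_user 'secret' >> ~/.pgpass
# chmod 600 ~/.pgpass
```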

Index Maintenance

Create /usr/local/bin/tmi-reindex.sh:

#!/bin/bash
# TMI Index Maintenance (reindex fragmented indexes)

POSTGRES_HOST="postgres-host"
POSTGRES_USER="tmi_user"
POSTGRES_DB="tmi"
LOG_FILE="/var/log/tmi/reindex.log"

echo "$(date): Starting index maintenance" >> $LOG_FILE

# Rebuild all indexes without blocking writes
# (CONCURRENTLY requires PostgreSQL 12+ and skips system catalogs)
psql -v ON_ERROR_STOP=1 -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" << EOF
REINDEX DATABASE CONCURRENTLY $POSTGRES_DB;
EOF

if [ $? -eq 0 ]; then
    echo "$(date): Reindex completed successfully" >> $LOG_FILE
else
    echo "$(date): Reindex failed" >> $LOG_FILE
fi

Schedule monthly:

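# First Sunday at 4 AM. Cron runs a job when EITHER the day-of-month OR the
# day-of-week field matches, so restrict the dates and test the weekday.
0 4 1-7 * * [ "$(date +\%u)" -eq 7 ] && /usr/local/bin/tmi-reindex.sh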
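
Cron treats the day-of-month and day-of-week fields as alternatives when both are restricted, so a first-Sunday schedule needs an explicit weekday check somewhere. One option is a guard inside the script itself, sketched below under the assumption that GNU `date` is available. The `is_first_sunday` name is illustrative.

```shell
#!/bin/bash
# Hypothetical guard: succeed only when the given date (default: today)
# is the first Sunday of its month. Assumes GNU date for the -d option.
is_first_sunday() {
    local day=${1:-today}
    local dow dom
    dow=$(date -d "$day" +%u)   # ISO weekday: 1 = Monday ... 7 = Sunday
    dom=$(date -d "$day" +%d)   # day of month, zero padded
    [ "$dow" -eq 7 ] && [ "$dom" -le 7 ]
}

# Example usage at the top of a script scheduled every Sunday:
# is_first_sunday || exit 0
```

With the guard in the script, the crontab entry can simply run every Sunday at 4 AM and the script exits early on the other Sundays.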

Backup Maintenance

Backup Verification

Create /usr/local/bin/verify-tmi-backups.sh:

#!/bin/bash
# Verify backup integrity

BACKUP_DIR="/var/backups/postgresql/tmi"
RESTORE_TEST_DB="tmi_restore_test"
LOG_FILE="/var/log/tmi/backup-verification.log"

# Find most recent backup
LATEST_BACKUP=$(ls -t $BACKUP_DIR/tmi_*.dump | head -1)

if [ -z "$LATEST_BACKUP" ]; then
    echo "$(date): No backup found" >> $LOG_FILE
    exit 1
fi

echo "$(date): Verifying backup: $LATEST_BACKUP" >> $LOG_FILE

# Drop test database if exists
psql -U postgres -c "DROP DATABASE IF EXISTS $RESTORE_TEST_DB"

# Create test database
createdb -U postgres $RESTORE_TEST_DB

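# Restore backup to test database
pg_restore -U postgres -d "$RESTORE_TEST_DB" "$LATEST_BACKUP" 2>&1 | \
    grep -v "WARNING" >> "$LOG_FILE"
# Use pg_restore's exit status; $? would report grep's
restore_status=${PIPESTATUS[0]}

if [ "$restore_status" -eq 0 ]; then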
    echo "$(date): Backup verification successful" >> $LOG_FILE

    # Basic validation queries
    table_count=$(psql -U postgres -d $RESTORE_TEST_DB -t -c "
        SELECT count(*) FROM information_schema.tables
        WHERE table_schema = 'public'")

    echo "$(date): Restored $table_count tables" >> $LOG_FILE
else
    echo "$(date): Backup verification FAILED" >> $LOG_FILE
    echo "Backup verification failed for $LATEST_BACKUP" | \
        mail -s "TMI Backup Verification Failed" [email protected]
fi

# Cleanup test database
psql -U postgres -c "DROP DATABASE $RESTORE_TEST_DB"

Schedule weekly:

# Monday at 5 AM
0 5 * * 1 /usr/local/bin/verify-tmi-backups.sh
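
Bash sets `$?` to the exit status of the last command in a pipeline, which matters when `pg_restore` output is filtered through `grep`: `grep -v` exits 1 when it prints nothing, even if the restore succeeded. The sketch below wraps the pattern in a small helper that returns the first command's status instead. The `run_filtered` name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: run a command, drop output lines matching a pattern,
# and return the command's own exit status rather than grep's.
run_filtered() {
    local pattern=$1
    shift
    "$@" 2>&1 | grep -v "$pattern"
    # PIPESTATUS[0] holds the exit status of the first pipeline command
    return "${PIPESTATUS[0]}"
}

# Example usage (variables as defined in the verification script):
# run_filtered "WARNING" pg_restore -U postgres -d "$RESTORE_TEST_DB" "$LATEST_BACKUP" >> "$LOG_FILE"
```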

Backup Cleanup

Create /usr/local/bin/cleanup-tmi-backups.sh:

#!/bin/bash
# Clean up old backups

BACKUP_DIR="/var/backups/postgresql/tmi"
RETENTION_DAYS=30
LOG_FILE="/var/log/tmi/backup-cleanup.log"

echo "$(date): Starting backup cleanup" >> $LOG_FILE

# Count backups before cleanup
before_count=$(ls $BACKUP_DIR/tmi_*.dump 2>/dev/null | wc -l)

# Delete old backups
find $BACKUP_DIR -name "tmi_*.dump" -mtime +$RETENTION_DAYS -delete

# Count backups after cleanup
after_count=$(ls $BACKUP_DIR/tmi_*.dump 2>/dev/null | wc -l)
deleted_count=$((before_count - after_count))

echo "$(date): Deleted $deleted_count backups older than $RETENTION_DAYS days" >> $LOG_FILE
echo "$(date): $after_count backups remaining" >> $LOG_FILE

# Check backup directory size
backup_size=$(du -sh $BACKUP_DIR | awk '{print $1}')
echo "$(date): Backup directory size: $backup_size" >> $LOG_FILE
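
Age-based cleanup has one failure mode worth noting: if backups stop being created, the retention rule eventually deletes every remaining copy. A count-based rule that always keeps the newest N files is a possible complement. The sketch below is one way to express it; the `prune_keep_latest` name is hypothetical, and it assumes backup file names contain no whitespace.

```shell
#!/bin/bash
# Hypothetical helper: delete all but the newest N backups in a directory.
# Assumes backup file names contain no whitespace or newlines.
prune_keep_latest() {
    local backup_dir=$1
    local keep=$2
    # ls -t sorts newest first; tail -n +K starts output at line K,
    # so everything after the first $keep entries is removed
    ls -t "$backup_dir"/tmi_*.dump 2>/dev/null \
        | tail -n +"$(( keep + 1 ))" \
        | while read -r old_backup; do
            rm -f -- "$old_backup"
        done
}

# Example usage: keep the 14 most recent backups
# prune_keep_latest /var/backups/postgresql/tmi 14
```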

Software Updates

Update Procedure

Review release notes for breaking changes
Test in staging environment
Schedule maintenance window
Create backup before update
Apply updates
Run smoke tests
Monitor for issues

TMI Server Update

#!/bin/bash
# Update TMI server

# 1. Stop service
systemctl stop tmi

# 2. Backup current version
cp /opt/tmi/tmi-server /opt/tmi/tmi-server.backup

# 3. Download new version
curl -L https://github.com/ericfitz/tmi/releases/download/v1.x.x/tmi-server-linux-amd64 \
    -o /opt/tmi/tmi-server.new

# 4. Print the checksum and compare it with the value published for the release, if one is provided
sha256sum /opt/tmi/tmi-server.new

# 5. Install new version
mv /opt/tmi/tmi-server.new /opt/tmi/tmi-server
chmod +x /opt/tmi/tmi-server

# 6. Start service (GORM AutoMigrate runs automatically on startup)
systemctl start tmi

# 7. Verify
sleep 5
curl http://localhost:8080/

# 8. Check logs
journalctl -u tmi -n 50
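
Running `sha256sum` on its own only prints a digest; something still has to compare it with a trusted value. The sketch below automates that comparison with `sha256sum -c`. It assumes an expected digest is available from a trusted source, which this guide does not confirm for TMI releases, and the `verify_sha256` name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file's SHA-256 digest matches
# the expected hex string.
verify_sha256() {
    local file=$1
    local expected=$2
    # sha256sum -c reads "<digest>  <path>" lines and exits non-zero on a
    # mismatch; --status suppresses the OK/FAILED output
    echo "$expected  $file" | sha256sum -c --status -
}

# Example usage (EXPECTED_SHA256 is a placeholder you must supply):
# verify_sha256 /opt/tmi/tmi-server.new "$EXPECTED_SHA256" || {
#     echo "checksum mismatch, aborting update"; exit 1; }
```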

Container Updates

# Pull latest images
docker compose pull

# Recreate containers with new images
docker compose up -d

# Verify
docker compose ps
docker compose logs tmi-api

Database Updates

# Update PostgreSQL (Ubuntu); --only-upgrade limits the change to this package
sudo apt-get update
sudo apt-get install --only-upgrade postgresql

# Update Redis
sudo apt-get update
sudo apt-get install --only-upgrade redis-server

# Restart services
sudo systemctl restart postgresql
sudo systemctl restart redis-server

Health Checks

Weekly Health Check Script

Create /usr/local/bin/tmi-weekly-health-check.sh:

#!/bin/bash
# Weekly comprehensive health check

REPORT_FILE="/tmp/tmi-health-$(date +%Y%m%d).txt"

echo "TMI Health Check Report - $(date)" > $REPORT_FILE
echo "======================================" >> $REPORT_FILE

# Service status
echo -e "\n## Service Status" >> $REPORT_FILE
systemctl status tmi | grep -E "Active|Memory|CPU" >> $REPORT_FILE

# Database health
echo -e "\n## Database Health" >> $REPORT_FILE
psql -h postgres-host -U tmi_user -d tmi -c "
    SELECT
        'Connections: ' || count(*) as info
    FROM pg_stat_activity
    UNION ALL
    SELECT
        'Database Size: ' || pg_size_pretty(pg_database_size('tmi'))
    " >> $REPORT_FILE

# Redis health
echo -e "\n## Redis Health" >> $REPORT_FILE
redis-cli -h redis-host info memory | grep used_memory_human >> $REPORT_FILE
redis-cli -h redis-host info stats | grep keyspace_hits >> $REPORT_FILE

# Disk space
echo -e "\n## Disk Space" >> $REPORT_FILE
df -h | grep -E "Filesystem|/var|/opt" >> $REPORT_FILE

# Certificate expiry
echo -e "\n## Certificate Status" >> $REPORT_FILE
cert_days=$(( ($(date -d "$(openssl x509 -enddate -noout -in /etc/tmi/certs/server.crt | cut -d= -f2)" +%s) - $(date +%s)) / 86400 ))
echo "Certificate expires in $cert_days days" >> $REPORT_FILE

# Recent errors (the 20 most recent matching lines)
echo -e "\n## Recent Errors (last 20)" >> $REPORT_FILE
grep -i error /var/log/tmi/tmi.log 2>/dev/null | tail -20 >> $REPORT_FILE

# Email report
cat $REPORT_FILE | mail -s "TMI Weekly Health Report" [email protected]

# Cleanup
rm $REPORT_FILE
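
The report above prints raw `df` output, which leaves the reader to spot a full disk. A threshold check can flag it explicitly. The sketch below uses the 75% goal from the monthly checklist as an example threshold; both helper names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the usage percentage (integer, no % sign) of
# the filesystem that holds the given path.
disk_usage_pct() {
    # -P forces POSIX output so each filesystem stays on a single line
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: succeed when usage is at or above the threshold.
disk_over_threshold() {
    local path=$1
    local threshold=$2
    [ "$(disk_usage_pct "$path")" -ge "$threshold" ]
}

# Example usage:
# if disk_over_threshold /var 75; then
#     echo "WARNING: /var is at $(disk_usage_pct /var)% usage"
# fi
```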

Monitoring Maintenance

Clean Up Metrics

If using Prometheus, set retention with a startup flag:

# Keep metrics for 30 days
prometheus --storage.tsdb.retention.time=30d

Dashboard Maintenance

Review and update dashboards monthly
Archive unused dashboards
Document dashboard purpose and usage
Share dashboards with team

Documentation Maintenance

Keep operational documentation current:

Update runbooks after incidents
Document new procedures
Archive outdated documentation
Review and update quarterly

Emergency Maintenance

Unplanned Maintenance

When emergency maintenance is required:

Assess urgency: Critical vs non-critical
Notify stakeholders: Users, team, management
Create incident ticket: Track the issue
Perform maintenance: Follow emergency runbook
Verify resolution: Test functionality
Post-incident review: Learn and improve

Emergency Contact List

Maintain current contact list:

# /etc/tmi/emergency-contacts.yml
contacts:
  primary_oncall:
    name: "John Doe"
    phone: "+1-555-0100"
    email: "[email protected]"

  backup_oncall:
    name: "Jane Smith"
    phone: "+1-555-0101"
    email: "[email protected]"

  database_admin:
    name: "DB Team"
    email: "[email protected]"
    slack: "#db-team"

  security_team:
    email: "[email protected]"
    phone: "+1-555-0911"

Maintenance Windows

Scheduled Maintenance

Plan maintenance windows:

Weekly: Sunday 2-4 AM (low traffic)
Monthly: First Sunday 2-6 AM
Emergency: As needed with notification

Maintenance Notification

Template for user notification:

Subject: Scheduled Maintenance - TMI Service

Dear TMI Users,

We will be performing scheduled maintenance on the TMI service:

Date: Sunday, November 16, 2025
Time: 2:00 AM - 4:00 AM EST
Duration: Approximately 2 hours

During this time, TMI will be unavailable.

Maintenance activities:
- Security updates
- Database optimization
- Performance improvements

We apologize for any inconvenience.

Best regards,
TMI Operations Team

Maintenance Checklist

Print and use for regular maintenance:

TMI Monthly Maintenance Checklist

Date: _______________  Performed by: _______________

[ ] Review service health and uptime
[ ] Check disk space (goal: <75% usage)
[ ] Review and address security alerts
[ ] Apply security patches and updates
[ ] Verify backup integrity
[ ] Check certificate expiration (>30 days remaining)
[ ] Review application logs for errors
[ ] Optimize database (vacuum, analyze, reindex if needed)
[ ] Review performance metrics and trends
[ ] Clean up old logs and backups
[ ] Test disaster recovery procedures
[ ] Update documentation
[ ] Review capacity planning needs

Notes:
_________________________________________________
_________________________________________________
_________________________________________________

Completion Date: _______________