Maintenance Tasks
This guide covers routine maintenance tasks for TMI operations, including software updates, certificate renewal, log rotation, and scheduled maintenance windows.
Overview
Regular maintenance keeps TMI operating reliably and securely. This guide covers:
- Daily, weekly, and monthly maintenance tasks
- Automated maintenance procedures
- Certificate renewal
- Log rotation and cleanup
- Software updates
- Backup verification
- Database maintenance
Maintenance Schedule
Daily Tasks
Automated:
- Health check monitoring (continuous)
- Log collection and aggregation
- Backup execution (2 AM daily)
- Metric collection and alerting
Manual (if needed):
- Review critical alerts
- Check service status
- Monitor error rates
Weekly Tasks
- Review backup integrity
- Check certificate expiration dates
- Review security alerts and logs
- Check disk space usage
- Review application logs for errors
- Monitor performance trends
Monthly Tasks
- Apply security updates
- Review and optimize database performance
- Clean up old logs and backups
- Review user access and permissions
- Test disaster recovery procedures
- Review capacity and scaling needs
- Update documentation
Quarterly Tasks
- Full security audit
- Disaster recovery test
- Performance benchmarking
- Dependency updates
- Review and update runbooks
- Team training updates
Automated Maintenance
Systemd Timers
Create automated maintenance tasks with systemd timers.
Daily Backup Timer
Create /etc/systemd/system/tmi-backup.service:
[Unit]
Description=TMI Database Backup
After=network.target postgresql.service
[Service]
Type=oneshot
User=tmi
ExecStart=/usr/local/bin/backup-tmi.sh
Create /etc/systemd/system/tmi-backup.timer:
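[Unit]
Description=TMI Daily Backup Timer
# No Requires= on the service: the timer activates tmi-backup.service by matching name,
# and Requires= would also run the backup whenever the timer unit itself is started
[Timer]
# A single calendar expression: every day at 02:00
# (two OnCalendar= lines would fire at both midnight and 02:00)
OnCalendar=*-*-* 02:00:00
Persistent=true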
[Install]
WantedBy=timers.target
Reload systemd so it sees the new unit files, then enable and start the timer:
sudo systemctl daemon-reload
sudo systemctl enable --now tmi-backup.timer
# Check timer status
systemctl list-timers tmi-backup.timer
Weekly Maintenance Timer
Create /etc/systemd/system/tmi-maintenance.service:
[Unit]
Description=TMI Weekly Maintenance
After=network.target
[Service]
Type=oneshot
User=tmi
ExecStart=/usr/local/bin/tmi-weekly-maintenance.sh
Create /etc/systemd/system/tmi-maintenance.timer:
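[Unit]
Description=TMI Weekly Maintenance Timer
# No Requires= on the service: the timer activates tmi-maintenance.service by matching name
[Timer]
# A single calendar expression: every Sunday at 03:00
# (OnCalendar=weekly would add a second run on Monday at midnight)
OnCalendar=Sun *-*-* 03:00:00
Persistent=true
[Install]
WantedBy=timers.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable --now tmi-maintenance.timer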
Cron Jobs
Cron is an alternative to systemd timers. Schedule each job with one or the other, not both:
# Edit crontab
crontab -e
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/backup-tmi.sh
# Weekly maintenance on Sunday at 3 AM
0 3 * * 0 /usr/local/bin/tmi-weekly-maintenance.sh
# Daily log rotation at midnight
0 0 * * * /usr/local/bin/rotate-tmi-logs.sh
# Certificate check daily
0 6 * * * /usr/local/bin/check-tmi-certs.sh
# Monthly cleanup on first day at 4 AM
0 4 1 * * /usr/local/bin/tmi-monthly-cleanup.sh
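Several jobs in this guide are scheduled close together, and a slow run can overlap the next one. One way to guard against that, sketched here on the assumption that flock from util-linux is installed, is to take a lock before running a job. The run_locked helper and the lock file path are hypothetical names, not part of TMI.

```shell
#!/bin/bash
# Sketch: run a command only if nobody else currently holds the lock file.
# run_locked is a hypothetical helper; flock(1) is provided by util-linux.
run_locked() {
    local lock_file=$1
    shift
    (
        # File descriptor 9 is opened on the lock file by the redirection below.
        # -n makes flock fail immediately instead of waiting for the lock.
        flock -n 9 || { echo "lock busy, skipping: $*" >&2; exit 75; }
        "$@"
    ) 9>"$lock_file"
}

# Example (hypothetical lock path):
#   run_locked /var/lock/tmi-maintenance.lock /usr/local/bin/tmi-db-maintenance.sh
```

In a crontab line the same effect is available inline, because flock accepts a lock file followed by a command, for example: flock -n /var/lock/tmi-maintenance.lock /usr/local/bin/tmi-db-maintenance.sh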
Certificate Management
Certificate Renewal
Let's Encrypt (Automatic)
When certbot is installed and its timer is enabled, Let's Encrypt certificates renew automatically:
# Check certbot timer status
systemctl status certbot.timer
# Test renewal
sudo certbot renew --dry-run
# Force renewal (only if needed; Let's Encrypt rate limits apply)
sudo certbot renew --force-renewal
# Check renewal logs
sudo tail -f /var/log/letsencrypt/letsencrypt.log
Manual Certificate Renewal
Create renewal script /usr/local/bin/renew-tmi-cert.sh:
#!/bin/bash
# TMI Certificate Renewal Script
# Run as root: it writes to the certificate directory and restarts the service.
set -euo pipefail
CERT_DIR="/etc/tmi/certs"
BACKUP_DIR="/var/backups/tmi/certs"
LOG_FILE="/var/log/tmi/cert-renewal.log"
DAYS_BEFORE_EXPIRY=30
# Function to check certificate expiration (relies on GNU date -d)
check_expiry() {
    local cert_file=$1
    local expiry_date expiry_epoch now_epoch
    expiry_date=$(openssl x509 -enddate -noout -in "$cert_file" | cut -d= -f2)
    expiry_epoch=$(date -d "$expiry_date" +%s)
    now_epoch=$(date +%s)
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}
# Check if renewal is needed
days_left=$(check_expiry "$CERT_DIR/server.crt")
if [ "$days_left" -le "$DAYS_BEFORE_EXPIRY" ]; then
    echo "$(date): Certificate expires in $days_left days, renewing..." >> "$LOG_FILE"
    # Backup old certificates (owner-only directory, since it holds private keys)
    mkdir -p "$BACKUP_DIR"
    chmod 700 "$BACKUP_DIR"
    cp -p "$CERT_DIR/server.crt" "$BACKUP_DIR/server.crt.$(date +%Y%m%d)"
    cp -p "$CERT_DIR/server.key" "$BACKUP_DIR/server.key.$(date +%Y%m%d)"
    # Generate new certificate. As written this creates a self-signed certificate;
    # replace it with the request/issue flow for your CA or provider.
    umask 077  # the new private key is never readable by other users
    openssl req -x509 -newkey rsa:4096 -nodes \
        -keyout "$CERT_DIR/server.key.new" \
        -out "$CERT_DIR/server.crt.new" \
        -days 365 \
        -subj "/CN=tmi.example.com"
    # Install new certificates
    mv "$CERT_DIR/server.key.new" "$CERT_DIR/server.key"
    mv "$CERT_DIR/server.crt.new" "$CERT_DIR/server.crt"
    # Set permissions
    chmod 600 "$CERT_DIR/server.key"
    chmod 644 "$CERT_DIR/server.crt"
    chown tmi:tmi "$CERT_DIR"/*
    # Restart TMI server
    systemctl restart tmi
    echo "$(date): Certificate renewal completed" >> "$LOG_FILE"
    # Send notification
    echo "TMI certificate renewed successfully" | \
        mail -s "Certificate Renewal Completed" [email protected]
else
    echo "$(date): Certificate valid for $days_left days, no renewal needed" >> "$LOG_FILE"
fi
Make executable and schedule:
chmod +x /usr/local/bin/renew-tmi-cert.sh
# Run daily
crontab -e
0 6 * * * /usr/local/bin/renew-tmi-cert.sh
Certificate Monitoring
Create monitoring script /usr/local/bin/check-tmi-certs.sh:
#!/bin/bash
# TMI Certificate Monitoring Script
CERT_FILE="/etc/tmi/certs/server.crt"
ALERT_DAYS=30
# Get expiration date
expiry_date=$(openssl x509 -enddate -noout -in "$CERT_FILE" | cut -d= -f2)
expiry_epoch=$(date -d "$expiry_date" +%s)
now_epoch=$(date +%s)
days_remaining=$(( (expiry_epoch - now_epoch) / 86400 ))
# Alert if expiring soon
if [ "$days_remaining" -le "$ALERT_DAYS" ]; then
echo "WARNING: TMI certificate expires in $days_remaining days" | \
mail -s "Certificate Expiration Warning" [email protected]
fi
# Log status
echo "$(date): Certificate expires in $days_remaining days" >> /var/log/tmi/cert-check.log
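The expiry arithmetic above depends on GNU date -d. A possible alternative, shown only as a sketch, is to let openssl answer the question directly with its -checkend option, which takes a number of seconds. The cert_expires_within helper name is hypothetical.

```shell
#!/bin/bash
# Sketch: succeed (exit 0) when the certificate expires within the given number of days.
# cert_expires_within is a hypothetical helper name.
cert_expires_within() {
    local cert_file=$1 days=$2
    # openssl's -checkend exits 0 if the certificate is still valid N seconds from now,
    # so the result is inverted here. An unreadable or missing file also counts as
    # "expiring", which errs on the side of raising an alert.
    ! openssl x509 -checkend "$((days * 86400))" -noout -in "$cert_file" >/dev/null 2>&1
}

# Example:
#   if cert_expires_within /etc/tmi/certs/server.crt 30; then
#       echo "certificate needs attention"
#   fi
```

This form reports only whether the threshold has been crossed, not the number of days remaining, so the date-based calculation is still needed for log lines that include a day count.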
Log Management
Log Rotation
TMI includes automatic log rotation, but you can configure additional rotation with logrotate.
Configure Logrotate
Create /etc/logrotate.d/tmi:
/var/log/tmi/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 0640 tmi tmi
sharedscripts
postrotate
# Restart or reload if needed
systemctl reload tmi > /dev/null 2>&1 || true
endscript
}
Manual Log Rotation
# Test configuration (dry run, changes nothing)
sudo logrotate -d /etc/logrotate.d/tmi
# Force log rotation
sudo logrotate -f /etc/logrotate.d/tmi
Log Cleanup Script
Create /usr/local/bin/cleanup-tmi-logs.sh:
#!/bin/bash
# TMI Log Cleanup Script
LOG_DIR="/var/log/tmi"
RETENTION_DAYS=90
# Delete rotated logs (compressed or not) older than the retention period
find "$LOG_DIR" -name "*.log.*" -mtime +"$RETENTION_DAYS" -delete
# Delete any other compressed files older than the retention period
find "$LOG_DIR" -name "*.gz" -mtime +"$RETENTION_DAYS" -delete
# Log cleanup action
echo "$(date): Cleaned up logs older than $RETENTION_DAYS days" >> "$LOG_DIR/cleanup.log"
# Check disk space (-P keeps each filesystem on one line, so column 5 is always the percentage)
disk_usage=$(df -P "$LOG_DIR" | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$disk_usage" -gt 80 ]; then
echo "WARNING: Log directory is ${disk_usage}% full" | \
mail -s "TMI Log Disk Space Warning" [email protected]
fi
Schedule:
# Monthly on first day
0 4 1 * * /usr/local/bin/cleanup-tmi-logs.sh
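Because find -delete is irreversible, it can help to preview what a cleanup would remove before scheduling it. The following is a minimal sketch of such a dry run; list_expired_logs is a hypothetical helper, and it uses -maxdepth 1 (an assumption that differs from the script above) so that subdirectories such as an archive directory are left out.

```shell
#!/bin/bash
# Sketch: list rotated logs that a cleanup with the given retention would delete.
# list_expired_logs is a hypothetical helper; it prints paths and deletes nothing.
list_expired_logs() {
    local log_dir=$1 retention_days=$2
    # -maxdepth 1 keeps the search out of subdirectories (for example an archive directory)
    find "$log_dir" -maxdepth 1 -type f \
        \( -name '*.log.*' -o -name '*.gz' \) \
        -mtime +"$retention_days" -print
}

# Example:
#   list_expired_logs /var/log/tmi 90
```

Once the listed files look right, the same expression can end in -delete instead of -print.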
Archive Logs
Archive old logs to long-term storage:
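#!/bin/bash
# Archive logs to S3/cloud storage
set -euo pipefail
ARCHIVE_DIR="/var/log/tmi/archive"
S3_BUCKET="s3://my-tmi-logs"
# Previous month; anchoring on the 15th avoids month-length surprises with "last month"
DATE=$(date -d "$(date +%Y-%m-15) -1 month" +%Y-%m)
ARCHIVE="$ARCHIVE_DIR/tmi-logs-$DATE.tar.gz"
# Capture the list of rotated logs once, so only archived files are deleted later
LOGS=(/var/log/tmi/*.log.*)
# Create archive (source files are kept until the upload succeeds)
mkdir -p "$ARCHIVE_DIR"
tar -czf "$ARCHIVE" "${LOGS[@]}"
# Upload to S3; with set -e the script stops here if the upload fails
aws s3 cp "$ARCHIVE" "$S3_BUCKET/"
# Remove the archived logs and the local archive only after a successful upload
rm -f "${LOGS[@]}"
rm "$ARCHIVE"
echo "$(date): Archived and uploaded logs for $DATE" >> /var/log/tmi/archive.log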
Database Maintenance
Vacuum and Analyze
Create /usr/local/bin/tmi-db-maintenance.sh:
#!/bin/bash
# TMI Database Maintenance Script
POSTGRES_HOST="postgres-host"
POSTGRES_USER="tmi_user"
POSTGRES_DB="tmi"
LOG_FILE="/var/log/tmi/db-maintenance.log"
echo "$(date): Starting database maintenance" >> $LOG_FILE
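# Vacuum and analyze all tables
# ON_ERROR_STOP=1 makes psql exit non-zero when a statement fails, so the check below is meaningful
psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1 << EOF
VACUUM ANALYZE;
EOF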
if [ $? -eq 0 ]; then
echo "$(date): Database maintenance completed successfully" >> $LOG_FILE
else
echo "$(date): Database maintenance failed" >> $LOG_FILE
echo "Database maintenance failed" | \
mail -s "TMI Database Maintenance Failed" [email protected]
fi
# Check database size
db_size=$(psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -t -c \
"SELECT pg_size_pretty(pg_database_size('$POSTGRES_DB'))")
echo "$(date): Database size: $db_size" >> $LOG_FILE
# Check for tables with many dead tuples (a rough indicator of bloat)
psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -c "
SELECT
schemaname,
relname,
n_dead_tup,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 10" >> $LOG_FILE
Schedule weekly:
# Sunday at 3 AM
0 3 * * 0 /usr/local/bin/tmi-db-maintenance.sh
Index Maintenance
Create /usr/local/bin/tmi-reindex.sh:
#!/bin/bash
# TMI Index Maintenance (reindex fragmented indexes)
POSTGRES_HOST="postgres-host"
POSTGRES_USER="tmi_user"
POSTGRES_DB="tmi"
LOG_FILE="/var/log/tmi/reindex.log"
echo "$(date): Starting index maintenance" >> $LOG_FILE
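# Reindex the database. CONCURRENTLY (PostgreSQL 12+) is less intrusive than a plain
# REINDEX DATABASE because it does not block writes while indexes are rebuilt.
# ON_ERROR_STOP=1 makes psql exit non-zero on failure, so the check below is meaningful
psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1 << EOF
REINDEX DATABASE CONCURRENTLY $POSTGRES_DB;
EOF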
if [ $? -eq 0 ]; then
echo "$(date): Reindex completed successfully" >> $LOG_FILE
else
echo "$(date): Reindex failed" >> $LOG_FILE
fi
Schedule monthly:
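# First Sunday at 4 AM. Cron runs a job when EITHER day-of-month OR day-of-week matches
# if both are restricted, so the weekday is tested in the command (% must be escaped in crontab)
0 4 1-7 * * [ "$(date +\%u)" -eq 7 ] && /usr/local/bin/tmi-reindex.sh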
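Standard cron treats the day-of-month and day-of-week fields as alternatives when both are restricted, so "first Sunday of the month" cannot be expressed with those two fields alone. One option, sketched below, is to schedule the job for days 1-7 and let the script decide whether today qualifies. The is_first_sunday helper is a hypothetical name.

```shell
#!/bin/bash
# Sketch: succeed only on the first Sunday of a month.
# Arguments: day of month (as printed by date +%d) and ISO weekday (date +%u, 7 = Sunday).
is_first_sunday() {
    local day_of_month=${1#0}   # strip a leading zero such as "07"
    local day_of_week=$2
    [ "$day_of_week" -eq 7 ] && [ "$day_of_month" -le 7 ]
}

# Example guard at the top of a monthly script:
#   is_first_sunday "$(date +%d)" "$(date +%u)" || exit 0
```

Keeping the test in the script rather than the crontab avoids the need to escape % characters in the cron line.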
Backup Maintenance
Backup Verification
Create /usr/local/bin/verify-tmi-backups.sh:
#!/bin/bash
# Verify backup integrity
BACKUP_DIR="/var/backups/postgresql/tmi"
RESTORE_TEST_DB="tmi_restore_test"
LOG_FILE="/var/log/tmi/backup-verification.log"
# Find most recent backup
LATEST_BACKUP=$(ls -t $BACKUP_DIR/tmi_*.dump | head -1)
if [ -z "$LATEST_BACKUP" ]; then
echo "$(date): No backup found" >> $LOG_FILE
exit 1
fi
echo "$(date): Verifying backup: $LATEST_BACKUP" >> $LOG_FILE
# Drop test database if exists
psql -U postgres -c "DROP DATABASE IF EXISTS $RESTORE_TEST_DB"
# Create test database
createdb -U postgres $RESTORE_TEST_DB
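# Restore backup to test database, logging all output.
# The status is taken from pg_restore itself; piping through grep would report grep's status instead.
pg_restore -U postgres -d "$RESTORE_TEST_DB" "$LATEST_BACKUP" >> "$LOG_FILE" 2>&1
restore_status=$?
if [ "$restore_status" -eq 0 ]; then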
echo "$(date): Backup verification successful" >> $LOG_FILE
# Basic validation queries
table_count=$(psql -U postgres -d $RESTORE_TEST_DB -t -c "
SELECT count(*) FROM information_schema.tables
WHERE table_schema = 'public'")
echo "$(date): Restored $table_count tables" >> $LOG_FILE
else
echo "$(date): Backup verification FAILED" >> $LOG_FILE
echo "Backup verification failed for $LATEST_BACKUP" | \
mail -s "TMI Backup Verification Failed" [email protected]
fi
# Cleanup test database
psql -U postgres -c "DROP DATABASE $RESTORE_TEST_DB"
Schedule weekly:
# Monday at 5 AM
0 5 * * 1 /usr/local/bin/verify-tmi-backups.sh
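After a pipeline such as pg_restore piped into grep -v "WARNING", $? holds grep's exit status rather than pg_restore's, and grep exits 1 when it has nothing left to print. If filtered output is wanted, one approach is to capture the command's status first and filter afterwards. This is a minimal sketch; run_filtered is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: run a command, drop WARNING lines from its output, and return the
# command's own exit status rather than grep's. run_filtered is a hypothetical name.
run_filtered() {
    local out_file status=0
    out_file=$(mktemp)
    # Capture stdout and stderr, and record the command's exit status
    "$@" >"$out_file" 2>&1 || status=$?
    # grep exits 1 when every line was filtered out; that is not a failure here
    grep -v "WARNING" "$out_file" || true
    rm -f "$out_file"
    return "$status"
}

# Example:
#   run_filtered pg_restore -U postgres -d tmi_restore_test backup.dump >> "$LOG_FILE"
```

In bash, the PIPESTATUS array or set -o pipefail are alternatives when the pipeline form is preferred.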
Backup Cleanup
Create /usr/local/bin/cleanup-tmi-backups.sh:
#!/bin/bash
# Clean up old backups
BACKUP_DIR="/var/backups/postgresql/tmi"
RETENTION_DAYS=30
LOG_FILE="/var/log/tmi/backup-cleanup.log"
echo "$(date): Starting backup cleanup" >> $LOG_FILE
# Count backups before cleanup
before_count=$(ls $BACKUP_DIR/tmi_*.dump 2>/dev/null | wc -l)
# Delete old backups
find $BACKUP_DIR -name "tmi_*.dump" -mtime +$RETENTION_DAYS -delete
# Count backups after cleanup
after_count=$(ls $BACKUP_DIR/tmi_*.dump 2>/dev/null | wc -l)
deleted_count=$((before_count - after_count))
echo "$(date): Deleted $deleted_count backups older than $RETENTION_DAYS days" >> $LOG_FILE
echo "$(date): $after_count backups remaining" >> $LOG_FILE
# Check backup directory size
backup_size=$(du -sh $BACKUP_DIR | awk '{print $1}')
echo "$(date): Backup directory size: $backup_size" >> $LOG_FILE
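A purely age-based cleanup removes every backup if new backups stop being produced for longer than the retention period. One possible safeguard, shown here as a sketch rather than a tested part of TMI, is to always keep the newest few files regardless of age. The prunable_backups helper and its min_keep parameter are hypothetical.

```shell
#!/bin/bash
# Sketch: print backups that are older than the retention period, but never the
# newest $min_keep files. prunable_backups is a hypothetical helper; it deletes nothing.
# Assumes backup file names contain no whitespace (true for tmi_*.dump style names).
prunable_backups() {
    local backup_dir=$1 retention_days=$2 min_keep=$3
    # List newest first, then skip the first min_keep entries
    ls -t "$backup_dir"/tmi_*.dump 2>/dev/null | tail -n +"$((min_keep + 1))" |
    while IFS= read -r backup; do
        # find prints the path only if the file is older than the retention period
        find "$backup" -mtime +"$retention_days" -print
    done
}

# Example: show what could be removed while always keeping the 7 newest backups
#   prunable_backups /var/backups/postgresql/tmi 30 7
```

The printed paths can be reviewed first and then passed to rm once the selection looks right.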
Software Updates
Update Procedure
- Review release notes for breaking changes
- Test in staging environment
- Schedule maintenance window
- Create backup before update
- Apply updates
- Run smoke tests
- Monitor for issues
TMI Server Update
#!/bin/bash
# Update TMI server
# 1. Stop service
systemctl stop tmi
# 2. Backup current version
cp /opt/tmi/tmi-server /opt/tmi/tmi-server.backup
# 3. Download new version
curl -fL https://github.com/ericfitz/tmi/releases/download/v1.x.x/tmi-server-linux-amd64 \
-o /opt/tmi/tmi-server.new
# 4. Verify checksum (this only prints the digest; compare it with the value published for the release)
sha256sum /opt/tmi/tmi-server.new
# 5. Install new version
mv /opt/tmi/tmi-server.new /opt/tmi/tmi-server
chmod +x /opt/tmi/tmi-server
# 6. Start service (GORM AutoMigrate runs automatically on startup)
systemctl start tmi
# 7. Verify
sleep 5
curl -f http://localhost:8080/
# 8. Check logs
journalctl -u tmi -n 50
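Printing a digest with sha256sum does not by itself verify anything; the value has to be compared with one obtained from a trusted source. A minimal sketch of an automated comparison follows. The verify_sha256 helper is a hypothetical name, and this guide does not state where or whether TMI publishes release checksums, so the expected digest is an input you must supply.

```shell
#!/bin/bash
# Sketch: compare a file's SHA-256 digest against an expected value.
# verify_sha256 is a hypothetical helper; the expected digest must come from a
# source you trust.
verify_sha256() {
    local file=$1 expected=$2
    # sha256sum --check reads "DIGEST  FILENAME" lines from stdin here;
    # --status suppresses output and reports the result through the exit code
    printf '%s  %s\n' "$expected" "$file" | sha256sum --check --status -
}

# Example (EXPECTED_SHA256 is a placeholder for the trusted digest):
#   verify_sha256 /opt/tmi/tmi-server.new "$EXPECTED_SHA256" || { echo "checksum mismatch" >&2; exit 1; }
```

Running the check before the new binary is moved into place keeps a bad download from replacing the working version.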
Container Updates
# Pull latest images
docker compose pull
# Recreate containers with new images
docker compose up -d
# Verify
docker compose ps
docker compose logs tmi-api
Database Updates
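# Update PostgreSQL (Ubuntu). "apt-get upgrade" is not limited to the named package,
# so --only-upgrade is used to restrict the change. The server itself is installed from a
# versioned package (postgresql-<major>); name that package if it is the one to update.
sudo apt-get update
sudo apt-get install --only-upgrade postgresql
# Update Redis
sudo apt-get install --only-upgrade redis-server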
# Restart services
sudo systemctl restart postgresql
sudo systemctl restart redis-server
Health Checks
Weekly Health Check Script
Create /usr/local/bin/tmi-weekly-health-check.sh:
#!/bin/bash
# Weekly comprehensive health check
REPORT_FILE="/tmp/tmi-health-$(date +%Y%m%d).txt"
echo "TMI Health Check Report - $(date)" > $REPORT_FILE
echo "======================================" >> $REPORT_FILE
# Service status
echo -e "\n## Service Status" >> $REPORT_FILE
systemctl status tmi | grep -E "Active|Memory|CPU" >> $REPORT_FILE
# Database health
echo -e "\n## Database Health" >> $REPORT_FILE
psql -h postgres-host -U tmi_user -d tmi -c "
SELECT
'Connections: ' || count(*) as info
FROM pg_stat_activity
UNION ALL
SELECT
'Database Size: ' || pg_size_pretty(pg_database_size('tmi'))
" >> $REPORT_FILE
# Redis health
echo -e "\n## Redis Health" >> $REPORT_FILE
redis-cli -h redis-host info memory | grep used_memory_human >> $REPORT_FILE
redis-cli -h redis-host info stats | grep keyspace_hits >> $REPORT_FILE
# Disk space
echo -e "\n## Disk Space" >> $REPORT_FILE
df -h | grep -E "Filesystem|/var|/opt" >> $REPORT_FILE
# Certificate expiry
echo -e "\n## Certificate Status" >> $REPORT_FILE
cert_days=$(( ($(date -d "$(openssl x509 -enddate -noout -in /etc/tmi/certs/server.crt | cut -d= -f2)" +%s) - $(date +%s)) / 86400 ))
echo "Certificate expires in $cert_days days" >> $REPORT_FILE
# Recent errors
echo -e "\n## Recent Errors (last 20 matches)" >> $REPORT_FILE
grep -i error /var/log/tmi/tmi.log 2>/dev/null | tail -20 >> $REPORT_FILE
# Email report
cat $REPORT_FILE | mail -s "TMI Weekly Health Report" [email protected]
# Cleanup
rm $REPORT_FILE
Monitoring Maintenance
Clean Up Metrics
If using Prometheus, set metric retention with a startup flag. Retention is not a key in the global section of prometheus.yml:
# Keep metrics for 30 days
prometheus --storage.tsdb.retention.time=30d
Dashboard Maintenance
- Review and update dashboards monthly
- Archive unused dashboards
- Document dashboard purpose and usage
- Share dashboards with team
Documentation Maintenance
Keep operational documentation current:
- Update runbooks after incidents
- Document new procedures
- Archive outdated documentation
- Review and update quarterly
Emergency Maintenance
Unplanned Maintenance
When emergency maintenance is required:
- Assess urgency: Critical vs non-critical
- Notify stakeholders: Users, team, management
- Create incident ticket: Track the issue
- Perform maintenance: Follow emergency runbook
- Verify resolution: Test functionality
- Post-incident review: Learn and improve
Emergency Contact List
Maintain current contact list:
# /etc/tmi/emergency-contacts.yml
contacts:
primary_oncall:
name: "John Doe"
phone: "+1-555-0100"
email: "[email protected]"
backup_oncall:
name: "Jane Smith"
phone: "+1-555-0101"
email: "[email protected]"
database_admin:
name: "DB Team"
email: "[email protected]"
slack: "#db-team"
security_team:
email: "[email protected]"
phone: "+1-555-0911"
Maintenance Windows
Scheduled Maintenance
Plan maintenance windows:
- Weekly: Sunday 2-4 AM (low traffic)
- Monthly: First Sunday 2-6 AM
- Emergency: As needed with notification
Maintenance Notification
Template for user notification:
Subject: Scheduled Maintenance - TMI Service
Dear TMI Users,
We will be performing scheduled maintenance on the TMI service:
Date: Sunday, November 16, 2025
Time: 2:00 AM - 4:00 AM EST
Duration: Approximately 2 hours
During this time, TMI will be unavailable.
Maintenance activities:
- Security updates
- Database optimization
- Performance improvements
We apologize for any inconvenience.
Best regards,
TMI Operations Team
Maintenance Checklist
Print and use for regular maintenance:
TMI Monthly Maintenance Checklist
Date: _______________ Performed by: _______________
[ ] Review service health and uptime
[ ] Check disk space (goal: <75% usage)
[ ] Review and address security alerts
[ ] Apply security patches and updates
[ ] Verify backup integrity
[ ] Check certificate expiration (>30 days remaining)
[ ] Review application logs for errors
[ ] Optimize database (vacuum, analyze, reindex if needed)
[ ] Review performance metrics and trends
[ ] Clean up old logs and backups
[ ] Test disaster recovery procedures
[ ] Update documentation
[ ] Review capacity planning needs
Notes:
_________________________________________________
_________________________________________________
_________________________________________________
Completion Date: _______________
Related Documentation
- Monitoring-and-Health -- Ongoing monitoring procedures
- Database-Operations -- Database maintenance details
- Security-Operations -- Security maintenance tasks
- Performance-and-Scaling -- Performance optimization