API Connect Operations Runbook
This runbook provides detailed operational procedures for the IBM API Connect platform deployed on AWS EKS. It covers day-to-day operations, monitoring, troubleshooting, and routine tasks to maintain platform health and performance.
Operations Overview
The API Connect platform requires continuous operational management to ensure optimal performance, reliability, and security. This runbook outlines standard operational procedures for the SRE team.
Operational Principles
- Consistency: Use standardized procedures across environments
- Proactive Management: Identify and address issues before they impact users
- Automation: Automate routine tasks whenever possible
- Documentation: Keep documentation current with operational changes
- Continuous Improvement: Regularly review and enhance operational procedures
Operational Responsibilities
graph TD
A[SRE Operations Team] --> B[Platform Health]
A --> C[Performance Management]
A --> D[Capacity Planning]
A --> E[Security Operations]
A --> F[User Management]
A --> G[Continuous Improvement]
B --> B1[Daily Health Checks]
B --> B2[Incident Response]
B --> B3[Backup Management]
C --> C1[Performance Monitoring]
C --> C2[Performance Tuning]
C --> C3[Bottleneck Identification]
D --> D1[Resource Monitoring]
D --> D2[Scaling Operations]
D --> D3[Capacity Forecasting]
E --> E1[Access Management]
E --> E2[Security Monitoring]
E --> E3[Vulnerability Management]
F --> F1[User Onboarding]
F --> F2[Permission Management]
F --> F3[Organization Management]
G --> G1[Process Improvement]
G --> G2[Automation Enhancement]
G --> G3[Knowledge Transfer]
Daily Operations
Morning Checklist
Task | Description | Tool/Command | Expected Result |
---|---|---|---|
Infrastructure Health | Verify all nodes and services are healthy | Dynatrace dashboard, kubectl get nodes,pods -A | All nodes Ready, all pods Running |
API Gateway Check | Verify gateway services operational | Gateway endpoint test, metrics | 2XX responses, normal traffic patterns |
Alert Review | Review any overnight alerts | Dynatrace, ServiceNow | All alerts triaged |
Backup Verification | Verify backups completed successfully | AWS Console, backup logs | Successful backup completion |
Security Events | Review security events and logs | Splunk, CloudTrail | No unexpected security events |
Service Level Check | Verify SLOs are being met | Dynatrace SLO dashboard | All SLOs within targets |
Morning Checklist Procedure
- Infrastructure Health Check:
# Check node status
kubectl get nodes
# Check for problematic pods
kubectl get pods -A | grep -v "Running\|Completed"
# Check resource utilization (top 10 pods by CPU; the extra line is the header)
kubectl top nodes
kubectl top pods -A --sort-by=cpu | head -11
- API Gateway Check:
# Test gateway health
curl -k https://api-gateway.example.com/health
# Check error rates
# Use Dynatrace query or Splunk dashboard
- Alert Review:
  - Review Dynatrace problems dashboard: https://your-tenant.dynatrace.com/problems
  - Check ServiceNow for overnight tickets: https://your-instance.service-now.com/incident_list.do?sysparm_query=assignment_group=api-connect^active=true
  - Check Microsoft Teams channel #api-connect-alerts for notifications
- Backup Verification:
# Check RDS automated backups
aws rds describe-db-snapshots --db-instance-identifier api-connect-db --snapshot-type automated --region us-east-1 --query "DBSnapshots[?SnapshotCreateTime>='$(date -d "yesterday" +%Y-%m-%d)'].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}" --output table
# Check S3 backup job status
aws s3 ls s3://api-connect-backups/$(date -d "yesterday" +%Y-%m-%d)/
- Security Events Review:
  - Check Splunk security dashboard: https://splunk.your-company.com/en-US/app/search/security_events
  - Review failed login attempts and access violations
- Service Level Check:
  - Review SLO dashboard in Dynatrace: https://your-tenant.dynatrace.com/slo-dashboard
  - Check for any services approaching an SLO breach
Evening Checklist
Task | Description | Tool/Command | Expected Result |
---|---|---|---|
End-of-Day Health Check | Verify system health at end of day | Dynatrace dashboard, kubectl | All systems healthy |
Resource Utilization Review | Check for resource constraints | Dynatrace, CloudWatch | All resources within thresholds |
Pending Ticket Review | Check for unresolved tickets | ServiceNow | All P1/P2 issues resolved |
CI/CD Pipeline Status | Verify build pipelines are healthy | Jenkins | All pipelines green |
Scheduled Jobs Check | Verify scheduled jobs completed | Jenkins, Kubernetes CronJobs | All jobs completed successfully |
On-Call Handover | Brief next on-call engineer | Microsoft Teams | Handover completed |
Evening Checklist Procedure
- End-of-Day Health Check:
# Check overall system health
kubectl get nodes,deployment,statefulset,svc -n api-connect
# Check for any recent events
kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp | tail -20
- Resource Utilization Review:
# Check highest CPU/memory consumers
kubectl top pods -n api-connect
# List per-container requests and limits to compare against the usage above
kubectl get pods -n api-connect -o json | jq '.items[] | {name: .metadata.name, containers: [.spec.containers[] | {name: .name, requests: .resources.requests, limits: .resources.limits}]}'
- Pending Ticket Review:
  - Check ServiceNow for unresolved tickets: https://your-instance.service-now.com/incident_list.do?sysparm_query=assignment_group=api-connect^active=true
  - Escalate or reassign any tickets requiring attention
- CI/CD Pipeline Status:
  - Check Jenkins build status: https://jenkins.your-company.com/job/api-connect/
  - Investigate and resolve any failed builds
- Scheduled Jobs Check:
# Check CronJob executions
kubectl get cronjobs -n api-connect
kubectl get jobs -n api-connect
# Check for failed jobs (jobs that report a Failed condition)
kubectl get jobs -n api-connect -o json | jq -r '.items[] | select(any(.status.conditions[]?; .type == "Failed" and .status == "True")) | .metadata.name'
- On-Call Handover:
  - Update on-call handover document
  - Brief next on-call engineer about any ongoing issues
  - Ensure they have access to all necessary resources
Daily Health Check Automation
#!/bin/bash
# Daily API Connect Health Check
echo "===== API Connect Health Check: $(date) ====="
echo -e "\n>> Checking node status..."
kubectl get nodes
echo -e "\n>> Checking non-running pods..."
kubectl get pods -A | grep -v "Running\|Completed"
echo -e "\n>> Checking API Connect deployments..."
kubectl get deployment -n api-connect
echo -e "\n>> Checking resource utilization..."
echo "Top 10 CPU-consuming pods:"
# With -A the first column is the namespace, so let kubectl sort instead of sort -k
kubectl top pods -A --sort-by=cpu | head -11
echo "Top 10 memory-consuming pods:"
kubectl top pods -A --sort-by=memory | head -11
echo -e "\n>> Checking recent events..."
kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp | tail -20
echo -e "\n>> Checking service endpoints..."
for svc in gateway-service manager-service portal-service analytics-service; do
  echo "Testing $svc..."
  # No -it here: allocating a TTY fails when the script runs non-interactively (cron, CI)
  kubectl exec debug-pod -n api-connect -- curl -k -m 5 "https://$svc:9443/health" || echo "Failed to connect to $svc"
done
echo -e "\n>> Checking database status..."
aws rds describe-db-instances --db-instance-identifier api-connect-db --query "DBInstances[*].{Status:DBInstanceStatus,Class:DBInstanceClass,Storage:AllocatedStorage,IOPS:Iops}" --output table --region us-east-1
# Connection counts are not part of describe-db-instances output; use the CloudWatch DatabaseConnections metric for those
echo -e "\n>> Checking recent backups..."
aws rds describe-db-snapshots --db-instance-identifier api-connect-db --snapshot-type automated --region us-east-1 --query "DBSnapshots[?SnapshotCreateTime>='$(date -d "yesterday" +%Y-%m-%d)'].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}" --output table
echo -e "\n===== Health Check Complete ====="
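The script above prints raw output for a human to read. If it runs unattended, a small helper can turn the pod listing into a number that is easy to alert on. The following is a minimal sketch, not part of the original script: the function name `count_unhealthy_pods` is hypothetical, and it assumes the standard column order of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods -A --no-headers` output on stdin, where the
# fourth whitespace-separated column is STATUS.
count_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Example usage against a live cluster (not executed here):
#   unhealthy=$(kubectl get pods -A --no-headers | count_unhealthy_pods)
#   [ "$unhealthy" -eq 0 ] || echo "WARNING: $unhealthy pods need attention"
```

A count of zero matches the "all pods Running" expectation in the morning checklist; any other value is a prompt to inspect the listing manually.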
Monitoring and Observability
Dashboard Overview
Dashboard | Purpose | URL | Primary Users |
---|---|---|---|
API Connect Overview | Platform-wide health and status | Dynatrace Dashboard | All SRE Team |
Gateway Performance | API Gateway metrics and performance | Dynatrace Dashboard | SRE Team |
Management Console | Management service health | Dynatrace Dashboard | SRE Team |
Portal Health | Developer portal status | Dynatrace Dashboard | SRE Team |
SLO Tracking | Service level objective monitoring | Dynatrace Dashboard | SRE/Management |
API Usage Analytics | Business metrics for API usage | Dynatrace Dashboard | Product Team |
Security Dashboard | Security events and compliance | Splunk Dashboard | Security Team |
Infrastructure Health | AWS/EKS infrastructure metrics | CloudWatch Dashboard | SRE Team |
Key Metrics to Monitor
Metric | Description | Warning Threshold | Critical Threshold | Data Source |
---|---|---|---|---|
Gateway Success Rate | % of successful API calls | <99.5% | <99% | Dynatrace |
Gateway Response Time (p95) | 95th percentile response time | >300ms | >500ms | Dynatrace |
Gateway Throughput | Requests per minute | <500 req/min for critical APIs | <100 req/min for critical APIs | Dynatrace |
Error Rate | % of 5XX responses | >0.1% | >1% | Dynatrace |
Node CPU Utilization | CPU usage of worker nodes | >70% | >85% | CloudWatch |
Node Memory Utilization | Memory usage of worker nodes | >70% | >85% | CloudWatch |
Pod CPU Utilization | CPU usage of pods | >70% of limit | >85% of limit | Dynatrace |
Pod Memory Utilization | Memory usage of pods | >70% of limit | >85% of limit | Dynatrace |
Database CPU Utilization | RDS CPU usage | >70% | >85% | CloudWatch |
Database Connections | Number of active DB connections | >70% of max | >85% of max | CloudWatch |
Database Storage | RDS storage utilization | >70% | >85% | CloudWatch |
Certificate Expiry | Days until certificate expiration | <30 days | <7 days | Custom |
Auth Failures | Authentication failures per minute | >5/min | >20/min | Splunk |
Rate Limit Violations | Rate limit hits per minute | >100/min | >1000/min | Splunk |
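The thresholds in this table fall into two groups: metrics where a higher value is worse (utilization, response time, error rate) and metrics where a lower value is worse (success rate, throughput, days until certificate expiry). As a rough sketch of how a script could apply them, the hypothetical helpers below classify a single reading. The function names are illustrative and not part of any tool mentioned in this runbook.

```shell
#!/usr/bin/env bash
# classify_high VALUE WARN CRIT
# For metrics where higher is worse, e.g. CPU utilization (warn >70, crit >85).
classify_high() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "CRITICAL"
    else if (v + 0 > w + 0) print "WARNING"
    else                    print "OK"
  }'
}

# classify_low VALUE WARN CRIT
# For metrics where lower is worse, e.g. gateway success rate (warn <99.5, crit <99).
classify_low() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 < c + 0)      print "CRITICAL"
    else if (v + 0 < w + 0) print "WARNING"
    else                    print "OK"
  }'
}

# Example usage:
#   classify_high 72 70 85      # node CPU at 72%
#   classify_low 99.2 99.5 99   # gateway success rate at 99.2%
```

The comparisons are strict, so a reading exactly equal to a threshold is reported as the less severe level, matching the `>` and `<` notation used in the table.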
Alert Configuration
Dynatrace Alert Configuration
Alert profiles should be configured in Dynatrace to ensure proper notification routing:
# Critical Infrastructure Alerts
criticalAlerts:
  name: "API Connect Critical Alerts"
  severity: AVAILABILITY, PERFORMANCE, ERROR
  sendToServiceNow: true
  sendToTeams: true
  sendToPagerDuty: true
  services:
    - "API Gateway"
    - "Management Service"
    - "Portal Service"
    - "Authentication Service"
# Performance Alerts
performanceAlerts:
  name: "API Connect Performance Alerts"
  severity: PERFORMANCE, RESOURCE_CONTENTION
  sendToServiceNow: true
  sendToTeams: true
  sendToPagerDuty: false
  services:
    - "API Gateway"
    - "Management Service"
    - "Portal Service"
    - "Analytics Service"
# Non-critical Alerts
nonCriticalAlerts:
  name: "API Connect Non-critical Alerts"
  severity: INFO, CUSTOM_ALERT
  sendToServiceNow: true
  sendToTeams: true
  sendToPagerDuty: false
  services:
    - "*"
Recommended Splunk Alerts
Alert | Condition | Priority | Notification |
---|---|---|---|
Gateway Error Spike | Error rate > 1% for 5 minutes | High | ServiceNow, Teams |
Authentication Failures | Auth failures > 20/min for 5 minutes | High | ServiceNow, Teams |
Database Connection Exhaustion | DB connections > 85% for 5 minutes | Critical | ServiceNow, Teams |
Certificate Expiration | Certificate expires in < 7 days | High | ServiceNow, Teams |
Security Violations | Pattern matching security violations | Critical | ServiceNow, Teams, Security Team |
API Usage Anomaly | Dramatic change in API usage patterns | Medium | ServiceNow, Teams |
Operational Monitoring Queries
Dynatrace Queries
# API Gateway Error Rate
metricSelector=builtin:service.errors.total.rate:filter(eq(service.name,API Gateway)):splitBy():sum:auto:sort(value(auto,descending))
# Response Time Trends
metricSelector=builtin:service.response.time:filter(eq(service.name,API Gateway)):splitBy(service.name):avg:auto:sort(value(auto,descending))
# Resource Utilization
metricSelector=builtin:containers.cpu.usage:filter(eq(kubernetes.pod.name,api-connect-gateway)):splitBy(kubernetes.pod.name):avg:auto:sort(value(auto,descending))
Splunk Queries
# Error Rate by API
index=api_connect sourcetype=gateway-logs status>=500 | stats count by api_name, status | sort -count
# Authentication Failures
index=api_connect sourcetype=gateway-logs "authentication failed" OR "unauthorized" | stats count by client_id, error_message
# Slow API Calls
index=api_connect sourcetype=gateway-access-logs | stats avg(response_time) as avg_resp, p95(response_time) as p95_resp by api_path | sort -p95_resp | where p95_resp > 300
# Rate Limiting Events
index=api_connect sourcetype=gateway-logs "rate limit exceeded" | timechart count by client_id
# Security Events
index=api_connect sourcetype=gateway-logs ("injection" OR "XSS" OR "attack" OR "exploit") | stats count by src_ip, api_path, event_type
Capacity Management
Resource Monitoring
Resource utilization should be monitored to ensure adequate capacity and prevent performance issues.
Key Capacity Metrics
Resource | Metric | Warning Threshold | Critical Threshold | Scaling Recommendation |
---|---|---|---|---|
Gateway CPU | CPU Utilization | 70% | 85% | Scale horizontally, add more pods |
Gateway Memory | Memory Utilization | 70% | 85% | Scale horizontally, add more pods |
Management CPU | CPU Utilization | 70% | 85% | Scale vertically, increase pod resources |
Management Memory | Memory Utilization | 70% | 85% | Scale vertically, increase pod resources |
Portal CPU | CPU Utilization | 70% | 85% | Scale horizontally, add more pods |
Portal Memory | Memory Utilization | 70% | 85% | Scale horizontally, add more pods |
Analytics CPU | CPU Utilization | 70% | 85% | Scale horizontally, add more pods |
Analytics Memory | Memory Utilization | 70% | 85% | Scale horizontally, add more pods |
EKS Node CPU | CPU Utilization | 70% | 85% | Add more nodes to cluster |
EKS Node Memory | Memory Utilization | 70% | 85% | Add more nodes to cluster |
Database CPU | CPU Utilization | 70% | 85% | Scale up RDS instance |
Database Storage | Storage Utilization | 70% | 85% | Increase storage allocation |
Database IOPS | IOPS Utilization | 70% | 85% | Increase provisioned IOPS |
Resource Utilization Monitoring
# Node-level resource monitoring
kubectl top nodes
# Pod-level resource monitoring
kubectl top pods -n api-connect
# RDS monitoring
aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name CPUUtilization --dimensions Name=DBInstanceIdentifier,Value=api-connect-db --start-time $(date -u -d "1 hour ago" +%Y-%m-%dT%H:%M:%SZ) --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) --period 300 --statistics Average --region us-east-1
Scaling Operations
Horizontal Pod Scaling
- Manual Scaling:
# Scale Gateway deployment
kubectl scale deployment gateway-deployment -n api-connect --replicas=6
# Scale Management deployment
kubectl scale deployment manager-deployment -n api-connect --replicas=3
# Scale Portal deployment
kubectl scale deployment portal-deployment -n api-connect --replicas=3
# Scale Analytics deployment
kubectl scale deployment analytics-deployment -n api-connect --replicas=3
- Horizontal Pod Autoscaler (HPA) Configuration:
# Check current HPA configuration
kubectl get hpa -n api-connect
# Configure HPA for Gateway
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway-hpa
  namespace: api-connect
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gateway-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
EOF
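To reason about what the gateway HPA will do at a given load, it helps to work through the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current utilization / target utilization), bounded by `minReplicas` and `maxReplicas`. The sketch below is a simplified calculator with a hypothetical function name. It ignores the controller's tolerance band and stabilization windows, so treat the result as an estimate.

```shell
#!/usr/bin/env bash
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Simplified HPA estimate: ceil(current * util / target), clamped to [MIN, MAX].
hpa_desired_replicas() {
  awk -v cur="$1" -v util="$2" -v target="$3" -v min="$4" -v max="$5" 'BEGIN {
    raw = cur * util / target
    d = int(raw)
    if (d < raw) d++               # round up to the next whole replica
    if (d < min + 0) d = min + 0   # never go below minReplicas
    if (d > max + 0) d = max + 0   # never exceed maxReplicas
    print d
  }'
}

# Example usage with the gateway-hpa settings (target 70%, min 3, max 10):
#   hpa_desired_replicas 3 90 70 3 10
```

With 3 replicas averaging 90% CPU against the 70% target, the raw value is about 3.86, so the estimate is 4 replicas.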
Vertical Pod Scaling
For components that don't scale horizontally efficiently:
- Update Resource Requirements:
# Update Management deployment resources
kubectl patch deployment manager-deployment -n api-connect -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "manager",
            "resources": {
              "requests": { "cpu": "1000m", "memory": "2Gi" },
              "limits": { "cpu": "2000m", "memory": "4Gi" }
            }
          }
        ]
      }
    }
  }
}'
Node Scaling
- Manual Node Scaling:
# Update node group desired capacity
aws eks update-nodegroup-config --cluster-name api-connect-cluster --nodegroup-name api-connect-nodes --scaling-config desiredSize=5,minSize=3,maxSize=10 --region us-east-1
- Cluster Autoscaler Configuration:
  - Ensure Cluster Autoscaler is deployed
  - Configure appropriate min/max nodes
  - Monitor scale-up and scale-down events
Database Scaling
- RDS Instance Scaling:
# Scale up RDS instance
# Note: changing the instance class with --apply-immediately interrupts the database while the change is applied; schedule accordingly
aws rds modify-db-instance --db-instance-identifier api-connect-db --db-instance-class db.m5.2xlarge --apply-immediately --region us-east-1
- Storage Scaling:
# Increase allocated storage (value in GiB)
aws rds modify-db-instance --db-instance-identifier api-connect-db --allocated-storage 200 --apply-immediately --region us-east-1
- Read Scaling:
# Create read replica
aws rds create-db-instance-read-replica --db-instance-identifier api-connect-db-replica --source-db-instance-identifier api-connect-db --db-instance-class db.m5.xlarge --region us-east-1
Capacity Planning
Long-term capacity planning should be conducted regularly to anticipate growth and prevent resource constraints.
Capacity Planning Process
-
Data Collection:
- Collect 3-6 months of historical usage data
- Identify growth trends in API calls, users, and data storage
- Document seasonal variations and peak usage patterns
-
Growth Prediction:
- Calculate month-over-month growth rates
- Project growth for the next 6-12 months
- Consider business initiatives that may impact growth
-
Resource Estimation:
- Calculate projected resource needs based on growth
- Include headroom for unexpected spikes (typically 20%)
- Consider different scaling options (horizontal vs. vertical)
-
Capacity Plan Development:
- Document current capacity
- Define capacity upgrade schedule
- Include budget estimates
- Outline implementation approach
-
Review and Approval:
- Review with stakeholders
- Adjust based on feedback
- Obtain necessary approvals
-
Implementation:
- Execute according to schedule
- Monitor for effectiveness
- Adjust as needed
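The growth prediction and resource estimation steps reduce to a small calculation: compound the current usage by the month-over-month growth rate, then add headroom. The sketch below illustrates that arithmetic with a hypothetical helper, `project_capacity`. It assumes a constant monthly growth rate, which is a simplification; seasonal variation and planned business initiatives still need to be layered on by hand.

```shell
#!/usr/bin/env bash
# project_capacity CURRENT MONTHLY_GROWTH_PCT MONTHS HEADROOM_PCT
# Projects compound growth, then adds headroom; prints a rounded whole number.
project_capacity() {
  awk -v cur="$1" -v rate="$2" -v months="$3" -v headroom="$4" 'BEGIN {
    projected = cur * (1 + rate / 100) ^ months   # compound monthly growth
    printf "%.0f\n", projected * (1 + headroom / 100)
  }'
}

# Example usage: 1000 req/min today, 10% monthly growth, 6 months out, 20% headroom
#   project_capacity 1000 10 6 20
```

For the example inputs, 1000 grows to roughly 1772 after six months, and the 20% headroom raises the planning figure to 2126.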
Performance Management
Performance Monitoring
Regular performance monitoring is essential to ensure optimal operation of the API Connect platform.
Key Performance Indicators
KPI | Description | Target | Measurement Method |
---|---|---|---|
API Response Time | Time to process API requests | <200ms (p95) | Dynatrace service monitoring |
API Gateway Throughput | Number of requests processed per second | >500 req/sec per node | Dynatrace service monitoring |
Management Console Response Time | UI responsiveness | <1s for page loads | Dynatrace synthetic monitoring |
Database Query Performance | Database query execution time | <100ms for 95% of queries | RDS Performance Insights |
Kubernetes API Responsiveness | Control plane performance | <300ms for API operations | Kubernetes metrics |
End-to-End Transaction Time | Complete business flow timing | Varies by transaction | Custom synthetic tests |
Performance Monitoring Tools and Techniques
-
Dynatrace Monitoring:
- Service-level monitoring
- Synthetic user journeys
- Performance hotspot analysis
- User experience monitoring
-
Database Performance Monitoring:
- RDS Performance Insights
- Slow query logging
- Connection pool monitoring
- Index performance analysis
-
Kubernetes Performance Monitoring:
- Control plane metrics
- etcd performance
- API server request latency
- kubelet performance
-
Custom Performance Tests:
- Regular JMeter load tests
- Synthetic API calls
- End-to-end business flow tests
- Performance regression testing
Performance Tuning
Gateway Performance Tuning
- Connection Pooling:
# Update gateway configuration for connection pooling
kubectl edit configmap gateway-config -n api-connect
# Adjust settings:
# - maxConnections
# - connectionTimeout
# - connectionIdleTimeout
- Thread Pool Configuration:
# Adjust thread pool settings in gateway configuration
kubectl edit configmap gateway-config -n api-connect
# Adjust settings:
# - threadPoolSize
# - queueSize
# - maxConcurrency
- Cache Optimization:
# Update cache settings
kubectl edit configmap gateway-config -n api-connect
# Adjust settings:
# - cacheEnabled: true
# - cacheSize
# - cacheTTL
- Resource Configuration:
# Optimize gateway resource allocation
kubectl edit deployment gateway-deployment -n api-connect
# Adjust resource requests/limits based on observed usage
Database Performance Tuning
-
Query Optimization:
- Identify slow queries from RDS Performance Insights
- Review and optimize query patterns
- Add appropriate indexes
-
Connection Pool Management:
# Adjust connection pool settings
kubectl edit configmap database-config -n api-connect
# Adjust settings:
# - maxPoolSize
# - minPoolSize
# - maxIdleTime
-
RDS Instance Optimization:
- Select appropriate instance type
- Configure optimized storage (IOPS)
- Enable performance insights
- Configure appropriate parameter groups
-
Database Maintenance:
- Regular VACUUM ANALYZE
- Index rebuilding
- Statistics updates
- See Database Maintenance
Performance Testing
Regular performance testing helps identify issues before they impact users.
Performance Test Approach
-
Baseline Testing:
- Establish performance baselines for key operations
- Document baseline metrics
- Set performance targets
-
Load Testing:
- Simulate expected user load
- Measure response times under load
- Identify bottlenecks
-
Stress Testing:
- Test system at 2-3x expected maximum load
- Identify breaking points
- Verify graceful degradation
-
Endurance Testing:
- Test system under sustained load
- Identify memory leaks or resource exhaustion
- Verify long-term stability
JMeter Test Execution
# Run baseline performance test
jmeter -n -t tests/baseline-test.jmx -l results/baseline-$(date +%Y%m%d).jtl -j logs/baseline-$(date +%Y%m%d).log
# Run load test
jmeter -n -t tests/load-test.jmx -l results/load-$(date +%Y%m%d).jtl -j logs/load-$(date +%Y%m%d).log
# Generate HTML report
jmeter -g results/load-$(date +%Y%m%d).jtl -o reports/load-$(date +%Y%m%d)
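The KPI table sets a p95 response-time target of under 200ms, so it is useful to pull that single number out of a results file without opening the HTML report. The sketch below is one way to do it. It assumes the `.jtl` file is in JMeter's CSV format with a header row and the elapsed time in milliseconds in the second column. The function name `jtl_p95` is hypothetical, and it uses the nearest-rank percentile method, which may differ slightly from the interpolation JMeter's own report applies.

```shell
#!/usr/bin/env bash
# jtl_p95 RESULTS_FILE
# Prints the nearest-rank 95th percentile of the elapsed column (column 2, ms)
# from a CSV-format JTL file that starts with a header row.
jtl_p95() {
  tail -n +2 "$1" | cut -d, -f2 | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                  # no samples: print nothing, fail
      idx = int((NR * 95 + 99) / 100)      # ceil(0.95 * NR) using integers
      print v[idx]
    }'
}

# Example usage:
#   p95=$(jtl_p95 "results/load-$(date +%Y%m%d).jtl")
#   [ "$p95" -lt 200 ] || echo "p95 of ${p95}ms exceeds the 200ms target"
```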
Backup and Recovery Operations
Backup Verification
Regular verification of backups is essential to ensure recoverability.
Database Backup Verification
# Check RDS automated backups
aws rds describe-db-snapshots --db-instance-identifier api-connect-db --snapshot-type automated --region us-east-1 --query "DBSnapshots[?SnapshotCreateTime>='$(date -d "yesterday" +%Y-%m-%d)'].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}" --output table
# Verify backup completeness
aws rds describe-db-snapshots --db-snapshot-identifier [snapshot-id] --region us-east-1 --query "DBSnapshots[*].{Storage:SnapshotType,Encrypted:Encrypted,Status:Status,Progress:PercentProgress}" --output table
Configuration Backup Verification
# Check API Connect configuration backups
aws s3 ls s3://api-connect-backups/config/$(date -d "yesterday" +%Y-%m-%d)/
# Verify backup file integrity
aws s3api head-object --bucket api-connect-backups --key config/$(date -d "yesterday" +%Y-%m-%d)/config-backup.zip
Kubernetes Resource Backup Verification
# Check Kubernetes resource backups
aws s3 ls s3://api-connect-backups/kubernetes/$(date -d "yesterday" +%Y-%m-%d)/
# Verify backup contents
aws s3 cp s3://api-connect-backups/kubernetes/$(date -d "yesterday" +%Y-%m-%d)/resources.yaml /tmp/
kubectl apply --dry-run=client -f /tmp/resources.yaml
Periodic Recovery Testing
Recovery testing should be performed regularly to validate backup effectiveness.
Monthly Recovery Test Procedure
- Database Recovery Test:
# Create test instance from snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier api-connect-test-restore \
  --db-snapshot-identifier [snapshot-id] \
  --db-instance-class db.t3.medium \
  --no-multi-az \
  --region us-east-1
# Wait until the restored instance is available before connecting
aws rds wait db-instance-available --db-instance-identifier api-connect-test-restore --region us-east-1
# Verify database access (replace the placeholder password; never commit real credentials)
PGPASSWORD=password psql -h api-connect-test-restore.abcdefghijk.us-east-1.rds.amazonaws.com -U apic_admin -d apic_db -c "SELECT count(*) FROM users;"
# Clean up test instance
aws rds delete-db-instance --db-instance-identifier api-connect-test-restore --skip-final-snapshot --region us-east-1
- Configuration Recovery Test:
# Set up test environment
kubectl create namespace api-connect-recovery-test
# Apply recovered resources
kubectl apply -f recovered-resources.yaml -n api-connect-recovery-test
# Verify resources created successfully
kubectl get all -n api-connect-recovery-test
# Clean up
kubectl delete namespace api-connect-recovery-test
Point-In-Time Recovery Procedure
For database recovery to a specific point in time:
# Restore database to point in time
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier api-connect-db \
--target-db-instance-identifier api-connect-recovered \
--restore-time $(date -u -d "2023-06-15 14:30:00" +%Y-%m-%dT%H:%M:%SZ) \
--db-instance-class db.m5.large \
--region us-east-1
# Update database connection in Kubernetes
# 1. Update configmap with new endpoint
kubectl edit configmap database-config -n api-connect
# 2. Restart pods to pick up new configuration
kubectl rollout restart deployment -n api-connect
For detailed recovery procedures, see Disaster Recovery.
User Management Operations
User Onboarding
API Connect Administrator Onboarding
-
Request Processing:
- Validate user request and approvals
- Determine appropriate role assignment
- Platform Role Assignment:
# Create platform role binding (replace [username] and [user-email] before applying)
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-connect-admin-binding-[username]
  namespace: api-connect
subjects:
- kind: User
  name: [user-email]
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: api-connect-admin
  apiGroup: rbac.authorization.k8s.io
EOF
-
API Manager Role Assignment:
- Log in to API Manager UI
- Navigate to User Management
- Add user with appropriate role
- Configure LDAP/SAML mapping if applicable
-
Access Verification:
- Verify user can log in to API Manager
- Confirm appropriate permissions
- Document access granted
API Developer Onboarding
-
Process Developer Request:
- Validate user and organization
- Determine appropriate access level
-
Create Developer Account:
- Use API Manager UI to create account
- Assign to appropriate organization
- Set correct permission level
-
API Product Access:
- Grant access to required API products
- Configure rate limits and quotas
- Set subscription approval requirements
-
Notification:
- Send welcome email with access instructions
- Provide documentation links
- Specify support channels
User Offboarding
Administrator Offboarding
- Revoke Platform Access:
# Remove platform role binding
kubectl delete rolebinding api-connect-admin-binding-[username] -n api-connect
-
Revoke API Manager Access:
- Log in to API Manager UI
- Navigate to User Management
- Deactivate user account
- Remove role assignments
-
Audit Access Removal:
- Verify user can no longer access systems
- Document access removal
- Update access records
Developer Offboarding
-
Revoke Developer Portal Access:
- Deactivate user in Developer Portal
- Revoke API subscriptions if necessary
- Transfer application ownership if needed
-
Client Credential Management:
- Invalidate client secrets
- Revoke OAuth tokens
- Deactivate API keys
-
Notification:
- Notify relevant stakeholders
- Document offboarding completion
User Access Review
Regular access reviews should be conducted to maintain security.
Quarterly Access Review Process
- Generate Access Reports:
# Get platform role bindings
kubectl get rolebinding -n api-connect > access-report-platform.txt
# Export API Manager users (using UI or APIs)
-
Review Current Access:
- Compare against authorized user list
- Identify discrepancies
- Document findings
-
Remediate Issues:
- Remove unauthorized access
- Update access documentation
- Adjust processes if needed
-
Documentation:
- Document review completion
- Update access records
- Report to security team
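The "compare against authorized user list" step can be partly automated once both lists exist as plain text files with one subject per line. The sketch below shows one possible approach. The function name `unauthorized_subjects` and the file names in the usage comment are hypothetical, and the authorized list is assumed to be maintained outside the cluster.

```shell
#!/usr/bin/env bash
# unauthorized_subjects CURRENT_FILE AUTHORIZED_FILE
# Prints every line of CURRENT_FILE that has no exact match in AUTHORIZED_FILE.
unauthorized_subjects() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  grep -vxFf "$2" "$1" | sort -u
}

# Example usage (not executed here):
#   kubectl get rolebinding -n api-connect -o json \
#     | jq -r '.items[].subjects[]? | select(.kind == "User") | .name' > current-users.txt
#   unauthorized_subjects current-users.txt authorized-users.txt
```

Any names printed are candidates for the remediation step; an empty result means every bound user appears on the authorized list.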
Common Operational Tasks
API Management
Publishing a New API
-
API Creation:
- Develop API in API Manager
- Configure security and policies
- Set rate limits and quotas
-
Testing:
- Test API functionality
- Verify security controls
- Check performance characteristics
-
Publication:
- Publish to appropriate catalog
- Configure visibility settings
- Set subscription requirements
-
Verification:
- Verify API appears in Developer Portal
- Test API through gateway
- Check analytics capture
Managing API Lifecycle
-
API Versioning:
- Create new version in API Manager
- Update documentation
- Deprecate old version if needed
-
API Retirement:
- Notify subscribed users
- Set deprecation period
- Configure sunset header
- Eventually retire API
-
API Product Management:
- Create product bundles
- Configure plans and pricing
- Manage visibility
OAuth Management
OAuth Client Creation
-
Client Registration:
- Register new OAuth client
- Generate client ID and secret
- Configure redirect URIs
- Set appropriate scopes
-
Client Testing:
- Test token acquisition
- Verify scope enforcement
- Validate token usage
Token Management
- Token Inspection:
# Check token information
curl -X GET "https://api-connect-oauth.example.com/api/token/[token-id]" \
  -H "Authorization: Bearer [admin-token]"
- Token Revocation:
# Revoke specific token
curl -X POST "https://api-connect-oauth.example.com/api/token/revoke" \
  -H "Content-Type: application/json" \
  -d '{"token": "[token-to-revoke]", "client_id": "[client-id]"}'
- Client Secret Reset:
# Reset client secret
curl -X POST "https://api-connect-oauth.example.com/api/clients/[client-id]/reset-secret" \
  -H "Authorization: Bearer [admin-token]"
Rate Limit Management
Modifying Rate Limits
- Update Global Rate Limits:
# Edit rate limit configuration
kubectl edit configmap rate-limit-config -n api-connect
# Update global rate limits
# Apply changes
kubectl rollout restart deployment gateway-deployment -n api-connect
-
Update API-Specific Rate Limits:
- Log in to API Manager UI
- Navigate to API settings
- Update rate limit policy
- Republish API
-
Update Plan Rate Limits:
- Log in to API Manager UI
- Navigate to Plans
- Update rate limits for plan
- Update products using the plan
Monitoring Rate Limit Enforcement
# Check rate limit events in logs
kubectl logs -n api-connect -l app=gateway | grep -i "rate limit" | tail -100
# Query Splunk for rate limit events
index=api_connect sourcetype=gateway-logs "rate limit exceeded" | timechart count by client_id
Certificate Management
Certificate Expiration Monitoring
# Check ACM certificate expiration
aws acm list-certificates --region us-east-1 | jq -r '.CertificateSummaryList[].CertificateArn' | while read -r arn; do
  aws acm describe-certificate --certificate-arn "$arn" --region us-east-1 | jq -r '.Certificate | "\(.DomainName) expires on \(.NotAfter)"'
done
# Check Kubernetes TLS secrets
kubectl get secrets -n api-connect | grep tls | awk '{print $1}' | while read -r secret; do
  kubectl get secret "$secret" -n api-connect -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
done
For detailed certificate management procedures, see Certificate Management.
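The commands above print expiry dates, while the monitoring thresholds are expressed in days remaining (warning under 30 days, critical under 7 days). The sketch below bridges the two. Both function names are hypothetical, and GNU `date` is assumed, as elsewhere in this runbook.

```shell
#!/usr/bin/env bash
# days_until EXPIRY_DATE [REFERENCE_DATE]
# Whole days from REFERENCE_DATE (default: now) to EXPIRY_DATE. Requires GNU date.
days_until() {
  local expiry_s now_s
  expiry_s=$(date -u -d "$1" +%s) || return 1
  now_s=$(date -u -d "${2:-now}" +%s) || return 1
  echo $(( (expiry_s - now_s) / 86400 ))
}

# cert_severity DAYS_REMAINING
# Applies the runbook thresholds: critical under 7 days, warning under 30 days.
cert_severity() {
  if [ "$1" -lt 7 ]; then
    echo "CRITICAL"
  elif [ "$1" -lt 30 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Example usage with the openssl output format "notAfter=<date>" (not executed here):
#   enddate=$(openssl x509 -noout -enddate -in cert.pem | cut -d= -f2)
#   cert_severity "$(days_until "$enddate")"
```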
Environment-Specific Operations
Development Environment
- Purpose: Active development and testing
- User Access: Broader access for development team
- Stability Requirements: Lower than production
- Maintenance Windows: More flexible, less formality
- Data Requirements: Test data, may be refreshed regularly
Operational Focus:
- Support development activities
- Quick resolution of blocking issues
- Regular environment refreshes
- Flexible configuration changes
Special Operations:
# Refresh development environment
kubectl delete pods --all -n api-connect
# Reset development database
# (SQL string literals need single quotes; PostgreSQL treats double-quoted text as an identifier)
kubectl exec -it postgres-util -n api-connect -- bash -c "PGPASSWORD=password psql -h api-connect-dev-db.cluster-xyz.us-east-1.rds.amazonaws.com -U apic_admin -d apic_db -c \"DELETE FROM users WHERE username NOT LIKE 'admin%';\""
# Install development tools
kubectl apply -f dev-tools.yaml -n api-connect
Testing Environment
- Purpose: QA and automated testing
- User Access: QA team, limited development access
- Stability Requirements: Moderate, stable during test cycles
- Maintenance Windows: Coordinated with testing schedules
- Data Requirements: Controlled test data sets
Operational Focus:
- Support testing activities
- Maintain test data integrity
- Coordinate with test automation
- Track environment differences from production
Special Operations:
# Reset test data
kubectl exec -it [pod-name] -n api-connect -- bash -c "cd /opt/test-data && ./reset-test-data.sh"
# Run test suite
kubectl create job --from=cronjob/integration-tests manual-test-run -n api-connect
# Check test results
kubectl logs -n api-connect -l job-name=manual-test-run --tail=-1
Staging Environment
- Purpose: Pre-production validation
- User Access: Limited to operations and release teams
- Stability Requirements: High, mirror production
- Maintenance Windows: Simulated production windows
- Data Requirements: Production-like data (anonymized if needed)
Operational Focus:
- Validate production changes
- Performance testing
- Production-like operations
- Security validation
Special Operations:
# Sync staging with production configuration
./sync-staging-with-prod.sh
# Run performance test
kubectl create job --from=cronjob/performance-test perf-test-run -n api-connect
# Validate configuration
kubectl diff -f staging-vs-prod.yaml
Production Environment
- Purpose: Business operations
- User Access: Highly restricted
- Stability Requirements: Maximum stability
- Maintenance Windows: Formal, scheduled windows only
- Data Requirements: Live business data, high protection
Operational Focus:
- Maximize availability and performance
- Strict change control
- Comprehensive monitoring
- Rapid incident response
Special Operations:
# Production health check
./production-health-check.sh
# Check SLO compliance
curl -X GET "https://your-tenant.dynatrace.com/api/v2/slo" \
-H "Authorization: Api-Token $DYNATRACE_TOKEN" | jq '.slo[] | {name: .name, status: .status}'
# Emergency rollback (requires approval)
kubectl rollout undo deployment [deployment-name] -n api-connect
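The SLO check above lists every SLO; during a health check it is usually the non-passing ones that matter. The sketch below filters the same JSON response. It is illustrative only: `failing_slos` is a hypothetical helper, and it assumes the response carries a top-level `slo` array whose entries have `name` and `status` fields, with `SUCCESS` marking a met objective.

```python
import json


def failing_slos(payload):
    """Return "name: status" strings for every SLO whose status is not SUCCESS."""
    data = json.loads(payload)
    return [
        f"{slo.get('name', '<unnamed>')}: {slo.get('status', 'UNKNOWN')}"
        for slo in data.get("slo", [])
        if slo.get("status") != "SUCCESS"  # assumed passing value
    ]
```

Feeding it the body returned by the `curl` call gives an empty list when all objectives are met, which makes it easy to use as a pass/fail gate in a script.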
DR Environment
- Purpose: Business continuity
- User Access: Extremely limited
- Stability Requirements: Match production
- Maintenance Windows: Coordinated with production
- Data Requirements: Synchronized with production
Operational Focus:
- Maintain synchronization
- Verify failover readiness
- Test recovery procedures
- Match production configuration
Special Operations:
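# Check replication lag (ReplicaLag is a CloudWatch metric, not a describe-db-instances field; date syntax is GNU date)
aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name ReplicaLag --dimensions Name=DBInstanceIdentifier,Value=api-connect-db-replica --start-time "$(date -u -d '-10 minutes' +%Y-%m-%dT%H:%M:%SZ)" --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 60 --statistics Average --region us-west-2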
# DR readiness check
./verify-dr-readiness.sh
# DR failover test (scheduled)
./dr-failover-test.sh
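Replication lag for an RDS read replica is published as the CloudWatch metric `ReplicaLag` (in seconds), and a readiness check can compare the latest datapoint against a limit. The sketch below assumes the JSON output of `aws cloudwatch get-metric-statistics` with `--statistics Average` and ISO 8601 timestamps in a single format; the helper names and any limit passed in are hypothetical, and the contents of `verify-dr-readiness.sh` are not shown in this runbook.

```python
import json


def latest_replica_lag(payload):
    """Return the most recent Average datapoint in seconds, or None if there is no data."""
    points = json.loads(payload).get("Datapoints", [])
    if not points:
        return None
    # Datapoints are not guaranteed to be ordered; same-format ISO timestamps sort as strings.
    latest = max(points, key=lambda p: p["Timestamp"])
    return latest["Average"]


def lag_within_limit(lag_seconds, limit_seconds):
    """Treat missing data as a failure so a silent replica is never reported healthy."""
    return lag_seconds is not None and lag_seconds <= limit_seconds
```

Treating "no datapoints" as unhealthy is a deliberate choice here: a replica that has stopped reporting is at least as concerning as one that is lagging.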
Troubleshooting Procedures
Flowchart for General Troubleshooting
graph TD
A[Issue Detected] --> B{Scope?}
B -->|Platform-wide| C[Check Infrastructure]
B -->|Component-specific| D[Check Component]
B -->|API-specific| E[Check API Configuration]
B -->|User-specific| F[Check User Permissions]
C --> C1[Check Node Status]
C --> C2[Check Control Plane]
C --> C3[Check Network Connectivity]
D --> D1[Check Pod Status]
D --> D2[Check Logs]
D --> D3[Check Resource Usage]
E --> E1[Check API Definition]
E --> E2[Check Gateway Policy]
E --> E3[Check Backend Connectivity]
F --> F1[Check Authentication]
F --> F2[Check Authorization]
F --> F3[Check Rate Limits]
C1 --> G[Resolution Steps]
C2 --> G
C3 --> G
D1 --> G
D2 --> G
D3 --> G
E1 --> G
E2 --> G
E3 --> G
F1 --> G
F2 --> G
F3 --> G
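The first branch of the flowchart can also be kept as a small lookup table, for example inside a triage helper script. This sketch is illustrative only: `TRIAGE_CHECKS` and `first_checks` are hypothetical names, and the scopes and checks are transcribed from the flowchart above.

```python
# Scope -> first checks, transcribed from the troubleshooting flowchart.
TRIAGE_CHECKS = {
    "platform-wide": ["Check Node Status", "Check Control Plane", "Check Network Connectivity"],
    "component-specific": ["Check Pod Status", "Check Logs", "Check Resource Usage"],
    "api-specific": ["Check API Definition", "Check Gateway Policy", "Check Backend Connectivity"],
    "user-specific": ["Check Authentication", "Check Authorization", "Check Rate Limits"],
}


def first_checks(scope):
    """Return the checks for a scope; raise ValueError for a scope the flowchart does not cover."""
    key = scope.strip().lower()
    if key not in TRIAGE_CHECKS:
        raise ValueError(f"unknown scope: {scope!r}")
    return TRIAGE_CHECKS[key]
```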
Common Error Scenarios and Resolutions
API Gateway 5XX Errors
Symptoms:
- API clients receiving 5XX responses
- Error rate alerts
- Backend service connectivity issues
Diagnostic Steps:
- Check Gateway pod status:
  kubectl get pods -n api-connect -l app=gateway
- Check Gateway logs:
  kubectl logs -n api-connect -l app=gateway --tail=-1 | grep -i error | tail -100
- Check backend service connectivity:
  kubectl exec -it -n api-connect [gateway-pod-name] -- curl -v [backend-service-url]
Resolution Steps:
- If gateway pods are unhealthy, restart them:
  kubectl rollout restart deployment gateway-deployment -n api-connect
- If backend service is unavailable, check the service:
  - Verify service health
  - Check network connectivity
  - Check security settings
- If configuration issue, update gateway config:
  kubectl edit configmap gateway-config -n api-connect
  # Update configuration as needed
  # Apply changes
  kubectl rollout restart deployment gateway-deployment -n api-connect
Authentication Failures
Symptoms:
- Users unable to log in
- OAuth token acquisition failures
- Increased 401/403 errors
Diagnostic Steps:
- Check authentication logs:
  kubectl logs -n api-connect -l app=gateway --tail=-1 | grep -i "auth\|authentication\|token" | tail -100
- Verify identity provider connectivity:
  kubectl exec -it -n api-connect [pod-name] -- curl -v [identity-provider-url]
- Check OAuth service status:
  kubectl get pods -n api-connect -l app=oauth
  kubectl logs -n api-connect -l app=oauth --tail=-1 | grep -i error
Resolution Steps:
- If identity provider issue:
  - Check connectivity
  - Verify configuration
  - Check if service is operational
- If OAuth service issue:
  - Restart OAuth pods:
    kubectl rollout restart deployment oauth-deployment -n api-connect
  - Check configuration:
    kubectl edit configmap oauth-config -n api-connect
- If credential issue:
  - Reset client credentials
  - Verify correct usage
  - Check for expired tokens
Performance Degradation
Symptoms:
- Increased API response times
- Timeout errors
- Resource utilization alerts
Diagnostic Steps:
- Check resource utilization:
  kubectl top pods -n api-connect
  kubectl top nodes
- Check for slow database queries:
  - Review RDS Performance Insights
  - Check for long-running transactions
- Check for high traffic patterns:
  - Review API traffic metrics
  - Look for unusual access patterns
Resolution Steps:
- If resource constraints:
  - Scale out resources:
    kubectl scale deployment gateway-deployment -n api-connect --replicas=8
  - Increase resource limits:
    kubectl edit deployment gateway-deployment -n api-connect
    # Update resource limits
- If database performance:
  - Optimize slow queries
  - Check connection pool
  - Consider instance scaling:
    aws rds modify-db-instance --db-instance-identifier api-connect-db --db-instance-class db.m5.2xlarge --apply-immediately --region us-east-1
- If traffic pattern issue:
  - Implement or adjust rate limiting
  - Check for abusive clients
  - Consider traffic shaping
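When many pods are running, the `kubectl top pods` output from the diagnostic step can be filtered for the pods nearest their CPU limit. The sketch below parses that text output. The helper names and any threshold are assumptions, and it expects the default three-column layout (`NAME`, `CPU(cores)`, `MEMORY(bytes)`) with CPU given either in millicores (`850m`) or whole cores (`1`).

```python
def parse_cpu_millicores(value):
    """Convert a kubectl CPU quantity such as "850m" or "1" to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def hot_pods(top_output, cpu_limit_m):
    """Return names of pods whose CPU usage is at or above cpu_limit_m millicores."""
    hot = []
    for line in top_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, cpu = fields[0], fields[1]
        if parse_cpu_millicores(cpu) >= cpu_limit_m:
            hot.append(name)
    return hot
```

The threshold would normally be derived from the container CPU limit in the deployment spec rather than hard-coded.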
Kubernetes Issues
Symptoms:
- Pod scheduling failures
- Control plane issues
- Network connectivity problems
Diagnostic Steps:
- Check node status:
  kubectl get nodes
  kubectl describe node [problem-node]
- Check pod events:
  kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp | tail -50
- Check control plane status:
  # componentstatuses is deprecated since Kubernetes v1.19; query the API server readiness endpoint instead
  kubectl get --raw='/readyz?verbose'
Resolution Steps:
- If node issues:
  - Drain problematic node:
    kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data
  - Investigate node problems:
    aws ec2 describe-instances --instance-ids [instance-id] --region us-east-1
  - Replace if necessary:
    aws ec2 terminate-instances --instance-ids [instance-id] --region us-east-1
- If pod scheduling issues:
  - Check resource availability
  - Check node selectors/taints
  - Check PV/PVC status
- If control plane issues:
  - Contact AWS for EKS control plane issues
  - Check for AWS status page incidents
  - Verify networking to control plane
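For the node status check, the JSON form of `kubectl get nodes -o json` is easier to script against than the table output. The sketch below lists nodes whose `Ready` condition is not `True`. `not_ready_nodes` is a hypothetical helper, not an existing tool.

```python
import json


def not_ready_nodes(nodes_json):
    """Return names of nodes that do not report a Ready condition with status "True"."""
    names = []
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(node["metadata"]["name"])
    return names
```

Nodes with no conditions at all are reported as not ready, which errs on the side of flagging a node for investigation.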
Operational Metrics and Reporting
Regular Reports
Report | Frequency | Audience | Content |
---|---|---|---|
Daily Status Report | Daily | SRE Team | System health, incidents, alerts |
Weekly Performance Report | Weekly | SRE Team, Product Owners | Performance trends, capacity usage, API metrics |
Monthly SLO Report | Monthly | SRE Team, Management | SLO compliance, availability, incident summary |
Quarterly Business Review | Quarterly | Product Team, Executive | Strategic metrics, growth trends, capacity needs |
Operational Metrics
Metric Category | Example Metrics | Reporting Frequency |
---|---|---|
Availability | Uptime percentage, outage count/duration | Daily, Weekly, Monthly |
Performance | Response time trends, throughput patterns | Weekly |
Capacity | Resource utilization trends, scaling events | Weekly |
Incidents | Count by severity, MTTR, recurring issues | Weekly, Monthly |
SLOs | SLO compliance, error budget consumption | Monthly |
Business Metrics | API call volume, user growth, revenue impact | Monthly, Quarterly |
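Several of these metrics are simple arithmetic over downtime and incident durations. The sketch below shows one common way to compute availability, error budget consumption, and MTTR. The function names are illustrative, and the time-based error budget definition is an assumption; the platform's SLO tooling may define the budget differently (for example, per request).

```python
def availability_pct(total_minutes, downtime_minutes):
    """Uptime as a percentage of the reporting window."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def error_budget_minutes(slo_target_pct, total_minutes):
    """Downtime allowed by the SLO target over the window."""
    return total_minutes * (100.0 - slo_target_pct) / 100.0


def error_budget_consumed_pct(slo_target_pct, total_minutes, downtime_minutes):
    """Share of the allowed downtime that has been used."""
    return 100.0 * downtime_minutes / error_budget_minutes(slo_target_pct, total_minutes)


def mttr_minutes(incident_durations):
    """Mean time to recovery: average incident duration in minutes."""
    return sum(incident_durations) / len(incident_durations)
```

As a worked example, a 30-day window has 43,200 minutes, so a 99.9% target allows about 43.2 minutes of downtime; 21.6 minutes of downtime corresponds to roughly 99.95% availability and about half of the error budget.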
Report Generation
// Jenkins pipeline for report generation
pipeline {
    agent any
    triggers {
        cron('0 7 * * 1') // Weekly on Monday at 7 AM
    }
    stages {
        stage('Gather Metrics') {
            steps {
                // Triple single quotes: the shell, not Groovy, expands $DYNATRACE_TOKEN
                sh '''
                    # Get availability data
                    curl -X GET "https://your-tenant.dynatrace.com/api/v2/metrics/query?metricSelector=builtin:service.availability:filter(eq(service.name,API%20Gateway))&from=-7d&to=now" -H "Authorization: Api-Token $DYNATRACE_TOKEN" > availability.json
                    # Get performance data
                    curl -X GET "https://your-tenant.dynatrace.com/api/v2/metrics/query?metricSelector=builtin:service.response.time:filter(eq(service.name,API%20Gateway)):splitBy():avg&from=-7d&to=now" -H "Authorization: Api-Token $DYNATRACE_TOKEN" > performance.json
                    # Get utilization data
                    kubectl top nodes > node-utilization.txt
                    kubectl top pods -n api-connect > pod-utilization.txt
                    # Get SLO data
                    curl -X GET "https://your-tenant.dynatrace.com/api/v2/slo" -H "Authorization: Api-Token $DYNATRACE_TOKEN" > slo.json
                '''
            }
        }
        stage('Generate Report') {
            steps {
                sh '''
                    # Run report generation script
                    python generate_weekly_report.py > weekly-report.html
                '''
            }
        }
        stage('Distribute Report') {
            steps {
                emailext body: '${FILE,path="weekly-report.html"}',
                    mimeType: 'text/html',
                    subject: "API Connect Weekly Performance Report - ${BUILD_TIMESTAMP}",
                    to: '[email protected],[email protected]'
            }
        }
    }
    post {
        always {
            archiveArtifacts artifacts: 'weekly-report.html', fingerprint: true
        }
    }
}
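The pipeline calls `generate_weekly_report.py`, which is not shown in this runbook. As a rough idea of what its core could look like, the sketch below averages the values in a saved metrics response and renders one HTML table row. Everything here is an assumption: the helper names are hypothetical, and the response is taken to have a `result` list whose entries hold `data` series with a `values` array that may contain nulls.

```python
import json
from html import escape


def average_metric(payload):
    """Average all non-null values across every series in a metrics query response."""
    values = [
        v
        for result in json.loads(payload).get("result", [])
        for series in result.get("data", [])
        for v in series.get("values", [])
        if v is not None
    ]
    return sum(values) / len(values) if values else None


def report_row(label, value, unit):
    """Render one table row, showing "n/a" when no data was returned."""
    shown = "n/a" if value is None else f"{value:.2f} {unit}"
    return f"<tr><td>{escape(label)}</td><td>{escape(shown)}</td></tr>"
```

A full script would read `availability.json` and `performance.json`, build one row per metric, and print the assembled table to standard output so the pipeline's redirect captures it.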
References and Resources
Quick Reference Commands
Task | Command |
---|---|
Check node status | kubectl get nodes |
Check pod status | kubectl get pods -n api-connect |
View pod logs | kubectl logs -n api-connect [pod-name] |
Check pod resource usage | kubectl top pods -n api-connect |
Restart a deployment | kubectl rollout restart deployment [deployment-name] -n api-connect |
Scale a deployment | kubectl scale deployment [deployment-name] -n api-connect --replicas=[count] |
Check service status | kubectl get svc -n api-connect |
Check recent events | kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp |
Execute command in pod | kubectl exec -it [pod-name] -n api-connect -- [command] |
Check RDS status | aws rds describe-db-instances --db-instance-identifier api-connect-db --region us-east-1 |
Check recent backups | aws rds describe-db-snapshots --db-instance-identifier api-connect-db --snapshot-type automated --region us-east-1 |
Check certificate expiry | aws acm describe-certificate --certificate-arn [arn] --region us-east-1 |
Operational Checklists
Daily Operational Checklist
## API Connect Daily Health Check
### Infrastructure
- [ ] All nodes are in Ready state
- [ ] All pods are in Running state (or Completed for jobs)
- [ ] Resource utilization within normal ranges
- [ ] No concerning events in Kubernetes logs
### API Connect Components
- [ ] Gateway services operational
- [ ] Management services operational
- [ ] Portal services operational
- [ ] Analytics services operational
### Database
- [ ] Database instance healthy
- [ ] Connection counts within normal ranges
- [ ] No slow query issues
- [ ] Backup completed successfully
### Security
- [ ] No unusual authentication failures
- [ ] No suspicious access patterns
- [ ] Certificate status validated
### Monitoring
- [ ] All monitoring systems operational
- [ ] SLOs within targets
- [ ] No unacknowledged alerts
Weekly Operational Checklist
## API Connect Weekly Operational Review
### Performance
- [ ] Review performance trends for all components
- [ ] Identify any degradation patterns
- [ ] Check for resource bottlenecks
- [ ] Validate API response times
### Capacity
- [ ] Review resource utilization trends
- [ ] Check for approaching thresholds
- [ ] Update capacity forecasts if needed
- [ ] Plan for any scaling needs
### Backups
- [ ] Verify all backups completed successfully
- [ ] Check backup retention compliance
- [ ] Perform recovery test if scheduled
- [ ] Validate backup procedures
### Security
- [ ] Review security events and logs
- [ ] Check for certificate expirations
- [ ] Validate access controls
- [ ] Review security scan results
### Maintenance
- [ ] Review upcoming maintenance needs
- [ ] Schedule required maintenance
- [ ] Verify patch status
- [ ] Update documentation as needed
Operational Documentation
The SRE team should maintain the following operational documentation:
- Runbooks: Detailed procedures for specific tasks (this document)
- Incident Postmortems: Documentation of past incidents and resolution
- Change Logs: Record of all changes made to the platform
- SLO Reports: Regular reporting on SLO compliance
- Capacity Plans: Documentation of capacity planning and forecasts
- Architecture Diagrams: Up-to-date diagrams of the platform architecture
- Knowledge Base: Collection of troubleshooting tips and solutions
Related Documentation
- Main Runbook - Main platform runbook
- Gateway Runbook - Gateway component runbook
- Management Runbook - Management component runbook
- Portal Runbook - Portal component runbook
- Infrastructure Runbook - Infrastructure management
- Database Runbook - Database-specific procedures
- Maintenance Runbook - Planned maintenance procedures
- Incident Management - Incident response procedures
- Architecture - Platform architecture documentation
- Observability - Monitoring and observability details
- IBM API Connect Documentation
Contact Information
Role | Contact | Availability |
---|---|---|
SRE Team | [email protected] | 24/7 via Teams |
Database Team | [email protected] | Business hours + on-call |
Network Team | [email protected] | Business hours + on-call |
Security Team | [email protected] | Business hours + on-call |
Product Team | [email protected] | Business hours |
IBM Support | IBM Support Portal (Case #IBM-12345) | 24/7 with support contract |
AWS Support | AWS Support Portal (Account #AWS-67890) | 24/7 with support contract |