Operations Runbook - CodySchluenz/tester GitHub Wiki

API Connect Operations Runbook

This runbook provides detailed operational procedures for the IBM API Connect platform deployed on AWS EKS. It covers day-to-day operations, monitoring, troubleshooting, and routine tasks to maintain platform health and performance.

Operations Overview

The API Connect platform requires continuous operational management to ensure optimal performance, reliability, and security. This runbook outlines standard operational procedures for the SRE team.

Operational Principles

Consistency: Use standardized procedures across environments
Proactive Management: Identify and address issues before they impact users
Automation: Automate routine tasks whenever possible
Documentation: Keep documentation current with operational changes
Continuous Improvement: Regularly review and enhance operational procedures

Operational Responsibilities

graph TD
    A[SRE Operations Team] --> B[Platform Health]
    A --> C[Performance Management]
    A --> D[Capacity Planning]
    A --> E[Security Operations]
    A --> F[User Management]
    A --> G[Continuous Improvement]
    
    B --> B1[Daily Health Checks]
    B --> B2[Incident Response]
    B --> B3[Backup Management]
    
    C --> C1[Performance Monitoring]
    C --> C2[Performance Tuning]
    C --> C3[Bottleneck Identification]
    
    D --> D1[Resource Monitoring]
    D --> D2[Scaling Operations]
    D --> D3[Capacity Forecasting]
    
    E --> E1[Access Management]
    E --> E2[Security Monitoring]
    E --> E3[Vulnerability Management]
    
    F --> F1[User Onboarding]
    F --> F2[Permission Management]
    F --> F3[Organization Management]
    
    G --> G1[Process Improvement]
    G --> G2[Automation Enhancement]
    G --> G3[Knowledge Transfer]

Daily Operations

Morning Checklist

Task	Description	Tool/Command	Expected Result
Infrastructure Health	Verify all nodes and services are healthy	Dynatrace dashboard, `kubectl get nodes,pods -A`	All nodes Ready, all pods Running
API Gateway Check	Verify gateway services operational	Gateway endpoint test, metrics	2XX responses, normal traffic patterns
Alert Review	Review any overnight alerts	Dynatrace, ServiceNow	All alerts triaged
Backup Verification	Verify backups completed successfully	AWS Console, backup logs	Successful backup completion
Security Events	Review security events and logs	Splunk, CloudTrail	No unexpected security events
Service Level Check	Verify SLOs are being met	Dynatrace SLO dashboard	All SLOs within targets

Morning Checklist Procedure

Infrastructure Health Check:

# Check node status
kubectl get nodes

# Check for problematic pods
kubectl get pods -A | grep -v "Running\|Completed"

# Check resource utilization (top 10 pods by CPU; head -11 keeps the header row)
kubectl top nodes
kubectl top pods -A --sort-by=cpu | head -11

API Gateway Check:

# Test gateway health
curl -k https://api-gateway.example.com/health

# Check error rates
# Use Dynatrace query or Splunk dashboard

Alert Review:
- Review Dynatrace problems dashboard: https://your-tenant.dynatrace.com/problems
- Check ServiceNow for overnight tickets: https://your-instance.service-now.com/incident_list.do?sysparm_query=assignment_group=api-connect^active=true
- Check Microsoft Teams channel #api-connect-alerts for notifications

Backup Verification:

# Check RDS automated backups
aws rds describe-db-snapshots --db-instance-identifier api-connect-db --snapshot-type automated --region us-east-1 --query "DBSnapshots[?SnapshotCreateTime>='$(date -d "yesterday" +%Y-%m-%d)'].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}" --output table

# Check S3 backup job status
aws s3 ls s3://api-connect-backups/$(date -d "yesterday" +%Y-%m-%d)/

Security Events Review:
- Check Splunk security dashboard: https://splunk.your-company.com/en-US/app/search/security_events
- Review failed login attempts and access violations

Service Level Check:
- Review SLO dashboard in Dynatrace: https://your-tenant.dynatrace.com/slo-dashboard
- Check for any services approaching breach of SLO

Evening Checklist

Task	Description	Tool/Command	Expected Result
End-of-Day Health Check	Verify system health at end of day	Dynatrace dashboard, kubectl	All systems healthy
Resource Utilization Review	Check for resource constraints	Dynatrace, CloudWatch	All resources within thresholds
Pending Ticket Review	Check for unresolved tickets	ServiceNow	All P1/P2 issues resolved
CI/CD Pipeline Status	Verify build pipelines are healthy	Jenkins	All pipelines green
Scheduled Jobs Check	Verify scheduled jobs completed	Jenkins, Kubernetes CronJobs	All jobs completed successfully
On-Call Handover	Brief next on-call engineer	Microsoft Teams	Handover completed

Evening Checklist Procedure

End-of-Day Health Check:

# Check overall system health
kubectl get nodes,deployment,statefulset,svc -n api-connect

# Check for any recent events
kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp | tail -20

Resource Utilization Review:

# Check highest CPU/memory consumers
kubectl top pods -n api-connect

# List per-container requests and limits (compare against `kubectl top pods --containers` for live usage)
kubectl get pods -n api-connect -o json | jq '.items[] | {name: .metadata.name, containers: [.spec.containers[] | {name: .name, requests: .resources.requests, limits: .resources.limits}]}'
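To judge whether a pod is approaching its limit, the usage reported by `kubectl top` has to be compared with the configured limit in the same unit. The helper below is a minimal sketch for CPU quantities only; the function names `to_millicores` and `pct_of_limit` are illustrative and not part of any existing tooling.

```shell
#!/bin/bash
# Convert a Kubernetes CPU quantity ("250m", "2", "1.5") to whole millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%d\n", cores * 1000 + 0.5 }' ;;
  esac
}

# Print CPU usage as a percentage of the limit, rounded down.
# $1 = observed usage (e.g. from `kubectl top pods`), $2 = configured limit.
pct_of_limit() {
  local usage limit
  usage=$(to_millicores "$1")
  limit=$(to_millicores "$2")
  if [ "$limit" -le 0 ]; then
    echo "limit must be greater than zero" >&2
    return 1
  fi
  echo $(( usage * 100 / limit ))
}
```

For example, `pct_of_limit 450m 500m` prints `90`, which is above the 85% critical threshold used for pod CPU in this runbook.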

Pending Ticket Review:
- Check ServiceNow for unresolved tickets: https://your-instance.service-now.com/incident_list.do?sysparm_query=assignment_group=api-connect^active=true
- Escalate or reassign any tickets requiring attention

CI/CD Pipeline Status:
- Check Jenkins build status: https://jenkins.your-company.com/job/api-connect/
- Investigate and resolve any failed builds

Scheduled Jobs Check:

# Check CronJob executions
kubectl get cronjobs -n api-connect
kubectl get jobs -n api-connect

# Check for failed jobs (jobs with at least one failed pod; still-running jobs are excluded)
kubectl get jobs -n api-connect -o json | jq -r '.items[] | select((.status.failed // 0) > 0) | .metadata.name'

On-Call Handover:
- Update on-call handover document
- Brief next on-call engineer about any ongoing issues
- Ensure they have access to all necessary resources

Daily Health Check Automation

#!/bin/bash
# Daily API Connect Health Check

echo "===== API Connect Health Check: $(date) ====="

echo -e "\n>> Checking node status..."
kubectl get nodes

echo -e "\n>> Checking non-running pods..."
kubectl get pods -A | grep -v "Running\|Completed"

echo -e "\n>> Checking API Connect deployments..."
kubectl get deployment -n api-connect

echo -e "\n>> Checking resource utilization..."
echo "Top 10 CPU-consuming pods:"
kubectl top pods -A --sort-by=cpu | head -11

echo "Top 10 memory-consuming pods:"
kubectl top pods -A --sort-by=memory | head -11

echo -e "\n>> Checking recent events..."
kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp | tail -20

echo -e "\n>> Checking service endpoints..."
for svc in gateway-service manager-service portal-service analytics-service; do
  echo "Testing $svc..."
  # No -it here: a TTY is not available when the script runs from cron or CI
  kubectl exec debug-pod -n api-connect -- curl -ks -m 5 "https://$svc:9443/health" || echo "Failed to connect to $svc"
done

echo -e "\n>> Checking database status..."
# Connection counts are not returned by describe-db-instances; use the CloudWatch DatabaseConnections metric for those
aws rds describe-db-instances --db-instance-identifier api-connect-db --query "DBInstances[*].{Status:DBInstanceStatus,Class:DBInstanceClass,Storage:AllocatedStorage,IOPS:Iops}" --output table --region us-east-1

echo -e "\n>> Checking recent backups..."
aws rds describe-db-snapshots --db-instance-identifier api-connect-db --snapshot-type automated --region us-east-1 --query "DBSnapshots[?SnapshotCreateTime>='$(date -d "yesterday" +%Y-%m-%d)'].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}" --output table

echo -e "\n===== Health Check Complete ====="
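The script above prints the output of every check but always exits successfully, so a scheduler cannot tell a healthy run from a failing one. One possible refinement, sketched here with the hypothetical helpers `run_check` and `report_failures`, is to count failed checks and return a non-zero status at the end.

```shell
#!/bin/bash
# Number of checks that returned a non-zero exit status.
FAILURES=0

# Run one named check; record a failure instead of aborting the whole run.
# $1 = label for the check, remaining arguments = command to execute.
run_check() {
  local name="$1"
  shift
  if "$@"; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    FAILURES=$((FAILURES + 1))
  fi
}

# Print a summary; returns non-zero when at least one check failed.
report_failures() {
  echo "$FAILURES check(s) failed"
  [ "$FAILURES" -eq 0 ]
}

# Example wiring (commented out so the sketch has no cluster dependency):
# run_check "nodes" kubectl get nodes
# run_check "deployments" kubectl get deployment -n api-connect
# report_failures || exit 1
```

Call `run_check` directly rather than inside `$(...)`, because a subshell would discard the updated `FAILURES` counter.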

Monitoring and Observability

Dashboard Overview

Dashboard	Purpose	URL	Primary Users
API Connect Overview	Platform-wide health and status	Dynatrace Dashboard	All SRE Team
Gateway Performance	API Gateway metrics and performance	Dynatrace Dashboard	SRE Team
Management Console	Management service health	Dynatrace Dashboard	SRE Team
Portal Health	Developer portal status	Dynatrace Dashboard	SRE Team
SLO Tracking	Service level objective monitoring	Dynatrace Dashboard	SRE/Management
API Usage Analytics	Business metrics for API usage	Dynatrace Dashboard	Product Team
Security Dashboard	Security events and compliance	Splunk Dashboard	Security Team
Infrastructure Health	AWS/EKS infrastructure metrics	CloudWatch Dashboard	SRE Team

Key Metrics to Monitor

Metric	Description	Warning Threshold	Critical Threshold	Data Source
Gateway Success Rate	% of successful API calls	<99.5%	<99%	Dynatrace
Gateway Response Time (p95)	95th percentile response time	>300ms	>500ms	Dynatrace
Gateway Throughput	Requests per minute	<500 req/min for critical APIs	<100 req/min for critical APIs	Dynatrace
Error Rate	% of 5XX responses	>0.1%	>1%	Dynatrace
Node CPU Utilization	CPU usage of worker nodes	>70%	>85%	CloudWatch
Node Memory Utilization	Memory usage of worker nodes	>70%	>85%	CloudWatch
Pod CPU Utilization	CPU usage of pods	>70% of limit	>85% of limit	Dynatrace
Pod Memory Utilization	Memory usage of pods	>70% of limit	>85% of limit	Dynatrace
Database CPU Utilization	RDS CPU usage	>70%	>85%	CloudWatch
Database Connections	Number of active DB connections	>70% of max	>85% of max	CloudWatch
Database Storage	RDS storage utilization	>70%	>85%	CloudWatch
Certificate Expiry	Days until certificate expiration	<30 days	<7 days	Custom
Auth Failures	Authentication failures per minute	>5/min	>20/min	Splunk
Rate Limit Violations	Rate limit hits per minute	>100/min	>1000/min	Splunk
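The thresholds in this table run in two directions: for most metrics a higher value is worse, while for success rate, throughput, and certificate expiry a lower value is worse. The sketch below infers the direction from the order of the two thresholds; `classify_metric` is an illustrative name, and the function is an assumption about how such a check could be scripted rather than part of Dynatrace or Splunk.

```shell
#!/bin/bash
# Classify an observed value against warning and critical thresholds.
# $1 = observed value, $2 = warning threshold, $3 = critical threshold.
# If critical >= warning, higher values are worse (CPU %, latency).
# If critical <  warning, lower values are worse (success rate, days to expiry).
classify_metric() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    v += 0; w += 0; c += 0
    status = "OK"
    if (c >= w) {
      if (v > c)      status = "CRITICAL"
      else if (v > w) status = "WARNING"
    } else {
      if (v < c)      status = "CRITICAL"
      else if (v < w) status = "WARNING"
    }
    print status
  }'
}
```

For example, `classify_metric 72 70 85` reports a node at 72% CPU as `WARNING`, and `classify_metric 99.2 99.5 99` reports a 99.2% gateway success rate as `WARNING`.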

Alert Configuration

Dynatrace Alert Configuration

Alert profiles should be configured in Dynatrace to ensure proper notification routing:

# Critical Infrastructure Alerts
criticalAlerts:
  name: "API Connect Critical Alerts"
  severity: AVAILABILITY, PERFORMANCE, ERROR
  sendToServiceNow: true
  sendToTeams: true
  sendToPagerDuty: true
  services:
    - "API Gateway"
    - "Management Service"
    - "Portal Service"
    - "Authentication Service"

# Performance Alerts
performanceAlerts:
  name: "API Connect Performance Alerts"
  severity: PERFORMANCE, RESOURCE_CONTENTION
  sendToServiceNow: true
  sendToTeams: true
  sendToPagerDuty: false
  services:
    - "API Gateway"
    - "Management Service"
    - "Portal Service"
    - "Analytics Service"

# Non-critical Alerts
nonCriticalAlerts:
  name: "API Connect Non-critical Alerts"
  severity: INFO, CUSTOM_ALERT
  sendToServiceNow: true
  sendToTeams: true
  sendToPagerDuty: false
  services:
    - "*"

Recommended Splunk Alerts

Alert	Condition	Priority	Notification
Gateway Error Spike	Error rate > 1% for 5 minutes	High	ServiceNow, Teams
Authentication Failures	Auth failures > 20/min for 5 minutes	High	ServiceNow, Teams
Database Connection Exhaustion	DB connections > 85% for 5 minutes	Critical	ServiceNow, Teams
Certificate Expiration	Certificate expires in < 7 days	High	ServiceNow, Teams
Security Violations	Pattern matching security violations	Critical	ServiceNow, Teams, Security Team
API Usage Anomaly	Dramatic change in API usage patterns	Medium	ServiceNow, Teams
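Several of these conditions only fire when a threshold is exceeded continuously ("for 5 minutes") rather than on a single spike. The sketch below illustrates that logic for per-minute samples supplied on standard input; `sustained_breach` is a hypothetical helper, and in practice Splunk evaluates the condition itself.

```shell
#!/bin/bash
# Reads one numeric sample per line on stdin, oldest first.
# Exits 0 only when the newest $2 samples all exceed threshold $1.
# Exits 1 if any of them is at or below the threshold, or if fewer
# than $2 samples are available.
sustained_breach() {
  tail -n "$2" | awk -v t="$1" -v n="$2" '
    BEGIN { over = 0 }
    ($1 + 0) > (t + 0) { over++ }
    END { exit !(NR == n + 0 && over == n + 0) }'
}
```

For example, `printf '%s\n' 0.4 1.2 1.5 1.1 1.3 1.4 | sustained_breach 1 5` succeeds because the five most recent error-rate samples are all above 1%.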

Operational Monitoring Queries

Dynatrace Queries

# API Gateway Error Rate
metricSelector=builtin:service.errors.total.rate:filter(eq(service.name,API Gateway)):splitBy():sum:auto:sort(value(auto,descending))

# Response Time Trends
metricSelector=builtin:service.response.time:filter(eq(service.name,API Gateway)):splitBy(service.name):avg:auto:sort(value(auto,descending))

# Resource Utilization
metricSelector=builtin:containers.cpu.usage:filter(eq(kubernetes.pod.name,api-connect-gateway)):splitBy(kubernetes.pod.name):avg:auto:sort(value(auto,descending))

Splunk Queries

# Error Rate by API
index=api_connect sourcetype=gateway-logs status>=500 | stats count by api_name, status | sort -count

# Authentication Failures
index=api_connect sourcetype=gateway-logs "authentication failed" OR "unauthorized" | stats count by client_id, error_message

# Slow API Calls
index=api_connect sourcetype=gateway-access-logs | stats avg(response_time) as avg_resp, p95(response_time) as p95_resp by api_path | where p95_resp > 300 | sort -p95_resp

# Rate Limiting Events
index=api_connect sourcetype=gateway-logs "rate limit exceeded" | timechart count by client_id

# Security Events
index=api_connect sourcetype=gateway-logs ("injection" OR "XSS" OR "attack" OR "exploit") | stats count by src_ip, api_path, event_type

Capacity Management

Resource Monitoring

Resource utilization should be monitored to ensure adequate capacity and prevent performance issues.

Key Capacity Metrics

Resource	Metric	Warning Threshold	Critical Threshold	Scaling Recommendation
Gateway CPU	CPU Utilization	70%	85%	Scale horizontally, add more pods
Gateway Memory	Memory Utilization	70%	85%	Scale horizontally, add more pods
Management CPU	CPU Utilization	70%	85%	Scale vertically, increase pod resources
Management Memory	Memory Utilization	70%	85%	Scale vertically, increase pod resources
Portal CPU	CPU Utilization	70%	85%	Scale horizontally, add more pods
Portal Memory	Memory Utilization	70%	85%	Scale horizontally, add more pods
Analytics CPU	CPU Utilization	70%	85%	Scale horizontally, add more pods
Analytics Memory	Memory Utilization	70%	85%	Scale horizontally, add more pods
EKS Node CPU	CPU Utilization	70%	85%	Add more nodes to cluster
EKS Node Memory	Memory Utilization	70%	85%	Add more nodes to cluster
Database CPU	CPU Utilization	70%	85%	Scale up RDS instance
Database Storage	Storage Utilization	70%	85%	Increase storage allocation
Database IOPS	IOPS Utilization	70%	85%	Increase provisioned IOPS

Resource Utilization Monitoring

# Node-level resource monitoring
kubectl top nodes

# Pod-level resource monitoring
kubectl top pods -n api-connect

# RDS monitoring
aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name CPUUtilization --dimensions Name=DBInstanceIdentifier,Value=api-connect-db --start-time $(date -u -d "1 hour ago" +%Y-%m-%dT%H:%M:%SZ) --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) --period 300 --statistics Average --region us-east-1

Scaling Operations

Horizontal Pod Scaling

Manual Scaling:

# Scale Gateway deployment
kubectl scale deployment gateway-deployment -n api-connect --replicas=6

# Scale Management deployment
kubectl scale deployment manager-deployment -n api-connect --replicas=3

# Scale Portal deployment
kubectl scale deployment portal-deployment -n api-connect --replicas=3

# Scale Analytics deployment
kubectl scale deployment analytics-deployment -n api-connect --replicas=3

Horizontal Pod Autoscaler (HPA) Configuration:

# Check current HPA configuration
kubectl get hpa -n api-connect

# Configure HPA for Gateway
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway-hpa
  namespace: api-connect
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gateway-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
EOF
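With this configuration the HPA sizes the deployment using the formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the min/max range. The sketch below reproduces that arithmetic so a scaling decision can be sanity-checked by hand. It ignores the controller's tolerance band and stabilization window, and the function name `hpa_desired_replicas` is illustrative.

```shell
#!/bin/bash
# Estimate the replica count the HPA would request.
# $1 = current replicas, $2 = observed average CPU utilization (%),
# $3 = target utilization (%), $4 = minReplicas, $5 = maxReplicas.
hpa_desired_replicas() {
  local current="$1" observed="$2" target="$3" min="$4" max="$5" desired
  # Integer ceiling of current * observed / target
  desired=$(( (current * observed + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}
```

For example, `hpa_desired_replicas 3 90 70 3 10` prints `4`: three gateway pods averaging 90% CPU against a 70% target call for one additional replica.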

Vertical Pod Scaling

For components that don't scale horizontally efficiently:

Update Resource Requirements:

# Update Management deployment resources
kubectl patch deployment manager-deployment -n api-connect -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "manager",
            "resources": {
              "requests": {
                "cpu": "1000m",
                "memory": "2Gi"
              },
              "limits": {
                "cpu": "2000m",
                "memory": "4Gi"
              }
            }
          }
        ]
      }
    }
  }
}'

Node Scaling

Manual Node Scaling:

# Update node group desired capacity
aws eks update-nodegroup-config --cluster-name api-connect-cluster --nodegroup-name api-connect-nodes --scaling-config desiredSize=5,minSize=3,maxSize=10 --region us-east-1

Cluster Autoscaler Configuration:
- Ensure Cluster Autoscaler is deployed
- Configure appropriate min/max nodes
- Monitor scale-up and scale-down events

Database Scaling

RDS Instance Scaling:

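# Scale up RDS instance (changing the instance class causes an outage or Multi-AZ failover; omit --apply-immediately to defer the change to the maintenance window)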
aws rds modify-db-instance --db-instance-identifier api-connect-db --db-instance-class db.m5.2xlarge --apply-immediately --region us-east-1

Storage Scaling:

# Increase allocated storage
aws rds modify-db-instance --db-instance-identifier api-connect-db --allocated-storage 200 --apply-immediately --region us-east-1

Read Scaling:

# Create read replica
aws rds create-db-instance-read-replica --db-instance-identifier api-connect-db-replica --source-db-instance-identifier api-connect-db --db-instance-class db.m5.xlarge --region us-east-1

Capacity Planning

Long-term capacity planning should be conducted regularly to anticipate growth and prevent resource constraints.

Capacity Planning Process

Data Collection:
- Collect 3-6 months of historical usage data
- Identify growth trends in API calls, users, and data storage
- Document seasonal variations and peak usage patterns
Growth Prediction:
- Calculate month-over-month growth rates
- Project growth for the next 6-12 months
- Consider business initiatives that may impact growth
Resource Estimation:
- Calculate projected resource needs based on growth
- Include headroom for unexpected spikes (typically 20%)
- Consider different scaling options (horizontal vs. vertical)
Capacity Plan Development:
- Document current capacity
- Define capacity upgrade schedule
- Include budget estimates
- Outline implementation approach
Review and Approval:
- Review with stakeholders
- Adjust based on feedback
- Obtain necessary approvals
Implementation:
- Execute according to schedule
- Monitor for effectiveness
- Adjust as needed
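The growth prediction and resource estimation steps above amount to compounding the current usage by the monthly growth rate and then adding headroom. A minimal sketch of that arithmetic follows; `project_capacity` is an illustrative name, and a constant month-over-month growth rate is a simplifying assumption.

```shell
#!/bin/bash
# Project required capacity with compound monthly growth plus headroom.
# $1 = current usage (any unit), $2 = month-over-month growth (%),
# $3 = months to project, $4 = headroom (%). Prints a rounded whole number.
project_capacity() {
  awk -v cur="$1" -v g="$2" -v m="$3" -v h="$4" 'BEGIN {
    projected = cur * (1 + g / 100) ^ m
    printf "%.0f\n", projected * (1 + h / 100)
  }'
}
```

For example, `project_capacity 100 10 2 20` prints `145`: 100 units growing 10% per month for two months is 121, and 20% headroom brings the target to about 145.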

Performance Management

Performance Monitoring

Regular performance monitoring is essential to ensure optimal operation of the API Connect platform.

Key Performance Indicators

KPI	Description	Target	Measurement Method
API Response Time	Time to process API requests	<200ms (p95)	Dynatrace service monitoring
API Gateway Throughput	Number of requests processed per second	>500 req/sec per node	Dynatrace service monitoring
Management Console Response Time	UI responsiveness	<1s for page loads	Dynatrace synthetic monitoring
Database Query Performance	Database query execution time	<100ms for 95% of queries	RDS Performance Insights
Kubernetes API Responsiveness	Control plane performance	<300ms for API operations	Kubernetes metrics
End-to-End Transaction Time	Complete business flow timing	Varies by transaction	Custom synthetic tests

Performance Monitoring Tools and Techniques

Dynatrace Monitoring:
- Service-level monitoring
- Synthetic user journeys
- Performance hotspot analysis
- User experience monitoring
Database Performance Monitoring:
- RDS Performance Insights
- Slow query logging
- Connection pool monitoring
- Index performance analysis
Kubernetes Performance Monitoring:
- Control plane metrics
- etcd performance
- API server request latency
- kubelet performance
Custom Performance Tests:
- Regular JMeter load tests
- Synthetic API calls
- End-to-end business flow tests
- Performance regression testing

Performance Tuning

Gateway Performance Tuning

Connection Pooling:

# Update gateway configuration for connection pooling
kubectl edit configmap gateway-config -n api-connect
# Adjust settings:
# - maxConnections
# - connectionTimeout
# - connectionIdleTimeout

Thread Pool Configuration:

# Adjust thread pool settings in gateway configuration
kubectl edit configmap gateway-config -n api-connect
# Adjust settings:
# - threadPoolSize
# - queueSize
# - maxConcurrency

Cache Optimization:

# Update cache settings
kubectl edit configmap gateway-config -n api-connect
# Adjust settings:
# - cacheEnabled: true
# - cacheSize
# - cacheTTL

Resource Configuration:

# Optimize gateway resource allocation
kubectl edit deployment gateway-deployment -n api-connect
# Adjust resource requests/limits based on observed usage

Database Performance Tuning

Query Optimization:
- Identify slow queries from RDS Performance Insights
- Review and optimize query patterns
- Add appropriate indexes

Connection Pool Management:

# Adjust connection pool settings
kubectl edit configmap database-config -n api-connect
# Adjust settings:
# - maxPoolSize
# - minPoolSize
# - maxIdleTime

RDS Instance Optimization:
- Select appropriate instance type
- Configure optimized storage (IOPS)
- Enable performance insights
- Configure appropriate parameter groups
Database Maintenance:
- Regular VACUUM ANALYZE
- Index rebuilding
- Statistics updates
- See Database Maintenance

Performance Testing

Regular performance testing helps identify issues before they impact users.

Performance Test Approach

Baseline Testing:
- Establish performance baselines for key operations
- Document baseline metrics
- Set performance targets
Load Testing:
- Simulate expected user load
- Measure response times under load
- Identify bottlenecks
Stress Testing:
- Test system at 2-3x expected maximum load
- Identify breaking points
- Verify graceful degradation
Endurance Testing:
- Test system under sustained load
- Identify memory leaks or resource exhaustion
- Verify long-term stability

JMeter Test Execution

# Run baseline performance test
jmeter -n -t tests/baseline-test.jmx -l results/baseline-$(date +%Y%m%d).jtl -j logs/baseline-$(date +%Y%m%d).log

# Run load test
jmeter -n -t tests/load-test.jmx -l results/load-$(date +%Y%m%d).jtl -j logs/load-$(date +%Y%m%d).log

# Generate HTML report
jmeter -g results/load-$(date +%Y%m%d).jtl -o reports/load-$(date +%Y%m%d)
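To compare a run against the p95 response-time targets in this runbook without opening the HTML report, the percentile can be computed directly from the results file. The sketch below assumes the JTL file is in JMeter's CSV format with a header row and `elapsed` (milliseconds) as the second column; `jtl_p95` is an illustrative helper name.

```shell
#!/bin/bash
# Print the 95th percentile (nearest-rank method) of the elapsed column.
# $1 = path to a CSV-format JTL file with a header row.
jtl_p95() {
  tail -n +2 "$1" | cut -d, -f2 | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Integer ceiling of NR * 0.95
      rank = int((NR * 95 + 99) / 100)
      print v[rank]
    }'
}
```

For example, `jtl_p95 results/load-$(date +%Y%m%d).jtl` prints the p95 latency in milliseconds for the load test run on the same day.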

Backup and Recovery Operations

Backup Verification

Regular verification of backups is essential to ensure recoverability.

Database Backup Verification

# Check RDS automated backups
aws rds describe-db-snapshots --db-instance-identifier api-connect-db --snapshot-type automated --region us-east-1 --query "DBSnapshots[?SnapshotCreateTime>='$(date -d "yesterday" +%Y-%m-%d)'].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Status:Status}" --output table

# Verify snapshot type, encryption, status, and progress
aws rds describe-db-snapshots --db-snapshot-identifier [snapshot-id] --region us-east-1 --query "DBSnapshots[*].{Type:SnapshotType,Encrypted:Encrypted,Status:Status,Progress:PercentProgress}" --output table

Configuration Backup Verification

# Check API Connect configuration backups
aws s3 ls s3://api-connect-backups/config/$(date -d "yesterday" +%Y-%m-%d)/

# Confirm the backup object exists and inspect its size and ETag (head-object does not validate the archive contents)
aws s3api head-object --bucket api-connect-backups --key config/$(date -d "yesterday" +%Y-%m-%d)/config-backup.zip

Kubernetes Resource Backup Verification

# Check Kubernetes resource backups
aws s3 ls s3://api-connect-backups/kubernetes/$(date -d "yesterday" +%Y-%m-%d)/

# Verify backup contents
aws s3 cp s3://api-connect-backups/kubernetes/$(date -d "yesterday" +%Y-%m-%d)/resources.yaml /tmp/
kubectl apply --dry-run=client -f /tmp/resources.yaml

Periodic Recovery Testing

Recovery testing should be performed regularly to validate backup effectiveness.

Monthly Recovery Test Procedure

Database Recovery Test:

# Create test instance from snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier api-connect-test-restore \
  --db-snapshot-identifier [snapshot-id] \
  --db-instance-class db.t3.medium \
  --no-multi-az \
  --region us-east-1

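# Wait until the restored instance is available before connecting
aws rds wait db-instance-available --db-instance-identifier api-connect-test-restore --region us-east-1

# Verify database access (replace the placeholder password and endpoint; do not commit real credentials)
PGPASSWORD=password psql -h api-connect-test-restore.abcdefghijk.us-east-1.rds.amazonaws.com -U apic_admin -d apic_db -c "SELECT count(*) FROM users;"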

# Clean up test instance
aws rds delete-db-instance --db-instance-identifier api-connect-test-restore --skip-final-snapshot --region us-east-1

Configuration Recovery Test:

# Set up test environment
kubectl create namespace api-connect-recovery-test

# Apply recovered resources
kubectl apply -f recovered-resources.yaml -n api-connect-recovery-test

# Verify resources created successfully
kubectl get all -n api-connect-recovery-test

# Clean up
kubectl delete namespace api-connect-recovery-test

Point-In-Time Recovery Procedure

For database recovery to a specific point in time:

# Restore database to point in time
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier api-connect-db \
  --target-db-instance-identifier api-connect-recovered \
  --restore-time $(date -u -d "2023-06-15 14:30:00" +%Y-%m-%dT%H:%M:%SZ) \
  --db-instance-class db.m5.large \
  --region us-east-1

# Update database connection in Kubernetes
# 1. Update configmap with new endpoint
kubectl edit configmap database-config -n api-connect
# 2. Restart pods to pick up new configuration
kubectl rollout restart deployment -n api-connect

For detailed recovery procedures, see Disaster Recovery.

User Management Operations

User Onboarding

API Connect Administrator Onboarding

Request Processing:
- Validate user request and approvals
- Determine appropriate role assignment

Platform Role Assignment:

# Create platform role binding
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-connect-admin-binding-[username]
  namespace: api-connect
subjects:
- kind: User
  name: [user-email]
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: api-connect-admin
  apiGroup: rbac.authorization.k8s.io
EOF

API Manager Role Assignment:
- Log in to API Manager UI
- Navigate to User Management
- Add user with appropriate role
- Configure LDAP/SAML mapping if applicable
Access Verification:
- Verify user can log in to API Manager
- Confirm appropriate permissions
- Document access granted

API Developer Onboarding

Process Developer Request:
- Validate user and organization
- Determine appropriate access level
Create Developer Account:
- Use API Manager UI to create account
- Assign to appropriate organization
- Set correct permission level
API Product Access:
- Grant access to required API products
- Configure rate limits and quotas
- Set subscription approval requirements
Notification:
- Send welcome email with access instructions
- Provide documentation links
- Specify support channels

User Offboarding

Administrator Offboarding

Revoke Platform Access:

# Remove platform role binding
kubectl delete rolebinding api-connect-admin-binding-[username] -n api-connect

Revoke API Manager Access:
- Log in to API Manager UI
- Navigate to User Management
- Deactivate user account
- Remove role assignments
Audit Access Removal:
- Verify user can no longer access systems
- Document access removal
- Update access records

Developer Offboarding

Revoke Developer Portal Access:
- Deactivate user in Developer Portal
- Revoke API subscriptions if necessary
- Transfer application ownership if needed
Client Credential Management:
- Invalidate client secrets
- Revoke OAuth tokens
- Deactivate API keys
Notification:
- Notify relevant stakeholders
- Document offboarding completion

User Access Review

Regular access reviews should be conducted to maintain security.

Quarterly Access Review Process

Generate Access Reports:

# Get platform role bindings (-o wide includes the bound users, groups, and service accounts)
kubectl get rolebinding -n api-connect -o wide > access-report-platform.txt

# Export API Manager users (using UI or APIs)

Review Current Access:
- Compare against authorized user list
- Identify discrepancies
- Document findings
Remediate Issues:
- Remove unauthorized access
- Update access documentation
- Adjust processes if needed
Documentation:
- Document review completion
- Update access records
- Report to security team
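The comparison against the authorized user list can be scripted once both lists are available as plain text, one identity per line. The following is a minimal sketch under that assumption; `unauthorized_users` and the file names in the usage comment are hypothetical.

```shell
#!/bin/bash
# Print identities that have access but are not on the authorized list.
# $1 = file of identities with access, $2 = file of authorized identities
# (both one entry per line; matching is exact and case-sensitive).
unauthorized_users() {
  local actual="$1" authorized="$2"
  grep -vxFf "$authorized" "$actual" | sort -u
}

# Example (commented out; requires cluster access):
# kubectl get rolebinding -n api-connect \
#   -o jsonpath='{range .items[*].subjects[*]}{.name}{"\n"}{end}' > actual-users.txt
# unauthorized_users actual-users.txt authorized-users.txt
```

Empty output means every bound identity is on the authorized list. Any line printed is a discrepancy to document and remediate.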

Common Operational Tasks

API Management

Publishing a New API

API Creation:
- Develop API in API Manager
- Configure security and policies
- Set rate limits and quotas
Testing:
- Test API functionality
- Verify security controls
- Check performance characteristics
Publication:
- Publish to appropriate catalog
- Configure visibility settings
- Set subscription requirements
Verification:
- Verify API appears in Developer Portal
- Test API through gateway
- Check analytics capture

Managing API Lifecycle

API Versioning:
- Create new version in API Manager
- Update documentation
- Deprecate old version if needed
API Retirement:
- Notify subscribed users
- Set deprecation period
- Configure sunset header
- Eventually retire API
API Product Management:
- Create product bundles
- Configure plans and pricing
- Manage visibility

OAuth Management

OAuth Client Creation

Client Registration:
- Register new OAuth client
- Generate client ID and secret
- Configure redirect URIs
- Set appropriate scopes
Client Testing:
- Test token acquisition
- Verify scope enforcement
- Validate token usage

Token Management

Token Inspection:

# Check token information
curl -X GET "https://api-connect-oauth.example.com/api/token/[token-id]" \
  -H "Authorization: Bearer [admin-token]"

Token Revocation:

# Revoke specific token
curl -X POST "https://api-connect-oauth.example.com/api/token/revoke" \
  -H "Content-Type: application/json" \
  -d '{"token": "[token-to-revoke]", "client_id": "[client-id]"}'

Client Secret Reset:

# Reset client secret
curl -X POST "https://api-connect-oauth.example.com/api/clients/[client-id]/reset-secret" \
  -H "Authorization: Bearer [admin-token]"

Rate Limit Management

Modifying Rate Limits

Update Global Rate Limits:

# Edit rate limit configuration
kubectl edit configmap rate-limit-config -n api-connect
# Update global rate limits

# Apply changes
kubectl rollout restart deployment gateway-deployment -n api-connect

Update API-Specific Rate Limits:
- Log in to API Manager UI
- Navigate to API settings
- Update rate limit policy
- Republish API
Update Plan Rate Limits:
- Log in to API Manager UI
- Navigate to Plans
- Update rate limits for plan
- Update products using the plan

Monitoring Rate Limit Enforcement

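# Check rate limit events in logs (with a label selector, kubectl logs returns only the last 10 lines per pod unless --tail is set)
kubectl logs -n api-connect -l app=gateway --tail=1000 | grep -i "rate limit" | tail -100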

# Query Splunk for rate limit events
index=api_connect sourcetype=gateway-logs "rate limit exceeded" | timechart count by client_id

Certificate Management

Certificate Expiration Monitoring

# Check ACM certificate expiration
aws acm list-certificates --region us-east-1 | jq -r '.CertificateSummaryList[].CertificateArn' | while read -r arn; do
  aws acm describe-certificate --certificate-arn "$arn" --region us-east-1 | jq -r '.Certificate | "\(.DomainName) expires on \(.NotAfter)"'
done

# Check Kubernetes TLS secrets
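kubectl get secrets -n api-connect --field-selector type=kubernetes.io/tls -o name | while read -r secret; do
  # Print the secret name so each expiry date can be traced back to its source
  printf '%s: ' "$secret"
  kubectl get "$secret" -n api-connect -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
done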
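The `notAfter=` line printed by `openssl x509 -enddate` can be converted to a number of days remaining and compared with the 30-day warning and 7-day critical thresholds for certificate expiry. The sketch below assumes GNU `date` (as the other commands in this runbook do); `cert_days_left` is an illustrative name.

```shell
#!/bin/bash
# Print whole days until a certificate expires (negative if already expired).
# $1 = openssl enddate output, e.g. "notAfter=Jun 15 12:00:00 2025 GMT"
# $2 = optional reference time in epoch seconds (defaults to now).
cert_days_left() {
  local not_after="${1#notAfter=}"
  local now="${2:-$(date +%s)}"
  local expiry
  expiry=$(date -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}
```

For example, piping a certificate through `openssl x509 -noout -enddate` and passing the resulting line to `cert_days_left` yields a value that can be alerted on when it drops below 30 or 7.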

For detailed certificate management procedures, see Certificate Management.

Environment-Specific Operations

Development Environment

Purpose: Active development and testing
User Access: Broader access for development team
Stability Requirements: Lower than production
Maintenance Windows: More flexible, less formal
Data Requirements: Test data, may be refreshed regularly

Operational Focus:

Support development activities
Quick resolution of blocking issues
Regular environment refreshes
Flexible configuration changes

Special Operations:

# Refresh development environment
kubectl delete pods --all -n api-connect

# Reset development database (destructive: deletes all non-admin users; development only)
kubectl exec -it postgres-util -n api-connect -- bash -c "PGPASSWORD=password psql -h api-connect-dev-db.cluster-xyz.us-east-1.rds.amazonaws.com -U apic_admin -d apic_db -c \"DELETE FROM users WHERE username NOT LIKE 'admin%';\""

# Install development tools
kubectl apply -f dev-tools.yaml -n api-connect

Testing Environment

Purpose: QA and automated testing
User Access: QA team, limited development access
Stability Requirements: Moderate, stable during test cycles
Maintenance Windows: Coordinated with testing schedules
Data Requirements: Controlled test data sets

Operational Focus:

Support testing activities
Maintain test data integrity
Coordinate with test automation
Track environment differences from production

Special Operations:

# Reset test data
kubectl exec -it [pod-name] -n api-connect -- bash -c "cd /opt/test-data && ./reset-test-data.sh"

# Run test suite
kubectl create job --from=cronjob/integration-tests manual-test-run -n api-connect

# Check test results
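# With a label selector, kubectl logs shows only the last 10 lines per pod unless --tail is set
kubectl logs -n api-connect -l job-name=manual-test-run --tail=-1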
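
The three steps above can be chained so that logs are read only after the job has finished. This is a sketch under assumptions: the function name `run_integration_tests` and the 900-second default timeout are hypothetical, and deleting a previous job with the same name is assumed to be acceptable in the testing environment.

```shell
# Sketch: start the integration-test job, wait for completion, then print its logs.
# run_integration_tests and the 900s default timeout are assumptions for this example.
run_integration_tests() {
  job="${1:-manual-test-run}"
  ns="${2:-api-connect}"
  timeout="${3:-900s}"
  # A Job name can be used only once, so remove any earlier run first.
  kubectl delete job "$job" -n "$ns" --ignore-not-found
  kubectl create job --from=cronjob/integration-tests "$job" -n "$ns" || return 1
  # Non-zero exit if the job does not reach the Complete condition in time.
  kubectl wait --for=condition=complete "job/$job" -n "$ns" --timeout="$timeout"
  status=$?
  # --tail=-1 prints the full log; with -l the default is the last 10 lines.
  kubectl logs -n "$ns" -l "job-name=$job" --tail=-1
  return "$status"
}
```

Note that `kubectl wait --for=condition=complete` keeps waiting until the timeout when the job fails, so a non-zero return can mean either a failed or a slow run. The printed logs distinguish the two.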

Staging Environment

Purpose: Pre-production validation
User Access: Limited to operations and release teams
Stability Requirements: High, mirrors production
Maintenance Windows: Simulated production windows
Data Requirements: Production-like data (anonymized if needed)

Operational Focus:

Validate production changes
Performance testing
Production-like operations
Security validation

Special Operations:

# Sync staging with production configuration
./sync-staging-with-prod.sh

# Run performance test
kubectl create job --from=cronjob/performance-test perf-test-run -n api-connect

# Validate configuration
kubectl diff -f staging-vs-prod.yaml

Production Environment

Purpose: Business operations
User Access: Highly restricted
Stability Requirements: Maximum stability
Maintenance Windows: Formal, scheduled windows only
Data Requirements: Live business data, high protection

Operational Focus:

Maximize availability and performance
Strict change control
Comprehensive monitoring
Rapid incident response

Special Operations:

# Production health check
./production-health-check.sh

# Check SLO compliance
curl -X GET "https://your-tenant.dynatrace.com/api/v2/slo" \
  -H "Authorization: Api-Token $DYNATRACE_TOKEN" | jq '.slo[] | {name: .name, status: .status}'

# Emergency rollback (requires approval)
kubectl rollout undo deployment [deployment-name] -n api-connect
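
Because the rollback requires approval, it can help to wrap it so that it refuses to run without an approval reference and then waits for the rollout to settle. The sketch below assumes a hypothetical `CHANGE_TICKET` variable as the approval record and a hypothetical function name `rollback_deployment`. Neither is defined elsewhere in this runbook.

```shell
# Sketch: guarded rollback.
# CHANGE_TICKET is a hypothetical stand-in for the approval reference.
rollback_deployment() {
  deployment="$1"
  ns="${2:-api-connect}"
  if [ -z "${CHANGE_TICKET:-}" ]; then
    echo "Refusing to roll back: set CHANGE_TICKET to the approval reference" >&2
    return 1
  fi
  # List recorded revisions so the operator can confirm what "previous" means.
  kubectl rollout history "deployment/$deployment" -n "$ns"
  kubectl rollout undo "deployment/$deployment" -n "$ns" || return 1
  # Block until the restored revision is fully rolled out, or fail after 5 minutes.
  kubectl rollout status "deployment/$deployment" -n "$ns" --timeout=300s
}
```

Example: `CHANGE_TICKET=[ticket-id] rollback_deployment [deployment-name]`.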

DR Environment

Purpose: Business continuity
User Access: Extremely limited
Stability Requirements: Match production
Maintenance Windows: Coordinated with production
Data Requirements: Synchronized with production

Operational Focus:

Maintain synchronization
Verify failover readiness
Test recovery procedures
Match production configuration

Special Operations:

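# Check replication status
# ReplicaLag is a CloudWatch metric, not a field of describe-db-instances.
# Assumes a standard RDS read replica and GNU date.
aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name ReplicaLag --dimensions Name=DBInstanceIdentifier,Value=api-connect-db-replica --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 60 --statistics Average --region us-west-2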

# DR readiness check
./verify-dr-readiness.sh

# DR failover test (scheduled)
./dr-failover-test.sh
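
Failover readiness depends on replication lag staying under an agreed limit. The sketch below turns a lag reading into a pass/fail result. `MAX_LAG_SECONDS` (60 here) is an assumed threshold that should be derived from the actual RPO, and the function names are hypothetical. `replica_lag_seconds` assumes a standard RDS read replica that publishes the CloudWatch `ReplicaLag` metric, plus GNU `date`.

```shell
# Sketch: decide whether replica lag is acceptable for failover.
# MAX_LAG_SECONDS is an assumed threshold; derive the real value from the RPO.
MAX_LAG_SECONDS="${MAX_LAG_SECONDS:-60}"

# lag_status LAG_SECONDS
# Prints OK, BREACH or UNKNOWN; exit status is 0 only for OK.
lag_status() {
  awk -v lag="$1" -v max="$MAX_LAG_SECONDS" 'BEGIN {
    if (lag !~ /^[0-9]+(\.[0-9]+)?$/) { print "UNKNOWN"; exit 2 }
    if (lag + 0 <= max + 0) { print "OK"; exit 0 }
    print "BREACH"; exit 1
  }'
}

# replica_lag_seconds [REPLICA_ID] [REGION]
# Prints the first returned ReplicaLag average from the last 10 minutes,
# or "None" when CloudWatch returned no datapoints.
replica_lag_seconds() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=${1:-api-connect-db-replica}" \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 600 --statistics Average \
    --query 'Datapoints[0].Average' --output text \
    --region "${2:-us-west-2}"
}
```

Usage: `lag_status "$(replica_lag_seconds)"`. A missing datapoint yields `UNKNOWN` rather than `OK`, so a silent replica is not mistaken for a healthy one.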

Troubleshooting Procedures

Flowchart for General Troubleshooting

graph TD
    A[Issue Detected] --> B{Scope?}
    B -->|Platform-wide| C[Check Infrastructure]
    B -->|Component-specific| D[Check Component]
    B -->|API-specific| E[Check API Configuration]
    B -->|User-specific| F[Check User Permissions]
    
    C --> C1[Check Node Status]
    C --> C2[Check Control Plane]
    C --> C3[Check Network Connectivity]
    
    D --> D1[Check Pod Status]
    D --> D2[Check Logs]
    D --> D3[Check Resource Usage]
    
    E --> E1[Check API Definition]
    E --> E2[Check Gateway Policy]
    E --> E3[Check Backend Connectivity]
    
    F --> F1[Check Authentication]
    F --> F2[Check Authorization]
    F --> F3[Check Rate Limits]
    
    C1 --> G[Resolution Steps]
    C2 --> G
    C3 --> G
    D1 --> G
    D2 --> G
    D3 --> G
    E1 --> G
    E2 --> G
    E3 --> G
    F1 --> G
    F2 --> G
    F3 --> G
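
The first level of the flowchart can be sketched as a small dispatcher that prints the initial commands for each scope. `first_checks` is a hypothetical helper. The commands it prints are the ones used elsewhere in this runbook, and the bracketed placeholders must be filled in by the operator.

```shell
# Sketch: print the first diagnostic commands for each branch of the flowchart.
# first_checks is a hypothetical helper; it only prints commands, it runs nothing.
first_checks() {
  case "$1" in
    platform)
      echo "kubectl get nodes"
      echo "kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp"
      ;;
    component)
      echo "kubectl get pods -n api-connect"
      echo "kubectl logs -n api-connect [pod-name]"
      echo "kubectl top pods -n api-connect"
      ;;
    api)
      echo "kubectl logs -n api-connect -l app=gateway --tail=1000"
      echo "kubectl exec -it -n api-connect [gateway-pod-name] -- curl -v [backend-service-url]"
      ;;
    user)
      echo "kubectl logs -n api-connect -l app=oauth --tail=1000"
      ;;
    *)
      echo "usage: first_checks platform|component|api|user" >&2
      return 2
      ;;
  esac
}
```

For example, `first_checks component` lists the pod status, log, and resource usage commands in the order the flowchart suggests.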

Common Error Scenarios and Resolutions

API Gateway 5XX Errors

Symptoms:

API clients receiving 5XX responses
Error rate alerts
Backend service connectivity issues

Diagnostic Steps:

Check Gateway pod status:

kubectl get pods -n api-connect -l app=gateway

Check Gateway logs:

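# With a label selector, kubectl logs shows only the last 10 lines per pod unless --tail is set
kubectl logs -n api-connect -l app=gateway --tail=1000 | grep -i error | tail -100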

Check backend service connectivity:

kubectl exec -it -n api-connect [gateway-pod-name] -- curl -v [backend-service-url]

Resolution Steps:

If gateway pods are unhealthy, restart them:

kubectl rollout restart deployment gateway-deployment -n api-connect

If backend service is unavailable, check the service:
- Verify service health
- Check network connectivity
- Check security settings

If configuration issue, update gateway config:

kubectl edit configmap gateway-config -n api-connect
# Update configuration as needed

# Apply changes
kubectl rollout restart deployment gateway-deployment -n api-connect
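
Before restarting anything, it is worth separating "gateway cannot reach the backend" from "backend answers with an error". The sketch below classifies the HTTP status seen from inside a gateway pod. The function names are hypothetical, and it assumes `curl` is available in the gateway container, as the diagnostic step above already does.

```shell
# classify_backend_status HTTP_CODE
# Maps the status code reported by curl to a coarse diagnosis.
classify_backend_status() {
  case "$1" in
    000) echo "unreachable" ;;      # curl prints 000 when no HTTP response arrived
    5??) echo "backend-error" ;;    # backend answered, but with a server error
    [1-4]??) echo "reachable" ;;
    *) echo "unknown" ;;
  esac
}

# probe_backend POD URL [NAMESPACE]
# Runs curl inside the given pod and prints the URL, status code and diagnosis.
probe_backend() {
  pod="$1"
  url="$2"
  ns="${3:-api-connect}"
  code=$(kubectl exec -n "$ns" "$pod" -- \
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "$url -> $code ($(classify_backend_status "$code"))"
}
```

An `unreachable` result points to networking or security settings, while `backend-error` points to the backend service itself. In both cases a gateway restart is unlikely to help.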

Authentication Failures

Symptoms:

Users unable to log in
OAuth token acquisition failures
Increased 401/403 errors

Diagnostic Steps:

Check authentication logs:

# With a label selector, kubectl logs shows only the last 10 lines per pod unless --tail is set
kubectl logs -n api-connect -l app=gateway --tail=1000 | grep -i "auth\|token" | tail -100

Verify identity provider connectivity:

kubectl exec -it -n api-connect [pod-name] -- curl -v [identity-provider-url]

Check OAuth service status:

kubectl get pods -n api-connect -l app=oauth
kubectl logs -n api-connect -l app=oauth --tail=1000 | grep -i error

Resolution Steps:

If identity provider issue:
- Check connectivity
- Verify configuration
- Check if service is operational

If OAuth service issue:

Restart OAuth pods:

kubectl rollout restart deployment oauth-deployment -n api-connect

Check configuration:

kubectl edit configmap oauth-config -n api-connect

If credential issue:
- Reset client credentials
- Verify correct usage
- Check for expired tokens

Performance Degradation

Symptoms:

Increased API response times
Timeout errors
Resource utilization alerts

Diagnostic Steps:

Check resource utilization:

kubectl top pods -n api-connect
kubectl top nodes

Check for slow database queries:
- Review RDS Performance Insights
- Check for long-running transactions

Check for high traffic patterns:
- Review API traffic metrics
- Look for unusual access patterns

Resolution Steps:

If resource constraints:

Scale out resources:

kubectl scale deployment gateway-deployment -n api-connect --replicas=8

Increase resource limits:

kubectl edit deployment gateway-deployment -n api-connect
# Update resource limits

If database performance:

Optimize slow queries
Check connection pool

Consider instance scaling:

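# Changing the instance class with --apply-immediately interrupts the database now (a failover on Multi-AZ)
aws rds modify-db-instance --db-instance-identifier api-connect-db --db-instance-class db.m5.2xlarge --apply-immediately --region us-east-1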

If traffic pattern issue:
- Implement or adjust rate limiting
- Check for abusive clients
- Consider traffic shaping
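
When scaling out for resource constraints, the replica count can be estimated rather than guessed. The sketch below applies the same proportional rule the Kubernetes Horizontal Pod Autoscaler uses, `ceil(currentReplicas * currentUtilization / targetUtilization)`. The function name and the example utilization figures are assumptions for illustration.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT
# Integer ceiling of current * (currentUtil / targetUtil).
# All three arguments must be positive integers.
desired_replicas() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}
```

For example, six gateway replicas averaging 80% CPU with a 60% target give `desired_replicas 6 80 60`, which prints `8`. That value can be passed to `kubectl scale deployment gateway-deployment -n api-connect --replicas=8`.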

Kubernetes Issues

Symptoms:

Pod scheduling failures
Control plane issues
Network connectivity problems

Diagnostic Steps:

Check node status:

kubectl get nodes
kubectl describe node [problem-node]

Check pod events:

kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp | tail -50

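Check control plane status:

```
# componentstatuses is deprecated since Kubernetes 1.19; query the API server health endpoint instead
kubectl get --raw='/readyz?verbose'

# EKS cluster status as reported by AWS
aws eks describe-cluster --name [cluster-name] --query "cluster.status" --region us-east-1
```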

Resolution Steps:

If node issues:

Drain problematic node:

kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data

Investigate node problems:

aws ec2 describe-instances --instance-ids [instance-id] --region us-east-1

Replace if necessary:

aws ec2 terminate-instances --instance-ids [instance-id] --region us-east-1

If pod scheduling issues:
- Check resource availability
- Check node selectors/taints
- Check PV/PVC status

If control plane issues:
- Contact AWS for EKS control plane issues
- Check for AWS status page incidents
- Verify networking to control plane
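
The node-level steps need the EC2 instance ID that backs a Kubernetes node. On AWS the node's `spec.providerID` has the form `aws:///<availability-zone>/<instance-id>`, so the ID can be derived rather than looked up by hand. The function names below are hypothetical.

```shell
# instance_id_from_provider_id PROVIDER_ID
# Strips everything up to the last "/" of an AWS providerID.
instance_id_from_provider_id() {
  echo "${1##*/}"
}

# node_instance_id NODE_NAME
# Looks up the providerID of a node and prints the EC2 instance ID.
node_instance_id() {
  provider_id=$(kubectl get node "$1" -o jsonpath='{.spec.providerID}') || return 1
  instance_id_from_provider_id "$provider_id"
}
```

Usage: `aws ec2 describe-instances --instance-ids "$(node_instance_id [node-name])" --region us-east-1`.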

Operational Metrics and Reporting

Regular Reports

Report	Frequency	Audience	Content
Daily Status Report	Daily	SRE Team	System health, incidents, alerts
Weekly Performance Report	Weekly	SRE Team, Product Owners	Performance trends, capacity usage, API metrics
Monthly SLO Report	Monthly	SRE Team, Management	SLO compliance, availability, incident summary
Quarterly Business Review	Quarterly	Product Team, Executive	Strategic metrics, growth trends, capacity needs

Operational Metrics

Metric Category	Example Metrics	Reporting Frequency
Availability	Uptime percentage, outage count/duration	Daily, Weekly, Monthly
Performance	Response time trends, throughput patterns	Weekly
Capacity	Resource utilization trends, scaling events	Weekly
Incidents	Count by severity, MTTR, recurring issues	Weekly, Monthly
SLOs	SLO compliance, error budget consumption	Monthly
Business Metrics	API call volume, user growth, revenue impact	Monthly, Quarterly
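
Error budget consumption in the SLO row follows directly from the SLO target and the observed availability. The sketch below shows the arithmetic; the function names and the 99.9% example target are assumptions, not the platform's agreed SLOs.

```shell
# error_budget_consumed SLO_TARGET_PCT OBSERVED_AVAILABILITY_PCT
# Prints the share of the error budget used, in percent (one decimal).
# The target must be below 100, otherwise the budget is zero.
error_budget_consumed() {
  awk -v target="$1" -v observed="$2" 'BEGIN {
    printf "%.1f\n", (100 - observed) / (100 - target) * 100
  }'
}

# allowed_downtime_minutes SLO_TARGET_PCT WINDOW_DAYS
# Prints the total downtime the SLO permits over the window, in minutes.
allowed_downtime_minutes() {
  awk -v target="$1" -v days="$2" 'BEGIN {
    printf "%.1f\n", days * 24 * 60 * (100 - target) / 100
  }'
}
```

With a 99.9% target, `allowed_downtime_minutes 99.9 30` prints `43.2`, and an observed availability of 99.95% gives `error_budget_consumed 99.9 99.95`, which prints `50.0`.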

Report Generation

// Jenkins pipeline for report generation
pipeline {
    agent any
    triggers {
        cron('0 7 * * 1') // Weekly on Monday at 7 AM
    }
    stages {
        stage('Gather Metrics') {
            steps {
                sh '''
                # Get availability data
                curl -X GET "https://your-tenant.dynatrace.com/api/v2/metrics/query?metricSelector=builtin:service.availability:filter(eq(service.name,API%20Gateway))&from=-7d&to=now" -H "Authorization: Api-Token $DYNATRACE_TOKEN" > availability.json

                # Get performance data
                curl -X GET "https://your-tenant.dynatrace.com/api/v2/metrics/query?metricSelector=builtin:service.response.time:filter(eq(service.name,API%20Gateway)):splitBy():avg&from=-7d&to=now" -H "Authorization: Api-Token $DYNATRACE_TOKEN" > performance.json

                # Get utilization data
                kubectl top nodes > node-utilization.txt
                kubectl top pods -n api-connect > pod-utilization.txt

                # Get SLO data
                curl -X GET "https://your-tenant.dynatrace.com/api/v2/slo" -H "Authorization: Api-Token $DYNATRACE_TOKEN" > slo.json
                '''
            }
        }
        stage('Generate Report') {
            steps {
                sh '''
                # Run report generation script
                python generate_weekly_report.py > weekly-report.html
                '''
            }
        }
        stage('Distribute Report') {
            steps {
                emailext body: '${FILE,path="weekly-report.html"}',
                    mimeType: 'text/html',
                    subject: "API Connect Weekly Performance Report - ${BUILD_TIMESTAMP}",
                    to: '[email protected],[email protected]'
            }
        }
    }
    post {
        always {
            archiveArtifacts artifacts: 'weekly-report.html', fingerprint: true
        }
    }
}

References and Resources

Quick Reference Commands

Task	Command
Check node status	`kubectl get nodes`
Check pod status	`kubectl get pods -n api-connect`
View pod logs	`kubectl logs -n api-connect [pod-name]`
Check pod resource usage	`kubectl top pods -n api-connect`
Restart a deployment	`kubectl rollout restart deployment [deployment-name] -n api-connect`
Scale a deployment	`kubectl scale deployment [deployment-name] -n api-connect --replicas=[count]`
Check service status	`kubectl get svc -n api-connect`
Check recent events	`kubectl get events -n api-connect --sort-by=.metadata.creationTimestamp`
Execute command in pod	`kubectl exec -it [pod-name] -n api-connect -- [command]`
Check RDS status	`aws rds describe-db-instances --db-instance-identifier api-connect-db --region us-east-1`
Check recent backups	`aws rds describe-db-snapshots --db-instance-identifier api-connect-db --snapshot-type automated --region us-east-1`
Check certificate expiry	`aws acm describe-certificate --certificate-arn [arn] --region us-east-1`

Operational Checklists

Daily Operational Checklist

## API Connect Daily Health Check

### Infrastructure
- [ ] All nodes are in Ready state
- [ ] All pods are in Running state (or Completed for jobs)
- [ ] Resource utilization within normal ranges
- [ ] No concerning events in Kubernetes logs

### API Connect Components
- [ ] Gateway services operational
- [ ] Management services operational
- [ ] Portal services operational
- [ ] Analytics services operational

### Database
- [ ] Database instance healthy
- [ ] Connection counts within normal ranges
- [ ] No slow query issues
- [ ] Backup completed successfully

### Security
- [ ] No unusual authentication failures
- [ ] No suspicious access patterns
- [ ] Certificate status validated

### Monitoring
- [ ] All monitoring systems operational
- [ ] SLOs within targets
- [ ] No unacknowledged alerts
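
The first two infrastructure items of the daily checklist can be checked mechanically. The sketch below filters `kubectl` output for anything that is not in the expected state. The function names are hypothetical, and the column positions assume the default `kubectl get ... --no-headers` layout.

```shell
# not_ready_nodes: reads `kubectl get nodes --no-headers` on stdin.
# Prints every node whose STATUS column is not exactly "Ready"
# (this includes cordoned nodes such as "Ready,SchedulingDisabled").
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 " (" $2 ")" }'
}

# unhealthy_pods: reads `kubectl get pods -A --no-headers` on stdin.
# Columns are NAMESPACE NAME READY STATUS ...; Completed is fine for jobs.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " (" $4 ")" }'
}
```

Usage: `kubectl get nodes --no-headers | not_ready_nodes` and `kubectl get pods -A --no-headers | unhealthy_pods`. Empty output from both means the two checklist items pass.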

Weekly Operational Checklist

## API Connect Weekly Operational Review

### Performance
- [ ] Review performance trends for all components
- [ ] Identify any degradation patterns
- [ ] Check for resource bottlenecks
- [ ] Validate API response times

### Capacity
- [ ] Review resource utilization trends
- [ ] Check for approaching thresholds
- [ ] Update capacity forecasts if needed
- [ ] Plan for any scaling needs

### Backups
- [ ] Verify all backups completed successfully
- [ ] Check backup retention compliance
- [ ] Perform recovery test if scheduled
- [ ] Validate backup procedures

### Security
- [ ] Review security events and logs
- [ ] Check for certificate expirations
- [ ] Validate access controls
- [ ] Review security scan results

### Maintenance
- [ ] Review upcoming maintenance needs
- [ ] Schedule required maintenance
- [ ] Verify patch status
- [ ] Update documentation as needed

Operational Documentation

The SRE team should maintain the following operational documentation:

Runbooks: Detailed procedures for specific tasks (this document)
Incident Postmortems: Documentation of past incidents and their resolution
Change Logs: Record of all changes made to the platform
SLO Reports: Regular reporting on SLO compliance
Capacity Plans: Documentation of capacity planning and forecasts
Architecture Diagrams: Up-to-date diagrams of the platform architecture
Knowledge Base: Collection of troubleshooting tips and solutions

Contact Information

Role	Contact	Availability
SRE Team	[email protected]	24/7 via Teams
Database Team	[email protected]	Business hours + on-call
Network Team	[email protected]	Business hours + on-call
Security Team	[email protected]	Business hours + on-call
Product Team	[email protected]	Business hours
IBM Support	IBM Support Portal (Case #IBM-12345)	24/7 with support contract
AWS Support	AWS Support Portal (Account #AWS-67890)	24/7 with support contract