Main Runbook - CodySchluenz/tester GitHub Wiki

IBM API Connect Service Overview

Service Description

IBM API Connect is an API management platform, deployed here on AWS EKS, that covers the full API lifecycle: creation, security, management, testing, and monitoring. It is the central hub for all API activity within the organization.

Business Purpose

API Connect provides:

Creation and documentation of REST and SOAP APIs
Secure runtime execution of API calls
Policy enforcement and access control
API analytics and performance monitoring
Self-service developer portal for API consumers
API lifecycle management

Business Impact

This platform is business-critical because it:

Enables integration between internal systems
Provides controlled access to enterprise data
Supports partner integrations
Enables digital product capabilities
Facilitates mobile application functionality

Architecture Overview

API Connect consists of four primary components deployed on AWS EKS. For detailed architecture information, see the Architecture wiki page.

High-Level Architecture

graph TD
    subgraph "API Connect Platform"
        G[API Gateway<br>DataPower] <--> M[Management<br>Subsystem]
        M <--> P[Developer Portal]
        G --> A[Analytics Subsystem]
        M --> A
        P --> A
    end
    
    C[API Consumers] --> G
    D[API Developers] --> M
    D --> P

Component Descriptions

Component	Description	Criticality	Detailed Documentation
API Gateway (DataPower)	Runtime component that processes API requests, enforces security policies, validates requests, and routes to backend services.	Critical - Directly impacts API consumers	Gateway Runbook
Management Subsystem	Provides API lifecycle management, including creation, configuration, testing, and publishing. Includes API Manager UI and backend services.	High - Required for API administration	Management Runbook
Developer Portal	Self-service portal for API consumers to discover, explore, test, and subscribe to APIs.	Medium - Affects API discovery and onboarding	Portal Runbook
Analytics Subsystem	Collects, processes, and visualizes API usage metrics and operational data.	Medium - Affects visibility into platform usage	Analytics Runbook

AWS Infrastructure

The platform is hosted on AWS with the following key services. For detailed infrastructure information, see the Architecture#physical-architecture wiki page.

AWS Service	Usage	Configuration
EKS	Kubernetes orchestration	Version 1.29, deployed across 3 AZs
EC2	Worker nodes for EKS	Auto-scaling node groups with right-sized instances
RDS	PostgreSQL database	Multi-AZ deployment with automated backups
ALB	Load balancing	TLS termination, WAF integration
Route53	DNS management	Health checks, failover configuration
S3	Object storage	Backup storage, artifacts, logging
KMS	Encryption	Secrets and data encryption

Network Architecture

The network design separates public ingress from private workloads, as summarized below. For detailed network architecture, see the Architecture#network-architecture wiki page.

VPC with public and private subnets across 3 AZs
API traffic flows through public ALB to private Gateway services
Internal components operate in private subnets
NAT Gateways for outbound connectivity
VPC endpoints for AWS service access
Network ACLs and security groups for traffic control

Environments

Environment	Purpose	URL	AWS Region	Access
Production	Business operations	api.example.com	us-east-1	Restricted
DR	Disaster recovery	dr.api.example.com	us-west-2	Emergency only
Staging	Pre-production validation	staging-api.example.com	us-east-1	Limited
Testing	QA and automated testing	test-api.example.com	us-east-2	Team access
Development	Development work	dev-api.example.com	us-east-2	Developer access

For environment-specific details, see the Architecture#environment-comparison wiki page.

Service Level Objectives

Service	Metric	Target	Measurement
API Gateway	Availability	99.95%	30-day rolling window
API Gateway	Response Time (p95)	< 300ms	30-day rolling window
All Services	Error Rate	< 0.1%	30-day rolling window
Management Services	Availability	99.9%	30-day rolling window
Developer Portal	Availability	99.9%	30-day rolling window

For detailed SLO definitions and monitoring, see the Observability#slis-slos wiki page.
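
The availability targets above can be translated into a downtime budget per 30-day window. The following is a minimal sketch of that arithmetic; the function name `allowed_downtime` and the dictionary layout are illustrative assumptions, not part of any monitoring tool configuration.

```python
from datetime import timedelta

# 30-day rolling window, as stated in the SLO table.
WINDOW = timedelta(days=30)

# Availability targets (percent) copied from the SLO table.
AVAILABILITY_TARGETS = {
    "API Gateway": 99.95,
    "Management Services": 99.9,
    "Developer Portal": 99.9,
}


def allowed_downtime(target_percent: float, window: timedelta = WINDOW) -> timedelta:
    """Return the downtime budget implied by an availability target."""
    unavailable_fraction = (100.0 - target_percent) / 100.0
    # Round to whole seconds to avoid floating-point noise.
    return timedelta(seconds=round(window.total_seconds() * unavailable_fraction))


for service, target in AVAILABILITY_TARGETS.items():
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{service}: {minutes:.1f} minutes per 30 days")
```

Under these targets the API Gateway budget works out to 21.6 minutes and the 99.9% services to 43.2 minutes per window. This page does not state whether planned maintenance counts against the budget; if it did, a two-hour monthly window would exceed the 99.9% budget on its own, so the SLO definitions are worth checking on that point.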

Maintenance Windows

Component	Window	Frequency	Impact
Gateway Services	None (Rolling updates)	As needed	No downtime
Management Services	Sunday 2:00 AM - 4:00 AM EST	Monthly	UI unavailable
Developer Portal	Sunday 2:00 AM - 4:00 AM EST	Monthly	Portal unavailable
Analytics Services	Sunday 2:00 AM - 4:00 AM EST	Monthly	Analytics unavailable

For detailed maintenance procedures, see the Maintenance-Runbook wiki page.
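
A change script or alert-suppression rule may need to know whether a given moment falls inside the Sunday window. The sketch below shows one way to check that, under two assumptions: "EST" is read literally as UTC-5 (no daylight saving adjustment), and because the table does not say which Sunday of the month is used, only the weekday and time of day are checked. The function name is hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: "EST" is taken literally as a fixed UTC-5 offset.
EST = timezone(timedelta(hours=-5), "EST")

WINDOW_START = time(2, 0)  # 2:00 AM
WINDOW_END = time(4, 0)    # 4:00 AM (exclusive)
SUNDAY = 6                 # datetime.weekday(): Monday is 0, Sunday is 6


def in_sunday_window(moment: datetime) -> bool:
    """True if a timezone-aware `moment` is on a Sunday between 02:00 and 04:00 EST."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(EST)
    return local.weekday() == SUNDAY and WINDOW_START <= local.time() < WINDOW_END
```

For example, `in_sunday_window(datetime(2024, 1, 7, 8, 0, tzinfo=timezone.utc))` evaluates a Sunday at 03:00 EST and returns `True`.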

Monitoring and Alerting

For comprehensive monitoring information, see the Observability wiki page.

Monitoring Tools

Tool	Purpose	Access
Dynatrace	APM, synthetic monitoring, alerting	Dynatrace Portal
Splunk	Log aggregation and analysis	Splunk Portal
AWS CloudWatch	AWS resource monitoring	AWS Console
ServiceNow	Incident management	ServiceNow Portal

Key Dashboards

Dashboard	Purpose	URL
API Connect Overview	Platform-wide health	Dynatrace Dashboard
Gateway Performance	API Gateway metrics	Dynatrace Dashboard
SLO Tracking	SLO compliance	Dynatrace Dashboard
Security Events	Security monitoring	Splunk Dashboard

For dashboard details, see the Observability#dashboards wiki page.

Critical Metrics

Metric	Description	Warning Threshold	Critical Threshold
API Success Rate	% of API calls with 2xx/3xx status	< 99.5%	< 99%
Response Time (p95)	95th percentile of response times	> 300ms	> 500ms
Error Rate	% of 5xx responses	> 0.1%	> 1%
CPU Utilization	Resource usage	> 70%	> 85%
Active DB Connections	Database connections	> 70% of max	> 85% of max

For the complete metrics catalog, see the Observability#key-metrics wiki page.
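
The warning and critical thresholds in the table can be expressed as a small classification routine. This is an illustrative sketch only: the metric keys and the `classify` function are assumed names, and the actual alerting rules live in the monitoring tools. Note that success rate breaches when the value drops below its threshold, while the other metrics breach when the value rises above it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True


# Values copied from the Critical Metrics table; keys are hypothetical names.
THRESHOLDS = {
    "api_success_rate_pct": Threshold(99.5, 99.0, higher_is_worse=False),
    "response_time_p95_ms": Threshold(300, 500),
    "error_rate_pct": Threshold(0.1, 1.0),
    "cpu_utilization_pct": Threshold(70, 85),
    "db_connections_pct_of_max": Threshold(70, 85),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric reading."""
    threshold = THRESHOLDS[metric]

    def breached(limit: float) -> bool:
        # Comparisons are strict, matching the "<" and ">" in the table.
        return value > limit if threshold.higher_is_worse else value < limit

    if breached(threshold.critical):
        return "critical"
    if breached(threshold.warning):
        return "warning"
    return "ok"
```

For instance, a p95 response time of 350 ms classifies as a warning, and a success rate of 98.5% classifies as critical.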

Authentication and Access Control

For detailed security information, see the Access wiki page.

Authentication Methods

Interface	Method	Provider	Notes
Management UI	SAML	Corporate SSO (Okta)	MFA required
Developer Portal	OAuth 2.0 / OpenID Connect	Okta	Self-service registration with approval
API Gateway	Multiple (API Key, OAuth 2.0, JWT, mTLS)	API Connect	Configurable per API
Kubernetes	OIDC	Corporate SSO (Okta)	Role-based access

See Access#authentication-methods for detailed authentication configurations.

Access Control Models

API Connect uses role-based access control (RBAC) with the roles listed below. For detailed access control information, see the Access#authorization-models wiki page.

Role	Description	Access Scope
Administrator	Full platform control	Restricted to SRE team
Operator	Runtime management	SRE team
API Developer	API creation and testing	Development teams
API Administrator	API lifecycle management	API product owners
Consumer Organization Owner	Consumer organization management	External partners
API Consumer	API usage	External developers

Backup and Disaster Recovery

Backup Strategy

Component	Backup Method	Frequency	Retention
RDS Database	Automated snapshots	Daily	30 days
Configuration	S3 backups	Hourly	90 days
API Definitions	Git repository	Continuous	Indefinite
Platform State	EKS resource exports	Daily	30 days

For detailed backup procedures, see the Infrastructure-Runbook#backup-and-disaster-recovery wiki page.
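
The retention periods in the table can be applied to a backup's creation time to decide whether it is due for cleanup. The sketch below is a minimal illustration; `is_past_retention` is a hypothetical helper, and the actual expiry is presumably enforced by RDS snapshot settings and S3 lifecycle rules rather than by a script like this.

```python
from datetime import datetime, timedelta, timezone

# Retention periods copied from the backup table; None means "Indefinite".
RETENTION = {
    "RDS Database": timedelta(days=30),
    "Configuration": timedelta(days=90),
    "API Definitions": None,
    "Platform State": timedelta(days=30),
}


def is_past_retention(component: str, created_at: datetime, now: datetime) -> bool:
    """True if a backup created at `created_at` is older than its retention period."""
    retention = RETENTION[component]
    if retention is None:
        return False  # kept indefinitely
    return now - created_at > retention


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(is_past_retention("RDS Database", now - timedelta(days=31), now))
print(is_past_retention("Configuration", now - timedelta(days=31), now))
```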

Disaster Recovery

Scenario	Strategy	RTO	RPO
AZ Failure	Multi-AZ redundancy	Automatic	0 minutes
Region Failure	Cross-region DR environment	30 minutes	5 minutes
Database Corruption	Point-in-time recovery	2 hours	5 minutes
Configuration Error	Configuration rollback	30 minutes	Depends on detection

For complete disaster recovery procedures, see the Infrastructure-Runbook#disaster-recovery-procedures wiki page.
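
After a DR exercise, the measured recovery time and data-loss window can be compared against the RTO and RPO targets above. The following sketch assumes a hypothetical `meets_targets` helper; entries that the table does not express as a fixed duration ("Automatic", "Depends on detection") are stored as `None` and are not checked.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Target(NamedTuple):
    rto: Optional[timedelta]  # None where the table gives no fixed duration
    rpo: Optional[timedelta]


# Values copied from the Disaster Recovery table.
DR_TARGETS = {
    "AZ Failure": Target(None, timedelta(0)),
    "Region Failure": Target(timedelta(minutes=30), timedelta(minutes=5)),
    "Database Corruption": Target(timedelta(hours=2), timedelta(minutes=5)),
    "Configuration Error": Target(timedelta(minutes=30), None),
}


def meets_targets(scenario: str, recovery_time: timedelta, data_loss: timedelta) -> bool:
    """True if the measured recovery time and data loss are within RTO and RPO."""
    target = DR_TARGETS[scenario]
    rto_ok = target.rto is None or recovery_time <= target.rto
    rpo_ok = target.rpo is None or data_loss <= target.rpo
    return rto_ok and rpo_ok
```

For example, a region failover that took 25 minutes with 4 minutes of data loss is within target, while a database restore that took 3 hours is not.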

Support and Escalation

Support Tiers

Tier	Team	Response Time	Contact Method
L1	24x7 Operations	15 minutes	ServiceNow, PagerDuty
L2	SRE Team	30 minutes	ServiceNow, MS Teams
L3	Platform Engineering	1 hour	ServiceNow, MS Teams
Vendor	IBM Support	Based on severity	IBM Support Portal

Escalation Path

For critical incidents (P1/P2):

Primary On-Call Engineer (immediate)
Secondary On-Call Engineer (+15 minutes)
SRE Team Lead (+30 minutes)
Engineering Manager (+1 hour)
Director of Engineering (+2 hours)

For detailed incident response procedures, see the Operations-Runbook#incident-management wiki page.
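
The escalation path above can be read as a schedule keyed on elapsed time. The sketch below assumes the offsets are measured from the start of the incident and that escalation continues until someone acknowledges; neither is stated explicitly on this page, and `engaged_roles` is an illustrative name rather than a PagerDuty or ServiceNow API.

```python
from datetime import timedelta

# Offsets and roles copied from the escalation path for P1/P2 incidents.
ESCALATION_PATH = [
    (timedelta(0), "Primary On-Call Engineer"),
    (timedelta(minutes=15), "Secondary On-Call Engineer"),
    (timedelta(minutes=30), "SRE Team Lead"),
    (timedelta(hours=1), "Engineering Manager"),
    (timedelta(hours=2), "Director of Engineering"),
]


def engaged_roles(elapsed: timedelta) -> list[str]:
    """Return every role that should have been engaged after `elapsed` time."""
    return [role for offset, role in ESCALATION_PATH if elapsed >= offset]


print(engaged_roles(timedelta(minutes=20)))
```

At 20 minutes into an unacknowledged incident, this returns the primary and secondary on-call engineers.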

Contact Information

Role	Contact	Availability
SRE Team	[email protected]	24/7 via Teams
Platform Engineering	[email protected]	Business hours + on-call
IBM Support	IBM Support Portal (Case #IBM-12345)	24/7 with support contract
AWS Support	AWS Support Portal (Account #AWS-67890)	24/7 with Business Support

For complete contact details, see the Operations-Runbook#contact-details wiki page.

Documentation References

Technical Documentation

Architecture Documentation - Detailed platform design
Observability Documentation - Monitoring and alerting
Access Documentation - Security and access control
SDLC Documentation - Development and deployment

Runbooks

Gateway Runbook - Gateway troubleshooting
Management Runbook - Management troubleshooting
Portal Runbook - Developer Portal troubleshooting
Analytics Runbook - Analytics troubleshooting
Infrastructure Runbook - AWS and Kubernetes issues
Database Runbook - Database operations
Maintenance Runbook - Planned maintenance procedures
Operations Runbook - Day-to-day operational procedures