Observability - CodySchluenz/tester GitHub Wiki
# API Connect Observability

## Overview

This page documents the monitoring, logging, and observability practices for our IBM API Connect platform running on AWS EKS.

## Monitoring Strategy

Our monitoring approach follows these key principles:

- Complete visibility across all platform components
- Proactive detection of issues before they impact users
- Clear ownership and escalation paths for alerts
- Comprehensive logging for troubleshooting and audit
- Metrics-driven performance analysis and capacity planning
## Monitoring Tools

### Dynatrace

Dynatrace is our primary Application Performance Monitoring (APM) solution, providing full-stack visibility across the API Connect platform.

#### Dynatrace Environment

#### Key Dashboards
| Dashboard | Purpose | Link | Primary Audience |
|---|---|---|---|
| API Connect Overview | Platform health summary | Dashboard Link | All Teams |
| Gateway Performance | Detailed gateway metrics | Dashboard Link | SRE Team |
| API Performance | API-level performance metrics | Dashboard Link | SRE, API Teams |
| Infrastructure Health | EKS and AWS resources | Dashboard Link | SRE Team |
| Business Impact | Business KPIs and user experience | Dashboard Link | Product, Business Teams |
#### Synthetic Monitors

#### Problem Detection

Dynatrace automatically detects problems using anomaly detection and configured thresholds. Problem settings are configured in:
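To illustrate what baseline-relative anomaly detection means conceptually (Dynatrace's detector is far more sophisticated, with auto-baselining and seasonality; the function and parameters below are illustrative, not our actual settings), a minimal sketch might flag any value that deviates more than a few standard deviations from recent history:

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, k: float = 3.0) -> bool:
    """Flag `latest` when it deviates more than k standard deviations
    from the historical baseline. Purely illustrative; the real
    evaluation happens server-side in Dynatrace.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # flat baseline: any change is anomalous
    return abs(latest - mu) > k * sigma
```

Static thresholds (the "configured thresholds" above) complement this: baselines catch unusual-but-in-range behavior, while fixed limits catch absolute violations regardless of history.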
### Splunk

Splunk provides our log aggregation, analysis, and search capabilities.

#### Splunk Environment

- Splunk Instance
- Main Index: `api_connect`
- Retention: 30 days hot, 90 days warm, 1 year cold (production; see Log Retention Policy below)
#### Key Dashboards

| Dashboard | Purpose | Link | Primary Audience |
|---|---|---|---|
| API Connect Overview | Log summary and error trends | Dashboard Link | SRE Team |
| Error Analysis | Detailed error investigation | Dashboard Link | SRE, Development |
| Security Events | Security-focused events | Dashboard Link | Security Team |
| Audit Trail | User and system actions | Dashboard Link | Compliance, Security |
| API Usage | API traffic and consumption | Dashboard Link | Product, Business Teams |
#### Saved Searches

| Search Name | Purpose | Schedule | Link |
|---|---|---|---|
| Critical Errors | Detect critical error conditions | Every 5 min | Search Link |
| Rate Limit Breaches | Monitor rate limiting | Every 15 min | Search Link |
| Certificate Issues | Detect TLS/cert problems | Hourly | Search Link |
| Unusual Patterns | ML-based anomaly detection | Hourly | Search Link |
## Log Management

### Log Sources

| Component | Log Type | Collection Method | Destination |
|---|---|---|---|
| API Gateway | Application Logs | Fluentd DaemonSet | Splunk `api_connect` index |
| API Gateway | Access Logs | Fluentd DaemonSet | Splunk `api_connect_access` index |
| API Manager | Application Logs | Fluentd DaemonSet | Splunk `api_connect` index |
| Developer Portal | Application Logs | Fluentd DaemonSet | Splunk `api_connect` index |
| Kubernetes | Container Logs | Fluentd DaemonSet | Splunk `kubernetes` index |
| Kubernetes | Control Plane Logs | CloudWatch | Splunk `kubernetes_control` index |
| AWS Load Balancers | Access Logs | S3 + Lambda | Splunk `aws_elb` index |
| AWS CloudTrail | API Activity | CloudTrail | Splunk `aws_cloudtrail` index |
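Each collection path above ultimately delivers events to Splunk's HTTP Event Collector (HEC). As a hedged sketch of what the Fluentd Splunk output plugin assembles for us (the index names match the table, but the sourcetype and host values here are illustrative placeholders), a HEC event payload looks like:

```python
import json
import time

def build_hec_event(message: dict, index: str = "api_connect",
                    sourcetype: str = "kube:container",
                    host: str = "apic-gateway") -> str:
    """Build a Splunk HTTP Event Collector (HEC) event payload.

    The result would be POSTed to /services/collector/event with an
    `Authorization: Splunk <token>` header; this sketch only builds
    the JSON body.
    """
    payload = {
        "time": time.time(),      # event timestamp (epoch seconds)
        "host": host,             # originating host (placeholder)
        "index": index,           # target index from the Log Sources table
        "sourcetype": sourcetype, # placeholder sourcetype
        "event": message,         # the log record itself
    }
    return json.dumps(payload)
```

Setting the index explicitly in the payload is what lets one collector fan events out to `api_connect`, `api_connect_access`, `kubernetes`, and the other indexes listed above.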
### Log Retention Policy

| Environment | Hot Storage | Warm Storage | Cold Storage |
|---|---|---|---|
| Production | 30 days | 90 days | 1 year |
| Non-Production | 15 days | 45 days | 90 days |
## Metrics & KPIs

### Platform Health Metrics

| Metric | Description | Source | Alert Threshold | Owner |
|---|---|---|---|---|
| Gateway Availability | % of successful health checks | Dynatrace Synthetic | <99.9% | SRE Team |
| API Success Rate | % of API calls with 2xx/3xx status | Dynatrace | <99.5% | SRE Team |
| Avg Response Time | Average API response time | Dynatrace | >500ms | SRE Team |
| Error Rate | % of 5xx responses | Dynatrace | >1% | SRE Team |
| CPU Utilization | Node-level CPU usage | Dynatrace | >80% | SRE Team |
| Memory Utilization | Node-level memory usage | Dynatrace | >80% | SRE Team |
| Pod Restarts | Count of pod restarts | Kubernetes API | >3 in 15 min | SRE Team |
| Database Connections | Number of active DB connections | RDS Metrics | >80% of pool size | DBA Team |
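The comparator direction in the threshold column matters: availability-style metrics alert when they drop *below* the limit, resource-style metrics when they rise *above* it. A minimal sketch encoding a subset of the table (the real evaluation happens in Dynatrace; the metric keys here are illustrative names, not actual metric IDs):

```python
# Thresholds from the Platform Health Metrics table above.
# ("<", x) alerts when the value falls below x; (">", x) when it exceeds x.
THRESHOLDS = {
    "gateway_availability": ("<", 99.9),   # percent
    "api_success_rate":     ("<", 99.5),   # percent
    "avg_response_time_ms": (">", 500.0),  # milliseconds
    "error_rate":           (">", 1.0),    # percent of 5xx
    "cpu_utilization":      (">", 80.0),   # percent, node-level
    "memory_utilization":   (">", 80.0),   # percent, node-level
}

def evaluate(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that breach their alert threshold."""
    breaches = []
    for name, value in metrics.items():
        op, limit = THRESHOLDS[name]
        if (op == "<" and value < limit) or (op == ">" and value > limit):
            breaches.append(name)
    return breaches
```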
## Alerting Configuration

### Alert Severity Levels

| Severity | Description | Response Time | Notification Method |
|---|---|---|---|
| Critical (P1) | Service unavailability, data loss risk | <15 minutes | PagerDuty, SMS, Email |
| High (P2) | Degraded performance, potential customer impact | <30 minutes | PagerDuty, Email |
| Medium (P3) | Non-customer-impacting issues | <2 hours | Email, Slack |
| Low (P4) | Minor issues, informational | Next business day | Slack |
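The severity-to-channel mapping above can be sketched as a simple lookup (channel names here are labels mirroring the table, not the actual PagerDuty or Slack integration identifiers):

```python
# Notification channels per severity, from the Alert Severity Levels table.
ROUTING = {
    "P1": ["PagerDuty", "SMS", "Email"],
    "P2": ["PagerDuty", "Email"],
    "P3": ["Email", "Slack"],
    "P4": ["Slack"],
}

def channels_for(severity: str) -> list[str]:
    """Look up notification channels, failing loudly on unknown severities."""
    try:
        return ROUTING[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}")
```

Failing loudly on an unknown severity (rather than defaulting to a quiet channel) avoids silently under-notifying on a misclassified alert.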
### Dynatrace Alerting Profiles

| Profile | Purpose | Configuration | Notification |
|---|---|---|---|
| Critical Infrastructure | Gateway, Manager service outages | Configuration Link | PagerDuty |
| Performance Issues | Slow response, high resource usage | Configuration Link | Email, Slack |
| Error Spikes | Unusual error patterns | Configuration Link | Email, Slack |
| Security Issues | Certificate, authentication problems | Configuration Link | Email, PagerDuty |
### ServiceNow

Alerts from Dynatrace and Splunk automatically create tickets in ServiceNow through these integrations:

Configuration details:
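As an illustrative sketch of what such an integration sends, ServiceNow's Table API accepts a JSON body on `POST /api/now/table/incident`. The severity-to-urgency mapping and service account below are assumptions for illustration, not our actual integration configuration:

```python
def build_incident(source: str, severity: str, summary: str) -> dict:
    """Build a ServiceNow Table API incident payload.

    `source` is the alerting tool (e.g. "Dynatrace" or "Splunk");
    `severity` is one of the P1-P4 levels defined above.
    """
    # Assumed mapping from our severity levels to ServiceNow urgency codes.
    urgency = {"P1": "1", "P2": "2", "P3": "3", "P4": "3"}[severity]
    return {
        "short_description": f"[{source}] {summary}",
        "urgency": urgency,
        "impact": urgency,                 # mirror urgency in this sketch
        "caller_id": "observability.bot",  # hypothetical service account
    }
```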
## Observability as Code

Our observability configurations are managed as code:
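One benefit of keeping these configurations in a repository is that they can be linted in CI before being applied. A minimal sketch of such a check, assuming exported Dynatrace dashboard JSON files in a flat directory (the directory layout and required keys are assumptions, not our actual pipeline):

```python
import json
from pathlib import Path

# Assumed top-level keys of an exported Dynatrace dashboard definition.
REQUIRED_KEYS = {"dashboardMetadata", "tiles"}

def validate_dashboards(config_dir: str) -> list[str]:
    """Return a list of validation errors for dashboard JSON files.

    An empty list means every file parsed and contained the
    required top-level keys.
    """
    errors = []
    for path in sorted(Path(config_dir).glob("*.json")):
        try:
            doc = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            errors.append(f"{path.name}: invalid JSON ({exc})")
            continue
        missing = REQUIRED_KEYS - doc.keys()
        if missing:
            errors.append(f"{path.name}: missing keys {sorted(missing)}")
    return errors
```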
## SLIs/SLOs

### Service Level Indicators

| Service | SLI | Measurement Method | Current Performance |
|---|---|---|---|
| API Gateway | Availability | Synthetic tests | SLI Dashboard |
| API Gateway | Latency | API response times | SLI Dashboard |
| Developer Portal | Availability | Synthetic tests | SLI Dashboard |
| API Manager | Availability | Synthetic tests | SLI Dashboard |
| API Manager | Functionality | Key operation tests | SLI Dashboard |
### Service Level Objectives

| Service | SLO | Target | Measurement Window | Status |
|---|---|---|---|---|
| API Gateway | Availability | 99.95% | 30-day rolling | SLO Dashboard |
| API Gateway | Latency (p95) | <300ms | 30-day rolling | SLO Dashboard |
| Developer Portal | Availability | 99.9% | 30-day rolling | SLO Dashboard |
| API Manager | Availability | 99.9% | 30-day rolling | SLO Dashboard |
| Overall Platform | Error Rate | <0.1% | 30-day rolling | SLO Dashboard |
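Each availability target implies an error budget: the downtime the service may accrue in the measurement window before the SLO is breached. A quick calculation from the targets above:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the measurement window.

    For example, the API Gateway's 99.95% target over a 30-day rolling
    window leaves 30 * 24 * 60 * 0.0005 = 21.6 minutes of error budget.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target / 100)

print(round(error_budget_minutes(99.95), 1))  # 21.6 (API Gateway)
print(round(error_budget_minutes(99.9), 1))   # 43.2 (Portal, Manager)
```

Framing targets as budgets makes trade-offs concrete: a single 25-minute gateway outage exhausts the month's entire 99.95% budget.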
## References