# API Connect Observability

## Overview

This page documents the monitoring, logging, and observability practices for our IBM API Connect platform running on AWS EKS.

## Monitoring Strategy

Our monitoring approach follows these key principles:

- Complete visibility across all platform components
- Proactive detection of issues before they impact users
- Clear ownership and escalation paths for alerts
- Comprehensive logging for troubleshooting and audit
- Metrics-driven performance analysis and capacity planning

## Monitoring Tools

### Dynatrace

Dynatrace provides our primary Application Performance Monitoring (APM) solution with full-stack visibility across our API Connect platform.

#### Dynatrace Environment

#### Key Dashboards

| Dashboard | Purpose | Link | Primary Audience |
|---|---|---|---|
| API Connect Overview | Platform health summary | Dashboard Link | All Teams |
| Gateway Performance | Detailed gateway metrics | Dashboard Link | SRE Team |
| API Performance | API-level performance metrics | Dashboard Link | SRE, API Teams |
| Infrastructure Health | EKS and AWS resources | Dashboard Link | SRE Team |
| Business Impact | Business KPIs and user experience | Dashboard Link | Product, Business Teams |

#### Synthetic Monitors

| Monitor Name | Test Frequency | Type | Link |
|---|---|---|---|
| API Gateway Health | 5 minutes | HTTP | Monitor Link |
| OAuth Flow | 15 minutes | Browser | Monitor Link |
| Developer Portal | 10 minutes | Browser | Monitor Link |
| Critical APIs | 5 minutes | HTTP | Monitor Link |
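
Each HTTP monitor boils down to a status and latency assertion against a gateway endpoint. The sketch below is a minimal local equivalent in Python; the endpoint URL and the 500 ms budget are assumptions for illustration, not the monitors' actual configuration.

```python
import requests

# Hypothetical health endpoint; the real checks run as Dynatrace HTTP monitors.
GATEWAY_HEALTH_URL = "https://api.example.com/health"
LATENCY_BUDGET_SECONDS = 0.5  # assumed budget, mirroring the 500ms threshold used elsewhere

def check_gateway_health() -> bool:
    """Return True if the gateway answers successfully within the latency budget."""
    try:
        resp = requests.get(GATEWAY_HEALTH_URL, timeout=LATENCY_BUDGET_SECONDS)
    except requests.RequestException:
        return False  # timeouts and connection errors count as failed checks
    return resp.ok  # True for any status below 400

if __name__ == "__main__":
    print("healthy" if check_gateway_health() else "unhealthy")
```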

#### Problem Detection

Dynatrace automatically detects problems using anomaly detection and configured thresholds. Problem settings are configured in:

### Splunk

Splunk provides our log aggregation, analysis, and search capabilities.

#### Splunk Environment

- Splunk Instance
- Main Index: `api_connect`
- Retention: 30 days hot, 90 days cold

#### Key Dashboards

| Dashboard | Purpose | Link | Primary Audience |
|---|---|---|---|
| API Connect Overview | Log summary and error trends | Dashboard Link | SRE Team |
| Error Analysis | Detailed error investigation | Dashboard Link | SRE, Development |
| Security Events | Security-focused events | Dashboard Link | Security Team |
| Audit Trail | User and system actions | Dashboard Link | Compliance, Security |
| API Usage | API traffic and consumption | Dashboard Link | Product, Business Teams |

#### Saved Searches

| Search Name | Purpose | Schedule | Link |
|---|---|---|---|
| Critical Errors | Detect critical error conditions | Every 5 min | Search Link |
| Rate Limit Breaches | Monitor rate limiting | Every 15 min | Search Link |
| Certificate Issues | Detect TLS/cert problems | Hourly | Search Link |
| Unusual Patterns | ML-based anomaly detection | Hourly | Search Link |
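
These searches can also be exercised ad hoc over Splunk's REST API, which is handy when tuning them. Below is a sketch of a one-shot query roughly equivalent to the Critical Errors search; the host, token, and field names (`log_level`, `component`) are assumptions about our indexing, not the saved search's exact SPL.

```python
import requests

# Assumed Splunk management endpoint and token; substitute your instance values.
SPLUNK_BASE = "https://splunk.example.com:8089"
HEADERS = {"Authorization": "Bearer <token>"}

# Ad-hoc approximation of the "Critical Errors" saved search.
query = "search index=api_connect log_level=ERROR earliest=-5m | stats count by component"

resp = requests.post(
    f"{SPLUNK_BASE}/services/search/jobs",
    headers=HEADERS,
    data={"search": query, "exec_mode": "oneshot", "output_mode": "json"},
)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row["component"], row["count"])
```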

## Log Management

### Log Sources

| Component | Log Type | Collection Method | Destination |
|---|---|---|---|
| API Gateway | Application Logs | Fluentd DaemonSet | Splunk `api_connect` index |
| API Gateway | Access Logs | Fluentd DaemonSet | Splunk `api_connect_access` index |
| API Manager | Application Logs | Fluentd DaemonSet | Splunk `api_connect` index |
| Developer Portal | Application Logs | Fluentd DaemonSet | Splunk `api_connect` index |
| Kubernetes | Container Logs | Fluentd DaemonSet | Splunk `kubernetes` index |
| Kubernetes | Control Plane Logs | CloudWatch | Splunk `kubernetes_control` index |
| AWS Load Balancers | Access Logs | S3 + Lambda | Splunk `aws_elb` index |
| AWS CloudTrail | API Activity | CloudTrail | Splunk `aws_cloudtrail` index |
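
Because the Fluentd DaemonSet tails container stdout, services should emit one JSON object per line so events land in Splunk with fields already parsed. A minimal Python sketch (the field names are illustrative, not a mandated schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line for Fluentd to forward to Splunk."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()  # container stdout is what the DaemonSet tails
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api-connect")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("request completed")
```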

### Log Retention Policy

| Environment | Hot Storage | Warm Storage | Cold Storage |
|---|---|---|---|
| Production | 30 days | 90 days | 1 year |
| Non-Production | 15 days | 45 days | 90 days |

## Metrics & KPIs

### Platform Health Metrics

| Metric | Description | Source | Alert Threshold | Owner |
|---|---|---|---|---|
| Gateway Availability | % of successful health checks | Dynatrace Synthetic | <99.9% | SRE Team |
| API Success Rate | % of API calls with 2xx/3xx status | Dynatrace | <99.5% | SRE Team |
| Avg Response Time | Average API response time | Dynatrace | >500ms | SRE Team |
| Error Rate | % of 5xx responses | Dynatrace | >1% | SRE Team |
| CPU Utilization | Node-level CPU usage | Dynatrace | >80% | SRE Team |
| Memory Utilization | Node-level memory usage | Dynatrace | >80% | SRE Team |
| Pod Restarts | Count of pod restarts | Kubernetes API | >3 in 15 min | SRE Team |
| Database Connections | Number of active DB connections | RDS Metrics | >80% of pool size | DBA Team |
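
Most of these metrics can also be pulled programmatically from the Dynatrace Metrics API v2, for example to feed capacity-planning reports. A sketch using the built-in host CPU metric; the environment URL and token are placeholders:

```python
import requests

# Placeholder tenant URL and API token; substitute your Dynatrace environment.
DT_BASE = "https://<environment-id>.live.dynatrace.com"
HEADERS = {"Authorization": "Api-Token <token>"}

params = {
    "metricSelector": "builtin:host.cpu.usage:avg",  # standard Dynatrace host metric
    "from": "now-15m",
    "resolution": "1m",
}
resp = requests.get(f"{DT_BASE}/api/v2/metrics/query", headers=HEADERS, params=params)
resp.raise_for_status()

for series in resp.json()["result"][0]["data"]:
    values = [v for v in series["values"] if v is not None]
    if values and values[-1] > 80:  # mirrors the >80% CPU threshold above
        print(f"host {series['dimensions']}: CPU {values[-1]:.1f}% over threshold")
```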

## Alerting Configuration

### Alert Severity Levels

| Severity | Description | Response Time | Notification Method |
|---|---|---|---|
| Critical (P1) | Service unavailability, data loss risk | <15 minutes | PagerDuty, SMS, Email |
| High (P2) | Degraded performance, potential customer impact | <30 minutes | PagerDuty, Email |
| Medium (P3) | Non-customer-impacting issues | <2 hours | Email, Slack |
| Low (P4) | Minor issues, informational | Next business day | Slack |
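
For P1/P2 alerts, the PagerDuty notification reduces to one Events API v2 call. A sketch; the routing key and source name are placeholders for our actual integration values:

```python
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"  # PagerDuty Events API v2

def page(summary: str, severity: str = "critical") -> None:
    """Trigger a PagerDuty incident; P3/P4 alerts go to email/Slack instead."""
    payload = {
        "routing_key": "<integration-routing-key>",  # placeholder per-service key
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "api-connect-monitoring",  # placeholder source name
            "severity": severity,  # one of: critical, error, warning, info
        },
    }
    requests.post(EVENTS_URL, json=payload, timeout=10).raise_for_status()

page("Gateway availability below 99.9% over the last 15 minutes")
```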

### Dynatrace Alerting Profiles

| Profile | Purpose | Configuration | Notification |
|---|---|---|---|
| Critical Infrastructure | Gateway, Manager service outages | Configuration Link | PagerDuty |
| Performance Issues | Slow response, high resource usage | Configuration Link | Email, Slack |
| Error Spikes | Unusual error patterns | Configuration Link | Email, Slack |
| Security Issues | Certificate, authentication problems | Configuration Link | Email, PagerDuty |

### ServiceNow

Alerts from Dynatrace and Splunk automatically create tickets in ServiceNow using these integrations:

Configuration details:
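
For illustration only: ticket creation ultimately reduces to a ServiceNow Table API call like the one below. The instance name, credentials, and field values are assumptions, not our integration's actual configuration.

```python
import requests

SN_BASE = "https://<instance>.service-now.com"  # placeholder instance

def create_incident(short_description: str, urgency: str) -> str:
    """Create an incident via the ServiceNow Table API and return its number."""
    resp = requests.post(
        f"{SN_BASE}/api/now/table/incident",
        auth=("<user>", "<password>"),  # placeholder credentials
        headers={"Accept": "application/json"},
        json={"short_description": short_description, "urgency": urgency},
    )
    resp.raise_for_status()
    return resp.json()["result"]["number"]

print(create_incident("Dynatrace: gateway error rate above 1%", "1"))
```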

## Observability as Code

Our observability configurations are managed as code:
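
One illustrative shape of this, assuming alerting-profile JSON lives in git and a CI job applies it through the Dynatrace configuration API (paths and IDs below are hypothetical):

```python
import json
import pathlib
import requests

# Placeholder tenant URL and token; in CI these would come from secrets.
DT_BASE = "https://<environment-id>.live.dynatrace.com"
HEADERS = {"Authorization": "Api-Token <token>", "Content-Type": "application/json"}

# Hypothetical repo layout: one JSON file per alerting profile, each with an "id".
for path in sorted(pathlib.Path("observability/alerting-profiles").glob("*.json")):
    profile = json.loads(path.read_text())
    resp = requests.put(
        f"{DT_BASE}/api/config/v1/alertingProfiles/{profile['id']}",
        headers=HEADERS,
        data=json.dumps(profile),
    )
    resp.raise_for_status()
    print(f"applied {path.name}")
```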

## SLIs/SLOs

### Service Level Indicators

| Service | SLI | Measurement Method | Current Performance |
|---|---|---|---|
| API Gateway | Availability | Synthetic tests | SLI Dashboard |
| API Gateway | Latency | API response times | SLI Dashboard |
| Developer Portal | Availability | Synthetic tests | SLI Dashboard |
| API Manager | Availability | Synthetic tests | SLI Dashboard |
| API Manager | Functionality | Key operation tests | SLI Dashboard |

### Service Level Objectives

| Service | SLO | Target | Measurement Window | Status |
|---|---|---|---|---|
| API Gateway | Availability | 99.95% | 30-day rolling | SLO Dashboard |
| API Gateway | Latency (p95) | <300ms | 30-day rolling | SLO Dashboard |
| Developer Portal | Availability | 99.9% | 30-day rolling | SLO Dashboard |
| API Manager | Availability | 99.9% | 30-day rolling | SLO Dashboard |
| Overall Platform | Error Rate | <0.1% | 30-day rolling | SLO Dashboard |
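
The arithmetic behind these windows is worth keeping handy: a 99.95% availability target over 30 days leaves roughly 21.6 minutes of error budget. A small sketch:

```python
# Error-budget arithmetic for the 99.95% gateway availability SLO.
SLO_TARGET = 0.9995
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day rolling window

budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)  # ~21.6 minutes of allowed downtime

def budget_remaining(downtime_minutes: float) -> float:
    """Fraction of the 30-day error budget still unspent."""
    return 1 - downtime_minutes / budget_minutes

# Example: 8 minutes of downtime so far leaves about 63% of the budget.
print(f"{budget_remaining(8.0):.0%} of the error budget remains")
```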
