[WIP] Alerts For Composio on-prem - ComposioHQ/helm-charts GitHub Wiki
Team: Composio
Last Updated: 2025-11-03
Environment: Production
Alert 1: 4xx Error rate > 50% on {{span.resource_name}}
Priority: P2 - High
Type: Trace Analytics
Status: Active
Query
trace-analytics("(count[service:apollo resource_name:hono* @hono.response.status_code:[400 TO 499] env:production -resource_name:*identify* -resource_name:*manage* -@hono.response.status_code:401 -resource_name:*DELETE* -@hono.response.status_code:403] / max(count[service:apollo resource_name:hono* env:production -@hono.response.status_code:401 -@hono.response.status_code:404], 1000)) * 100").last("5m") > 200
Thresholds
- Alert: > 200% (effective 50% after calculation)
- Evaluation Window: last 5 minutes
Notifications
- Alert: @slack-composio-alerts @pagerduty-oncall
- Tags: service:apollo, env:production, error_type:4xx
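The query above divides the 4xx span count by the total span count, with a `max(..., 1000)` clamp on the denominator so that a handful of errors during a quiet period cannot produce a misleadingly high percentage. A minimal sketch of that calculation (hypothetical counts, not the Datadog API):

```python
def error_rate_pct(error_count: int, total_count: int, floor: int = 1000) -> float:
    """Mirror the alert query: errors / max(total, floor) * 100.

    The floor keeps low-traffic windows from inflating the percentage.
    """
    return error_count / max(total_count, floor) * 100

# 300 4xx responses in a window with only 200 total requests:
# without the floor this would read 150%; with it, 30%.
rate = error_rate_pct(300, 200)
```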
Alert 2: 5xx Error rate > 10% on {{span.resource_name}}
Priority: P1 - Critical
Type: Trace Analytics
Status: Active
Query
trace-analytics("(count[service:apollo resource_name:hono* @hono.response.status_code:[500 TO 599] env:production -resource_name:identify* -resource_name:manage* -resource_name:proxy*] / max(count[service:apollo resource_name:hono* env:production -resource_name:identify* -resource_name:manage* -resource_name:proxy*], 300)) * 100").last("5m") > 20
Thresholds
- Alert: > 20% (effective 10% after calculation)
- Evaluation Window: last 5 minutes
Notifications
- Alert: @slack-composio-critical @pagerduty-oncall
- Tags: service:apollo, env:production, error_type:5xx
Alert 3: ALB High 5XX Error rate
Priority: P1 - Critical
Type: Metric
Status: Active
Query
avg(last_5m):avg:aws.elb.httpcode_elb_5xx{*} > 10
Thresholds
- Alert: > 10 errors
- Evaluation Window: last 5 minutes
Notifications
- Alert: @slack-infrastructure-critical @pagerduty-oncall
- Tags: service:alb, aws:elb, error_type:5xx
Alert 4: All tool calls failed in last 15 mins
Priority: P1 - Critical
Type: Metric
Status: Active
Query
sum(last_15m):100 * cutoff_min(sum:mercury.tool_call{env:production, error:true} by {toolkit_name,tool_name,toolkit_version}.as_count(), 25) / cutoff_min(sum:mercury.tool_call{env:production} by {toolkit_name,tool_name,toolkit_version}.as_count(), 25) >= 100
Thresholds
- Alert: >= 100% (all calls failing)
- Evaluation Window: last 15 minutes
Notifications
- Alert: @slack-composio-critical @pagerduty-oncall
- Tags: service:mercury, env:production, alert_type:tool_failure
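The `cutoff_min(..., 25)` wrapper drops datapoints below 25, so a toolkit/tool combination is only evaluated once it has at least 25 calls in the window; the ratio then reaches 100% only when every counted call errored. A rough illustration of the cutoff semantics (hypothetical helper, not Datadog's actual function):

```python
def cutoff_min(value: float, cutoff: float):
    """Return the value only if it meets the cutoff, else None (dropped),
    approximating Datadog's cutoff_min() on a single datapoint."""
    return value if value >= cutoff else None

def all_failed(errors: int, total: int, cutoff: int = 25) -> bool:
    """True when 100% of a tool's calls failed AND the volume clears the cutoff."""
    e, t = cutoff_min(errors, cutoff), cutoff_min(total, cutoff)
    if e is None or t is None:
        return False  # too little traffic in the window to judge
    return 100 * e / t >= 100

# 30 calls, all failed -> alerts; 10 calls, all failed -> ignored (below cutoff)
```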
Alert 5: Apollo Tool Execution Failed Anomaly Monitor
Priority: P2 - High
Type: Anomaly Detection
Status: Active
Query
avg(last_4h):anomalies(avg:system.load.1{*}, 'basic', 2, direction='above', interval=60, alert_window='last_15m', count_default_zero='true') >= 1
Thresholds
- Alert: >= 1 (anomaly detected)
- Evaluation Window: last 4 hours, alert window last 15 minutes
Notifications
- Alert: @slack-composio-alerts
- Tags: service:apollo, anomaly:tool_execution
Alert 6: Apollo Tool Execution Failures - Anomaly Detection
Priority: P2 - High
Type: Anomaly Detection
Status: Active
Query
avg(last_4h):anomalies(avg:system.load.1{*}, 'basic', 2, direction='both', interval=60, alert_window='last_15m', count_default_zero='true') >= 0.01
Thresholds
- Alert: >= 0.01 (anomaly detected in either direction)
- Evaluation Window: last 4 hours, alert window last 15 minutes
Notifications
- Alert: @slack-composio-alerts
- Tags: service:apollo, anomaly:tool_execution
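Both anomaly monitors use Datadog's `'basic'` algorithm with a bound of 2, which flags points falling outside a band derived from recent history (Alert 5 only above the band, Alert 6 in either direction). As a loose analogue for intuition only, a rolling mean plus or minus 2 standard deviations behaves similarly; Datadog's `'basic'` algorithm actually uses a lagging rolling quantile, not a Gaussian band:

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], point: float, bounds: float = 2.0) -> bool:
    """Flag `point` if it falls outside mean +/- bounds * stddev of `history`.
    direction='both' semantics; drop abs() for direction='above'."""
    mu, sigma = mean(history), stdev(history)
    return abs(point - mu) > bounds * sigma

baseline = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1]
# 5.0 sits far outside the band; 1.15 sits comfortably inside it
```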
Alert 7: CPU load is high with {{threshold}} on {{ecs_service.name}}
Priority: P2 - High
Type: Metric
Status: Active
Query
avg(last_1h):avg:aws.ecs.service.cpuutilization{clustername:prod_cluster} by {servicename} > 75
Thresholds
- Alert: > 75%
- Evaluation Window: last 1 hour
Notifications
- Alert: @slack-infrastructure-alerts @pagerduty-oncall
- Tags: aws:ecs, cluster:prod_cluster, resource:cpu
Alert 8: Error rate is high on mcp-server production
Priority: P1 - Critical
Type: Anomaly Detection
Status: Active
Query
avg(last_1h):anomalies(avg:trace.http.server.request.errors{env:production,service:mcp-server-next}.as_rate(), 'basic', 2, direction='above', alert_window='last_5m', interval=20, count_default_zero='true') >= 0.9
Thresholds
- Alert: >= 0.9 (anomaly confidence)
- Evaluation Window: last 1 hour, alert window last 5 minutes
Notifications
- Alert: @slack-composio-critical @pagerduty-oncall
- Tags: service:mcp-server-next, env:production, anomaly:error_rate
Alert 9: High Error Rate on {{functionname.name}} in {{region.name}} for {{aws_account.name}}
Priority: P2 - High
Type: Metric
Status: Active
Query
sum(last_15m):sum:aws.lambda.errors{*} by {functionname,region,aws_account}.as_count() / sum:aws.lambda.invocations{*} by {functionname,region,aws_account}.as_count() >= 0.1
Thresholds
- Alert: >= 10% error rate
- Evaluation Window: last 15 minutes
Notifications
- Alert: @slack-serverless-alerts @pagerduty-oncall
- Tags: aws:lambda, resource:error_rate
Alert 10: Important Endpoint 5xx Error rate > {{threshold}}%
Priority: P1 - Critical
Type: Trace Analytics
Status: Active
Query
trace-analytics("(count[service:apollo env:production @hono.response.status_code:[500 TO 599] (@hono.request.path:"/api/v3/toolkits" OR (@hono.request.method:GET AND @hono.request.path:"/api/v3/auth_configs") OR (@hono.request.method:GET AND @hono.request.path:"/api/v3/connected_accounts") OR @hono.request.path:"/api/v3/tools" OR @hono.request.path:"/api/v3/connected_accounts/:nanoid" OR @hono.request.path:"/api/v3/internal/connected_accounts/link/:token" OR @hono.request.path:"/api/v3/tools/:tool_slug")] / max(count[service:apollo env:production (@hono.request.path:"/api/v3/toolkits" OR (@hono.request.method:GET AND @hono.request.path:"/api/v3/auth_configs") OR (@hono.request.method:GET AND @hono.request.path:"/api/v3/connected_accounts") OR @hono.request.path:"/api/v3/tools" OR @hono.request.path:"/api/v3/tools/execute/:tool_slug" OR @hono.request.path:"/api/v3/connected_accounts/:nanoid" OR @hono.request.path:"/api/v3/internal/connected_accounts/link/:token" OR @hono.request.path:"/api/v3/tools/:tool_slug")], 300)) * 100").last("5m") > 20
Thresholds
- Alert: > 20%
- Evaluation Window: last 5 minutes
Notifications
- Alert: @slack-composio-critical @pagerduty-oncall
- Tags: service:apollo, env:production, endpoints:critical
Alert 11: Lambda High Error Rate
Priority: P2 - High
Type: Metric
Status: Active
Query
avg(last_5m):sum:aws.lambda.errors{*} / sum:aws.lambda.invocations{*} > 0.05
Thresholds
- Alert: > 5% error rate
- Evaluation Window: last 5 minutes
Notifications
- Alert: @slack-serverless-alerts
- Tags: aws:lambda, resource:error_rate
Alert 12: Memory usage on {{servicename.name}} is high with {{threshold}}
Priority: P2 - High
Type: Metric
Status: Active
Query
avg(last_1h):avg:aws.ecs.service.memory_utilization{clustername:prod_cluster} by {servicename} > 75
Thresholds
- Alert: > 75%
- Evaluation Window: last 1 hour
Notifications
- Alert: @slack-infrastructure-alerts @pagerduty-oncall
- Tags: aws:ecs, cluster:prod_cluster, resource:memory
Alert 13: Polling Triggers Workflows are stopping
Priority: P1 - Critical
Type: Metric
Status: Active
Query
avg(last_5m):per_hour(sum:temporal_workflow_completed{namespace:polling-prod.kl3mw,workflow_type:polltriggerworkflow}) / 60 < 10
Thresholds
- Alert: < 10 completions per minute (a drop below this rate indicates polling workflows are stalling)
- Evaluation Window: last 5 minutes
Notifications
- Alert: @slack-composio-critical @pagerduty-oncall
- Tags: service:temporal, workflow:polling, namespace:polling-prod
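The `per_hour()` rollup expresses the completion counter as an hourly rate, and the `/ 60` converts that to completions per minute, which the monitor compares against 10/min. A trivial sketch of the unit conversion (hypothetical numbers):

```python
def per_minute_from_hourly(per_hour_rate: float) -> float:
    """Convert an hourly rate (as produced by per_hour()) to per-minute."""
    return per_hour_rate / 60

rate = per_minute_from_hourly(600)  # 600 completions/hour -> 10.0 per minute
```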
Alert 14: Thermos HTTP Request Errors >1%
Priority: P2 - High
Type: Metric
Status: Active
Query
sum(last_5m):sum:trace.http.request.errors{env:production, service:thermos}.as_count() / sum:trace.http.request.hits{env:production, service:thermos}.as_count() * 100 > 1
Thresholds
- Alert: > 1% error rate
- Evaluation Window: last 5 minutes
Notifications
- Alert: @slack-composio-alerts
- Tags: service:thermos, env:production, error_type:http
Alert 15: Tool anomaly alert
Priority: P2 - High
Type: Anomaly Detection
Status: Active
Query
avg(last_4h):anomalies(cutoff_min(sum:mercury.tool_call{env:production, error:true} by {toolkit_name,tool_name}.as_count(), 5) / cutoff_min(sum:mercury.tool_call{env:production} by {toolkit_name,tool_name}.as_count(), 5) * 100, 'basic', 2, direction='both', interval=60, alert_window='last_15m', count_default_zero='true') >= 1
Thresholds
- Alert: >= 1 (anomaly detected)
- Evaluation Window: last 4 hours, alert window last 15 minutes
Notifications
- Alert: @slack-composio-alerts
- Tags: service:mercury, env:production, anomaly:tool_errors
Alert 16: [AWS] RDS CPU utilization is high
Priority: P1 - Critical
Type: Metric
Status: Active
Query
max(last_15m):avg:aws.rds.cpuutilization{!dbinstanceidentifier:stagingrds} by {dbinstanceidentifier} > 80
Thresholds
- Alert: > 80%
- Evaluation Window: last 15 minutes
Notifications
- Alert: @slack-database-critical @pagerduty-dba
- Tags: aws:rds, resource:cpu
Alert 17: [AWS] RDS Storage utilization is high
Priority: P1 - Critical
Type: Metric
Status: Active
Query
avg(last_15m):100 - ((avg:aws.rds.free_storage_space{*} by {dbinstanceidentifier,engine} / avg:aws.rds.total_storage_space{*} by {dbinstanceidentifier,engine}) * 100) > 90
Thresholds
- Alert: > 90% storage used
- Evaluation Window: last 15 minutes
Notifications
- Alert: @slack-database-critical @pagerduty-dba
- Tags: aws:rds, resource:storage
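The query derives used-storage percentage from the free/total ratio, since CloudWatch reports free space rather than used space directly. A sketch with hypothetical byte counts:

```python
def storage_used_pct(free_bytes: float, total_bytes: float) -> float:
    """100 - (free / total * 100), as in the monitor query."""
    return 100 - (free_bytes / total_bytes) * 100

# 50 GB free of 1 TB total -> ~95.1% used, which would trip the 90% alert
pct = storage_used_pct(50 * 1024**3, 1024**4)
```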
Alert 18: [Mercury] OTA Module Load Failures
Priority: P2 - High
Type: Log Analytics
Status: Active
Query
logs("\"No module named 'ota'\" env:production").index("*").rollup("count").last("5m") > 1
Thresholds
- Alert: > 1 occurrence
- Evaluation Window: last 5 minutes
Notifications
- Alert: @slack-composio-alerts
- Tags: service:mercury, env:production, error_type:module_load
Quick Reference Table
| # | Alert Name | Priority | Type | Threshold | Service | Status |
|---|---|---|---|---|---|---|
| 1 | 4xx Error rate > 50% | P2 | Trace | > 200% | Apollo | Active |
| 2 | 5xx Error rate > 10% | P1 | Trace | > 20% | Apollo | Active |
| 3 | ALB High 5XX Error | P1 | Metric | > 10 | ALB | Active |
| 4 | All tool calls failed | P1 | Metric | >= 100% | Mercury | Active |
| 5 | Apollo Tool Exec Anomaly | P2 | Anomaly | >= 1 | Apollo | Active |
| 6 | Apollo Tool Failures Anomaly | P2 | Anomaly | >= 0.01 | Apollo | Active |
| 7 | CPU load high on ECS | P2 | Metric | > 75% | ECS | Active |
| 8 | MCP Server Error Rate | P1 | Anomaly | >= 0.9 | MCP | Active |
| 9 | Lambda Error Rate per Function | P2 | Metric | >= 10% | Lambda | Active |
| 10 | Important Endpoint 5xx | P1 | Trace | > 20% | Apollo | Active |
| 11 | Lambda High Error Rate | P2 | Metric | > 5% | Lambda | Active |
| 12 | Memory usage high on ECS | P2 | Metric | > 75% | ECS | Active |
| 13 | Polling Triggers Stopping | P1 | Metric | < 10/min | Temporal | Active |
| 14 | Thermos HTTP Errors | P2 | Metric | > 1% | Thermos | Active |
| 15 | Tool anomaly | P2 | Anomaly | >= 1 | Mercury | Active |
| 16 | RDS CPU utilization high | P1 | Metric | > 80% | RDS | Active |
| 17 | RDS Storage high | P1 | Metric | > 90% | RDS | Active |
| 18 | Mercury OTA Module Load | P2 | Log | > 1 | Mercury | Active |
Notes
- All alerts target the production environment
- Critical (P1) alerts trigger PagerDuty notifications
- Alert thresholds should be reviewed quarterly
- Update notification channels as the team structure changes