[WIP] Alerts For Composio on‐prem - ComposioHQ/helm-charts GitHub Wiki

Team: Composio
Last Updated: 2025-11-03
Environment: Production


Alert 1: 4xx Error rate > 50% on {{span.resource_name}}

Priority: P2 - High
Type: Trace Analytics
Status: Active

Query

trace-analytics("(count[service:apollo resource_name:hono* @hono.response.status_code:[400 TO 499] env:production -resource_name:*identify* -resource_name:*manage* [email protected]_code:401 -resource_name:*DELETE* [email protected]_code:403] / max(count[service:apollo resource_name:hono* env:production [email protected]_code:401 [email protected]_code:401 [email protected]_code:404], 1000)) * 100").last("5m") > 200

Thresholds

  • Alert: > 200% (effective 50% after calculation)
  • Evaluation Window: last 5 minutes

Notifications

  • Alert: @slack-composio-alerts @pagerduty-oncall
  • Tags: service:apollo, env:production, error_type:4xx

Alert 2: 5xx Error rate > 10% on {{span.resource_name}}

Priority: P1 - Critical
Type: Trace Analytics
Status: Active

Query

trace-analytics("(count[service:apollo resource_name:hono* @hono.response.status_code:[500 TO 599] env:production -resource_name:identify* -resource_name:manage* -resource_name:proxy*] / max(count[service:apollo resource_name:hono* env:production -resource_name:identify* -resource_name:manage* -resource_name:manage* -resource_name:proxy*], 300)) * 100").last("5m") > 20

Thresholds

  • Alert: > 20% (effective 10% after calculation)
  • Evaluation Window: last 5 minutes

Notifications

  • Alert: @slack-composio-critical @pagerduty-oncall
  • Tags: service:apollo, env:production, error_type:5xx

Alert 3: ALB High 5XX Error rate

Priority: P1 - Critical
Type: Metric
Status: Active

Query

avg(last_5m):avg:aws.elb.httpcode_elb_5xx{*} > 10

Thresholds

  • Alert: > 10 errors
  • Evaluation Window: last 5 minutes

Notifications

  • Alert: @slack-infrastructure-critical @pagerduty-oncall
  • Tags: service:alb, aws:elb, error_type:5xx

Alert 4: All tool calls failed in last 15 mins

Priority: P1 - Critical
Type: Metric
Status: Active

Query

sum(last_15m):100 * cutoff_min(sum:mercury.tool_call{env:production, error:true} by {toolkit_name,tool_name,toolkit_version}.as_count(), 25) / cutoff_min(sum:mercury.tool_call{env:production} by {toolkit_name,tool_name,toolkit_version}.as_count(), 25) >= 100

Thresholds

  • Alert: >= 100% (all calls failing)
  • Evaluation Window: last 15 minutes

Notifications

  • Alert: @slack-composio-critical @pagerduty-oncall
  • Tags: service:mercury, env:production, alert_type:tool_failure

Alert 5: Apollo Tool Execution Failed Anomaly Monitor

Priority: P2 - High
Type: Anomaly Detection
Status: Active

Query

avg(last_4h):anomalies(avg:system.load.1{*}, 'basic', 2, direction='above', interval=60, alert_window='last_15m', count_default_zero='true') >= 1

Thresholds

  • Alert: >= 1 (anomaly detected)
  • Evaluation Window: last 4 hours, alert window last 15 minutes

Notifications

  • Alert: @slack-composio-alerts
  • Tags: service:apollo, anomaly:tool_execution

Alert 6: Apollo Tool Execution Failures - Anomaly Detection

Priority: P2 - High
Type: Anomaly Detection
Status: Active

Query

avg(last_4h):anomalies(avg:system.load.1{*}, 'basic', 2, direction='both', interval=60, alert_window='last_15m', count_default_zero='true') >= 0.01

Thresholds

  • Alert: >= 0.01 (anomaly detected in either direction)
  • Evaluation Window: last 4 hours, alert window last 15 minutes

Notifications

  • Alert: @slack-composio-alerts
  • Tags: service:apollo, anomaly:tool_execution

Alert 7: CPU load is high with {{threshold}} on {{ecs_service.name}}

Priority: P2 - High
Type: Metric
Status: Active

Query

avg(last_1h):avg:aws.ecs.service.memory_utilization{clustername:prod_cluster} by {servicename} > 75

Thresholds

  • Alert: > 75%
  • Evaluation Window: last 1 hour

Notifications

  • Alert: @slack-infrastructure-alerts @pagerduty-oncall
  • Tags: aws:ecs, cluster:prod_cluster, resource:cpu

Alert 8: Error rate is high on mcp-server production

Priority: P1 - Critical
Type: Anomaly Detection
Status: Active

Query

avg(last_1h):anomalies(avg:trace.http.server.request.errors{env:production,service:mcp-server-next}.as_rate(), 'basic', 2, direction='above', alert_window='last_5m', interval=20, count_default_zero='true') >= 0.9

Thresholds

  • Alert: >= 0.9 (anomaly confidence)
  • Evaluation Window: last 1 hour, alert window last 5 minutes

Notifications

  • Alert: @slack-composio-critical @pagerduty-oncall
  • Tags: service:mcp-server-next, env:production, anomaly:error_rate

Alert 9: High Error Rate on {{functionname.name}} in {{region.name}} for {{aws_account.name}}

Priority: P2 - High
Type: Metric
Status: Active

Query

sum(last_15m):sum:aws.lambda.errors{*} by {functionname,region,aws_account}.as_count() / sum:aws.lambda.invocations{*} by {functionname,region,aws_account}.as_count() >= 0.1

Thresholds

  • Alert: >= 10% error rate
  • Evaluation Window: last 15 minutes

Notifications

  • Alert: @slack-serverless-alerts @pagerduty-oncall
  • Tags: aws:lambda, resource:error_rate

Alert 10: Important Endpoint 5xx Error rate > {{threshold}}%

Priority: P1 - Critical
Type: Trace Analytics
Status: Active

Query

trace-analytics("(count[service:apollo env:production @hono.response.status_code:[500 TO 599] (@hono.request.path:"/api/v3/toolkits" (@hono.request.method:GET AND @hono.request.path:"/api/v3/auth_configs") OR (@hono.request.method:GET AND @hono.request.path:"/api/v3/connected_accounts") OR @hono.request.path:"/api/v3/tools" OR @hono.request.path:"/api/v3/connected_accounts/:nanoid" OR @hono.request.path:"/api/v3/internal/connected_accounts/link/:token" OR @hono.request.path:"/api/v3/tools/:tool_slug")] / max(count[service:apollo env:production (@hono.request.path:"/api/v3/toolkits" (@hono.request.method:GET AND @hono.request.path:"/api/v3/auth_configs") OR (@hono.request.method:GET AND @hono.request.path:"/api/v3/connected_accounts") OR @hono.request.path:"/api/v3/tools" OR @hono.request.path:"/api/v3/tools/execute/:tool_slug" OR @hono.request.path:"/api/v3/connected_accounts/:nanoid" OR @hono.request.path:"/api/v3/internal/connected_accounts/link/:token" OR @hono.request.path:"/api/v3/tools/:tool_slug")], 300)) * 100").last("5m") > 20

Thresholds

  • Alert: > 20%
  • Evaluation Window: last 5 minutes

Notifications

  • Alert: @slack-composio-critical @pagerduty-oncall
  • Tags: service:apollo, env:production, endpoints:critical

Alert 11: Lambda High Error Rate

Priority: P2 - High
Type: Metric
Status: Active

Query

avg(last_5m):sum:aws.lambda.errors{*} / sum:aws.lambda.invocations{*} > 0.05

Thresholds

  • Alert: > 5% error rate
  • Evaluation Window: last 5 minutes

Notifications

  • Alert: @slack-serverless-alerts
  • Tags: aws:lambda, resource:error_rate

Alert 12: Memory usage on {{servicename.name}} is high with {{threshold}}

Priority: P2 - High
Type: Metric
Status: Active

Query

avg(last_1h):avg:aws.ecs.service.memory_utilization{clustername:prod_cluster} by {servicename} > 75

Thresholds

  • Alert: > 75%
  • Evaluation Window: last 1 hour

Notifications

  • Alert: @slack-infrastructure-alerts @pagerduty-oncall
  • Tags: aws:ecs, cluster:prod_cluster, resource:memory

Alert 13: Polling Triggers Workflows are stopping

Priority: P1 - Critical
Type: Metric
Status: Active

Query

avg(last_5m):per_hour(sum:temporal_workflow_completed{namespace:polling-prod.kl3mw,workflow_type:polltriggerworkflow}) / 60 > 10

Thresholds

  • Alert: > 10 completions per minute
  • Evaluation Window: last 5 minutes

Notifications

  • Alert: @slack-composio-critical @pagerduty-oncall
  • Tags: service:temporal, workflow:polling, namespace:polling-prod

Alert 14: Thermos HTTP Request Errors >1%

Priority: P2 - High
Type: Metric
Status: Active

Query

sum(last_5m):sum:trace.http.request.errors{env:production, service:thermos}.as_count() / sum:trace.http.request.hits{env:production, service:thermos}.as_count() * 100 > 1

Thresholds

  • Alert: > 1% error rate
  • Evaluation Window: last 5 minutes

Notifications

  • Alert: @slack-composio-alerts
  • Tags: service:thermos, env:production, error_type:http

Alert 15: Tool anomaly alert

Priority: P2 - High
Type: Anomaly Detection
Status: Active

Query

avg(last_4h):anomalies(cutoff_min(sum:mercury.tool_call{env:production, error:true} by {toolkit_name,tool_name}.as_count(), 5) / cutoff_min(sum:mercury.tool_call{env:production} by {toolkit_name,tool_name}.as_count(), 5) * 100, 'basic', 2, direction='both', interval=60, alert_window='last_15m', count_default_zero='true') >= 1

Thresholds

  • Alert: >= 1 (anomaly detected)
  • Evaluation Window: last 4 hours, alert window last 15 minutes

Notifications

  • Alert: @slack-composio-alerts
  • Tags: service:mercury, env:production, anomaly:tool_errors

Alert 16: [AWS] RDS CPU utilization is high

Priority: P1 - Critical
Type: Metric
Status: Active

Query

max(last_15m):avg:aws.rds.cpuutilization{! dbinstanceidentifier:stagingrds} by {dbinstanceidentifier} > 80

Thresholds

  • Alert: > 80%
  • Evaluation Window: last 15 minutes

Notifications

  • Alert: @slack-database-critical @pagerduty-dba
  • Tags: aws:rds, resource:cpu

Alert 17: [AWS] RDS Storage utilization is high

Priority: P1 - Critical
Type: Metric
Status: Active

Query

avg(last_15m):100 - ((avg:aws.rds.free_storage_space{*} by {dbinstanceidentifier,engine} / avg:aws.rds.total_storage_space{*} by {dbinstanceidentifier,engine}) * 100) > 90

Thresholds

  • Alert: > 90% storage used
  • Evaluation Window: last 15 minutes

Notifications

  • Alert: @slack-database-critical @pagerduty-dba
  • Tags: aws:rds, resource:storage

Alert 18: [Mercury] OTA Module Load Failures

Priority: P2 - High
Type: Log Analytics
Status: Active

Query

logs("\"No module named 'ota'\" env:production").index("*").rollup("count").last("5m") > 1

Thresholds

  • Alert: > 1 occurrence
  • Evaluation Window: last 5 minutes

Notifications

  • Alert: @slack-composio-alerts
  • Tags: service:mercury, env:production, error_type:module_load

Quick Reference Table

# Alert Name Priority Type Threshold Service Status
1 4xx Error rate > 50% P2 Trace > 200% Apollo Active
2 5xx Error rate > 10% P1 Trace > 20% Apollo Active
3 ALB High 5XX Error P1 Metric > 10 ALB Active
4 All tool calls failed P1 Metric >= 100% Mercury Active
5 Apollo Tool Exec Anomaly P2 Anomaly >= 1 Apollo Active
6 Apollo Tool Failures Anomaly P2 Anomaly >= 0.01 Apollo Active
7 CPU load high on ECS P2 Metric > 75% ECS Active
8 MCP Server Error Rate P1 Anomaly >= 0.9 MCP Active
9 Lambda High Error Rate P2 Metric >= 10% Lambda Active
10 Important Endpoint 5xx P1 Trace > 20% Apollo Active
11 Lambda High Error Rate P2 Metric > 5% Lambda Active
12 Memory usage high on ECS P2 Metric > 75% ECS Active
13 Polling Triggers Stopping P1 Metric > 10/min Temporal Active
14 Thermos HTTP Errors P2 Metric > 1% Thermos Active
15 Tool anomaly P2 Anomaly >= 1 Mercury Active
16 RDS CPU utilization high P1 Metric > 80% RDS Active
17 RDS Storage high P1 Metric > 90% RDS Active
18 Mercury OTA Module Load P2 Log > 1 Mercury Active

Notes

  • All alerts are configured for production environment
  • Critical (P1) alerts trigger PagerDuty notifications
  • Alert thresholds should be reviewed quarterly
  • Update notification channels as team structure changes