AI Agent Monitoring - spinningideas/resources GitHub Wiki
AI Agent Monitoring: Local & Cloud Observability Guide
A structured reference for monitoring AI agents that run locally, in the cloud, or across multiple providers. It covers telemetry standards, cost tracking, execution tracing, and provider-specific observability stacks.
Scope: The focus is on practical observability (what to collect, how to collect it, and where to analyze it), whether you are self-hosting on a laptop or running managed agents on AWS, Azure, or Google Cloud.
1. Why AI agent monitoring is different
Traditional application monitoring is built around deterministic signals: an API returns 200 or 500, a server responds or it does not. AI agents are different because the LLM dynamically directs its own process and tool usage [1]. A request can succeed technically yet still be wrong: the agent may hallucinate, call the wrong tool, retrieve the wrong document, or produce a plausible but incorrect answer. That means observability must capture:
- Reasoning steps: which model calls, tool calls, and retrievals happened in what order.
- Quality signals: whether the output was correct, safe, and on-task.
- Cost signals: token usage, estimated spend, and per-agent/run attribution.
- Security signals: permission waits, blocked tools, policy interceptions, and sensitive data exposure.
Google Cloud summarizes the core benefit: because agents are non-deterministic and complex, observability is essential for understanding, debugging, evaluating, and improving their performance, safety, and reliability [3].
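As a minimal sketch of the four signal groups listed above, the record below keeps the ordered reasoning steps, a quality score, token counts, and blocked tools for one run. The `Step` and `RunRecord` types and all field names are hypothetical and not taken from any SDK.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One reasoning step: a model call, tool call, or retrieval."""
    kind: str               # "llm", "tool", or "retrieval"
    name: str               # model or tool name
    blocked: bool = False   # True if a policy or permission check stopped it


@dataclass
class RunRecord:
    """Hypothetical per-run record covering the four signal groups."""
    run_id: str
    steps: List[Step] = field(default_factory=list)          # reasoning, in order
    quality_score: Optional[float] = None                     # set by an evaluator
    input_tokens: int = 0                                     # cost inputs
    output_tokens: int = 0
    blocked_tools: List[str] = field(default_factory=list)   # security

    def add_step(self, kind: str, name: str, blocked: bool = False) -> None:
        """Append a step and remember any tool call that was blocked."""
        self.steps.append(Step(kind, name, blocked))
        if kind == "tool" and blocked:
            self.blocked_tools.append(name)

    def step_order(self) -> List[str]:
        """Return the steps as 'kind:name' strings in execution order."""
        return [f"{s.kind}:{s.name}" for s in self.steps]


run = RunRecord(run_id="run-1")
run.add_step("llm", "planner-model")
run.add_step("tool", "search")
run.add_step("tool", "delete_file", blocked=True)
print(run.step_order())
print(run.blocked_tools)
```

In practice these fields would usually be carried as span attributes and events rather than a standalone object, but the grouping is the same.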
2. What to monitor
2.1 Telemetry signals
| Signal | What it captures | Typical use |
|---|---|---|
| Traces | End-to-end execution path with spans for each step (LLM call, tool call, retrieval) [1][11]. | Debug a specific run, find where time is spent, identify failures. |
| Metrics | Token counts, latency, request counts, error rates, costs, session counts [3][8]. | Dashboards, alerting, budget tracking, fleet health. |
| Logs | Raw events, errors, prompt/response content (when safe) [3][4]. | Deep-dive troubleshooting, audit, security review. |
| Evaluations | Quality scores, hallucination rates, safety metrics, tool-use correctness [4][7]. | Continuous quality monitoring, regressions, compare versions. |
2.2 Key metrics to track
- Tokens and cost: input, output, cached, and reasoning tokens; estimated or actual USD cost per run/agent [12][16][19].
- Latency: time to first token, total duration, per-step latency [4][5].
- Error rates: tool failures, API errors, rate-limit hits, timeout loops [8][9].
- Tool usage: call counts, success/failure rates, latency per tool, frequency of "no tool called" [4][3].
- Session metrics: number of sessions, turns per session, completion rates [8].
- Quality metrics: response quality, hallucination rate, safety, user/tool acceptance rates [4][7][18].
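The token and cost metric above can be estimated from token counts when a provider does not report spend directly. The sketch below assumes cached tokens are a subset of input tokens and are billed at a lower rate. The model name and per-million-token prices are made up for illustration and should be replaced with your provider's published rates.

```python
from dataclasses import dataclass

# Hypothetical USD prices per one million tokens; not real provider rates.
PRICES_PER_MILLION = {
    "example-model": {"input": 2.00, "cached": 0.50, "output": 8.00},
}


@dataclass
class TokenUsage:
    input_tokens: int    # all prompt tokens, including cached ones (assumption)
    cached_tokens: int   # portion of input_tokens served from cache
    output_tokens: int   # completion tokens (assumed to include reasoning tokens)


def estimate_cost_usd(model: str, usage: TokenUsage) -> float:
    """Estimate the cost of one run from its token counts."""
    price = PRICES_PER_MILLION[model]
    uncached = usage.input_tokens - usage.cached_tokens
    total = (
        uncached * price["input"]
        + usage.cached_tokens * price["cached"]
        + usage.output_tokens * price["output"]
    )
    return total / 1_000_000


usage = TokenUsage(input_tokens=10_000, cached_tokens=4_000, output_tokens=1_000)
cost = estimate_cost_usd("example-model", usage)
print(f"estimated cost: ${cost:.4f}")
```

An estimate like this is useful for live dashboards, but it can drift from the invoice, so it is worth reconciling against the provider's billing data.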
3. Open standards: OpenTelemetry and GenAI semantics
The most reliable way to avoid vendor lock-in is to instrument agents with OpenTelemetry and the emerging GenAI semantic conventions. The OpenTelemetry GenAI SIG is defining standards for LLMs, vector databases, and AI agents so frameworks can report consistent traces, metrics, and logs [1].
3.1 Two instrumentation patterns
- Baked-in instrumentation: the framework emits telemetry natively (e.g., CrewAI). Pros: zero setup for users. Cons: possible dependency bloat, risk of version lock-in, slower convention updates [1].
- External instrumentation: separate OpenTelemetry packages (e.g., Traceloop, Langtrace, OpenTelemetry `instrumentation-genai`) are added to the agent. Pros: decouples observability from the framework, leverages community maintenance, easier to mix cloud providers. Cons: risk of package fragmentation if versions drift [1].
Regardless of the pattern, the telemetry should follow the same GenAI semantic conventions (gen_ai.* attributes) for model names, token counts, finish reasons, and so on [2][7].
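To make the `gen_ai.*` attributes concrete, the sketch below builds the attribute set for one model-call span as a plain dictionary. The helper name is hypothetical. The attribute keys reflect my reading of the GenAI semantic conventions, which are still evolving, so they should be checked against the current specification [2] before use.

```python
from typing import Any, Dict, Sequence


def genai_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: Sequence[str],
    operation: str = "chat",
) -> Dict[str, Any]:
    """Return span attributes for one LLM call using gen_ai.* style keys."""
    return {
        # Keys assumed from the GenAI semantic conventions; verify against [2].
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


attrs = genai_span_attributes("example-model", 1200, 300, ["stop"])
for key, value in attrs.items():
    print(key, "=", value)
```

With an OpenTelemetry SDK installed, a dictionary like this can be passed as the attributes of a span, whichever instrumentation pattern produced it.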
3.2 Why standards matter for agents
- Multi-agent topologies can be mapped across tools and services.
- Cross-cloud dashboards can aggregate telemetry from AWS, Azure, and Google Cloud without re-instrumenting.
- Cost attribution can be broken down by provider, model, agent, user, and run using the same attribute names.
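The cost-attribution point can be sketched as a group-by over telemetry records that share attribute names. The record keys used here (`provider`, `model`, `agent`, `cost_usd`) are placeholders for whatever attribute names your instrumentation actually emits.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def cost_by(
    records: Iterable[Mapping[str, object]], keys: Sequence[str]
) -> Dict[Tuple[object, ...], float]:
    """Sum cost_usd across records, grouped by the given attribute names."""
    totals: Dict[Tuple[object, ...], float] = defaultdict(float)
    for record in records:
        # Records missing an attribute are grouped under "unknown".
        group = tuple(record.get(key, "unknown") for key in keys)
        totals[group] += float(record["cost_usd"])
    return dict(totals)


# Hypothetical records, as they might arrive from different clouds.
records = [
    {"provider": "aws", "model": "model-a", "agent": "triage", "cost_usd": 0.25},
    {"provider": "aws", "model": "model-a", "agent": "research", "cost_usd": 0.5},
    {"provider": "azure", "model": "model-b", "agent": "triage", "cost_usd": 0.125},
]

by_provider = cost_by(records, ["provider"])
by_agent = cost_by(records, ["agent"])
print(by_provider)
print(by_agent)
```

The same function works for any combination of keys, which is only possible because every source reports the attribute under the same name.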
4. Monitoring local agents
Local agents include scripts on your laptop, CLI tools such as Claude Code, local IDE extensions, or self-hosted frameworks running in Docker or Kubernetes.
4.1 SDK-level tracking
| SDK / Tool | Built-in cost/usage tracking | OpenTelemetry export |
|---|---|---|
| OpenAI Agents SDK | Yes. Runner.run(...) returns usage with requests, input_tokens, output_tokens, cached_tokens, reasoning_tokens, and per-request entries in request_usage_entries [19]. | Limited; can use hooks to send usage to a backend. |
| Anthropic Claude Agent SDK | Yes. Per-step token usage, per-model cost, and total_cost_usd on each query() result [16]. | Yes. Native OTLP export of traces, metrics, and log events; spans include claude_code.interaction, claude_code.llm_request, and claude_code.tool [15]. |
| Claude Code CLI | Via total_cost_usd in the result stream [16]. | Yes. Emits OTLP metrics, logs, and traces (traces require CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1) [15]. |
| Microsoft Foundry Toolkit for VS Code | Through the agent SDK. | Yes. Local OTLP-compatible collector for tracing during development [7]. |
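For the Claude Code CLI row, OTLP export is configured through environment variables. The sketch below assembles such an environment for a child process. `CLAUDE_CODE_ENHANCED_TELEMETRY_BETA` comes from the table above. The other variable names are my understanding of the Claude Code and OpenTelemetry documentation, and the helper name and endpoint are assumptions, so confirm them against [15] before relying on them.

```python
import os
from typing import Dict, Mapping, Optional


def claude_code_otel_env(
    endpoint: str,
    enable_traces: bool = False,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with OTLP export settings added."""
    env = dict(os.environ if base is None else base)
    env.update(
        {
            # Assumed variable names; verify against the vendor docs [15].
            "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
            "OTEL_METRICS_EXPORTER": "otlp",
            "OTEL_LOGS_EXPORTER": "otlp",
            "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
            "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        }
    )
    if enable_traces:
        # Named in the table above as the switch for trace export.
        env["CLAUDE_CODE_ENHANCED_TELEMETRY_BETA"] = "1"
    return env


# Assumes a local collector listening on the default OTLP gRPC port.
env = claude_code_otel_env("http://localhost:4317", enable_traces=True, base={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when launching the CLI, which keeps telemetry settings out of the global shell profile.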
4.2 Self-hosted observability backends
For local or on-premise work, open-source platforms can receive OTLP data and provide a full UI without sending data to a third party.
- Arize Phoenix: open-source AI engineering platform. Free to self-host with no feature limits. Supports Docker, Kubernetes, Helm, and cloud templates. Provides tracing, evaluations, datasets, experiments, and prompt management, and is built on OpenTelemetry/OpenInference [13][14]. Data can stay entirely in your infrastructure [13].
- Langfuse: open-source LLM engineering platform. Can be self-hosted or used as a managed service. Captures traces, sessions, environments, tags, and distributed trace IDs. Supports OpenAI, Anthropic, LangChain, LlamaIndex, LiteLLM, and more [11]. It also tracks usage and cost by generation, supports custom model definitions and pricing tiers, and exposes aggregated metrics via a Metrics API [12].
- OpenTelemetry Collector + Jaeger/Grafana/Prometheus: a vendor-neutral stack. You instrument once with OTLP, then route traces, metrics, and logs to the backend of your choice [1].
4.3 Local cost and quota helpers
The original file listed three community tools aimed at local monitoring, mostly around Google's Antigravity / Cloud Code quotas and the OpenClaw ecosystem. They are useful as examples of local, community-built monitors, but they are not official vendor tools:
- Antigravity Quota Watcher: VS Code extension that shows live AI model quota usage in the status bar, with a dashboard, warnings, and automatic port detection [21].
- Antigravity Claude Proxy: an unofficial proxy that exposes an Anthropic-compatible API backed by Google Antigravity Cloud Code, so Claude Code CLI can use it. The repository explicitly warns that this may violate Google's Terms of Service and can result in account bans [22].
- Crabwalk: real-time companion monitor for OpenClaw agents, showing live activity graphs across WhatsApp, Telegram, Discord, and Slack [23].
5. Monitoring cloud-managed agents
5.1 AWS: Amazon Bedrock AgentCore
AWS provides a managed observability layer for Bedrock AgentCore through Amazon CloudWatch.
- GenAI observability dashboard in CloudWatch shows trace visualizations, span metrics, error breakdowns, and session views [8][10].
- Automatic instrumentation for agents deployed on the AgentCore Runtime using OpenTelemetry; no additional OTEL libraries are needed [9].
- Non-hosted agents can be instrumented with the AWS Distro for OpenTelemetry (ADOT) SDK to emit metrics, spans, and logs to CloudWatch [8][9].
- Built-in metrics include session count, latency, duration, token usage, and error rates [8].
- Traces and spans are stored in CloudWatch Logs; you can view them via CloudWatch Transaction Search [8][10].
5.2 Microsoft Azure: AI Foundry and Application Insights
Azure uses Application Insights as the central store for agent telemetry and follows the OpenTelemetry GenAI semantic conventions [6][7].
- Agent details view in Application Insights gives a unified monitoring experience for Microsoft Foundry agents, Copilot Studio agents, and third-party agents [6].
- Server-side tracing is available for prompt agents, host agents, and workflows with no code changes once Application Insights is connected [7].
- Client-side tracing can be added with the Azure SDK and OpenTelemetry SDK for custom application logic [7].
- Foundry Agent Monitoring Dashboard tracks token usage, latency, success rates, and evaluation outcomes, and supports continuous evaluation and alerts [7].
- Local development is supported by the Microsoft Foundry Toolkit for VS Code, which can send traces to a local OTLP collector [7].
- Fleet health can be monitored across multiple agents from the Foundry Control Plane [7].
5.3 Google Cloud: Gemini Enterprise Agent Platform / Vertex AI
Google Cloud recommends building agents with the Agent Development Kit (ADK) because it natively emits OpenTelemetry telemetry aligned with the GenAI semantic conventions [3].
- Agent Runtime automatically collects built-in metrics in Cloud Monitoring without extra setup [5]. Metrics include request count, request latencies, and container CPU/memory allocation [5].
- Observability dashboards in the Gemini Enterprise Agent Platform include Overview, Evaluation, Models, Tools, Usage, and Logs views [4].
- Online monitors continuously assess quality, safety, hallucination rates, and tool-use quality [4].
- Cloud Trace stores step-by-step execution traces with directed acyclic graphs of spans and inputs/outputs [4].
- Topology views map multi-agent dependencies and traffic flows [4].
- Model Armor natively emits telemetry for policy interceptions [4].
5.4 Anthropic: Claude API and Claude Code
Anthropic provides both API-level usage reporting and SDK-level telemetry export.
- Usage & Cost Admin API gives granular, historical token and cost data. It supports time buckets (1m, 1h, 1d), filtering/grouping by API key, workspace, model, service tier, and context window, and server-side tool usage [17].
- Cost API returns daily cost breakdowns in USD [17].
- Claude Code Analytics API provides daily aggregated usage and productivity metrics for Claude Code (sessions, lines of code, commits, pull requests, tool acceptance/rejection rates, per-model tokens and costs) [18].
- Agent SDK OpenTelemetry can export traces, metrics, and log events to any OTLP backend such as Datadog, Grafana, Honeycomb, or Langfuse [15]. Anthropic also lists partner integrations: CloudZero, Datadog, Grafana Cloud, Honeycomb, and Vantage [17].
- Real-time cost tracking is available per `query()` call via `total_cost_usd` and per-model usage [16].
5.5 OpenAI: Agents SDK and organization usage APIs
OpenAI gives you both per-run usage and organization-level billing APIs.
- OpenAI Agents SDK automatically tracks `requests`, `input_tokens`, `output_tokens`, `total_tokens`, `cached_tokens`, `reasoning_tokens`, and a per-request list `request_usage_entries` for each `Runner.run(...)` [19]. Usage can be read from `result.context_wrapper.usage` or from `RunHooks` [19].
- Usage API (`/v1/organization/usage/completions`) returns token usage bucketed by minute, hour, or day, and can be grouped by project, user, API key, model, and batch status [20].
- Costs API (`/v1/organization/costs`) returns daily cost buckets [20].
- Admin API key is required for the Usage and Costs APIs [20].
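A request to the Usage API endpoint named above might be assembled as in the sketch below, which builds the request without sending it. The query parameter names (`start_time`, `bucket_width`, `group_by`) follow my reading of the cookbook example [20] and should be confirmed there. The helper name and the key value are placeholders.

```python
import urllib.parse
import urllib.request
from typing import List, Optional

BASE_URL = "https://api.openai.com"


def build_usage_request(
    admin_key: str,
    start_time: int,
    bucket_width: str = "1d",
    group_by: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a GET request for completions usage."""
    params = [("start_time", str(start_time)), ("bucket_width", bucket_width)]
    # Repeat the group_by key once per field (assumed encoding; see [20]).
    for field_name in group_by or []:
        params.append(("group_by", field_name))
    query = urllib.parse.urlencode(params)
    url = f"{BASE_URL}/v1/organization/usage/completions?{query}"
    # The Usage and Costs APIs require an admin key, not a project key.
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {admin_key}"}
    )


# start_time is a Unix timestamp in seconds; the key is a placeholder.
req = build_usage_request("sk-admin-placeholder", 1730419200, group_by=["model"])
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` and parsing the JSON body is left out so the sketch runs without credentials.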
6. Cross-platform observability platforms
The following platforms can ingest telemetry from multiple providers and frameworks, making them useful for hybrid or multi-cloud setups.
| Platform | Type | Key capabilities | Source |
|---|---|---|---|
| Langfuse | Open-source + cloud | Traces, sessions, tags, cost/token tracking, custom model definitions, metrics API, OTLP support [11][12]. | [11][12] |
| Arize Phoenix | Open-source + cloud | Self-hostable, free, no feature limits; tracing, evaluations, datasets, prompt management, OpenTelemetry [13][14]. | [13][14] |
| LangSmith | Cloud + LangChain | Traces, evaluation, prompt management for LangChain/LangGraph agents. | Vendor docs (not cited here) |
| Braintrust | Cloud | Evals, logging, and experiments for AI agents. | Vendor docs (not cited here) |
| AgentOps | Cloud | Session tracking, cost, and agent reliability. | Vendor docs (not cited here) |
| Portkey | Gateway | LLM gateway with observability, routing, and cost control. | Vendor docs (not cited here) |
| Datadog / Honeycomb / Grafana Cloud | Observability backends | Claude API integrations and OTLP-based AI observability [17]. | [17] |
For a vendor-neutral stack, combine OpenTelemetry instrumentation with a backend from the list above.
7. Capability comparison by context
| Capability | Local self-hosted (Phoenix / Langfuse) | AWS Bedrock AgentCore | Azure AI Foundry | Google Cloud Agent Platform | Anthropic Claude SDK | OpenAI Agents SDK |
|---|---|---|---|---|---|---|
| Execution tracing | Full span-level tracing [13][11] | CloudWatch traces/spans [8][10] | Application Insights traces [6][7] | Cloud Trace + DAG view [4] | OTLP traces [15] | Per-run usage, limited native traces [19] |
| Metrics & dashboards | Custom + metrics API [12] | CloudWatch GenAI dashboard [8] | Agent Monitoring Dashboard + Grafana [6][7] | Cloud Monitoring dashboards [4][5] | OTLP metrics [15] | Usage object per run [19] |
| Cost tracking | Per-generation/token cost [12] | Token usage + CloudWatch [8] | Token usage & cost [6][7] | Token usage + cost [4] | total_cost_usd + Usage API [16][17] | Usage API + Costs API [19][20] |
| Auto-instrumentation | SDK integration required | AgentCore runtime auto-OTEL [9] | Server-side for prompt/host agents [7] | ADK + Agent Runtime auto [3][5] | Env-variable OTLP [15] | Built-in usage tracking [19] |
| Evaluation / quality | Evals, datasets, experiments [13] | CloudWatch errors + custom metrics [8] | Continuous evaluation + alerts [7] | Online monitors (quality, safety, hallucination) [4] | Partner evals / custom via traces [17] | Hooks + custom evals [19] |
| Data residency / privacy | Self-hostable, air-gapped [13] | CloudWatch in your AWS account | Application Insights in your Azure subscription | Google Cloud Observability | Configurable, prompt logging opt-in [15] | Admin key access to billing data [20] |
8. Best practices
- Instrument with OpenTelemetry: Use the GenAI semantic conventions (`gen_ai.*`) so telemetry is portable across backends [1][2].
- Track both real-time telemetry and billing data: Export OTLP signals for live debugging, and reconcile monthly spend against the provider's billing API (Anthropic Usage & Cost API, OpenAI Costs API) [17][20].
- Attribute every cost to a run, agent, and user: Use `service.name`, `deployment.environment`, `enduser.id`, and custom tags for chargeback and anomaly detection [15][11].
- Capture the full execution graph: Trace LLM calls, tool calls, retrievals, and handoffs so you can see why an agent produced an output, not just that it returned one [11][13].
- Monitor quality, not just uptime: Add LLM-as-judge evals, hallucination checks, safety metrics, and tool-use correctness scores [4][7][13].
- Protect sensitive data: Redact prompts, tool arguments, and PII before exporting; enable prompt logging only when needed and apply the same access controls as production logs [15][7].
- Set budget and anomaly alerts: Alert on cost spikes, token surges, rate-limit errors, and latency regressions before they become incidents [5][8].
- Use self-hosted backends for sensitive workloads: Arize Phoenix and Langfuse can run entirely on your infrastructure to keep traces and prompts in-house [13][11].
- Evaluate locally before deploying: Use the Microsoft Foundry Toolkit, Arize Phoenix, or Langfuse locally to iterate on prompts, evals, and traces before pushing to production [7][13].
- Keep instrumentation up to date: Agent frameworks and semantic conventions evolve quickly; review instrumentation packages and convention versions regularly [1].
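The "protect sensitive data" practice can be sketched as a redaction pass applied to span or log attributes before they are exported. The attribute keys treated as sensitive and the email pattern below are illustrative assumptions. A real deployment would need a broader PII policy than one regular expression.

```python
import re
from typing import Any, Dict, Mapping

# Simple email matcher; illustrative only, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical attribute keys that carry prompt or tool-argument content.
SENSITIVE_KEYS = {"prompt", "tool_arguments"}


def redact(attributes: Mapping[str, Any], log_prompts: bool = False) -> Dict[str, Any]:
    """Return a copy of the attributes that is safer to export."""
    cleaned: Dict[str, Any] = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS and not log_prompts:
            # Prompt logging is opt-in: drop the content entirely by default.
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Mask email addresses in any remaining string value.
            cleaned[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            cleaned[key] = value
    return cleaned


span_attributes = {
    "prompt": "Summarize the ticket from jane@example.com",
    "note": "escalated by jane@example.com",
    "input_tokens": 42,
}
print(redact(span_attributes))
print(redact(span_attributes, log_prompts=True))
```

Running the redaction in the exporting process, before data reaches a collector, means an unredacted prompt never leaves the host even if a downstream backend is misconfigured.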
9. References
1. OpenTelemetry - AI Agent Observability: Evolving Standards and Best Practices (2025) - https://opentelemetry.io/blog/2025/ai-agent-observability/
2. OpenTelemetry - Semantic Conventions for Generative AI - https://opentelemetry.io/docs/specs/semconv/gen-ai/
3. Google Cloud - Agent observability - https://docs.cloud.google.com/stackdriver/docs/observability/agent-observability
4. Google Cloud - Gemini Enterprise Agent Platform: Observability overview - https://docs.cloud.google.com/gemini-enterprise-agent-platform/optimize/observability/overview
5. Google Cloud - Gemini Enterprise Agent Platform: Set up monitoring - https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/runtime/monitoring
6. Microsoft Learn - Monitor AI agents with Application Insights - https://learn.microsoft.com/en-us/azure/azure-monitor/app/agents-view
7. Microsoft Learn - Set up tracing for AI agents in Microsoft Foundry - https://learn.microsoft.com/en-us/azure/foundry/observability/how-to/trace-agent-setup
8. AWS - Observe your agent applications on Amazon Bedrock AgentCore Observability - https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability.html
9. AWS - Get started with AgentCore Observability - https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-get-started.html
10. AWS - Amazon Bedrock AgentCore in CloudWatch - https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AgentCore-Agents.html
11. Langfuse - LLM Observability & Application Tracing - https://langfuse.com/docs/observability/overview
12. Langfuse - Token & Cost Tracking - https://langfuse.com/docs/observability/features/token-and-cost-tracking
13. Arize - Phoenix: Self-Hosting - https://arize.com/docs/phoenix/self-hosting
14. Arize - Phoenix: Get Started Tracing - https://arize.com/docs/phoenix/get-started/get-started-tracing
15. Anthropic - Claude Agent SDK: Observability with OpenTelemetry - https://code.claude.com/docs/en/agent-sdk/observability
16. Anthropic - Claude Agent SDK: Track cost and usage - https://code.claude.com/docs/en/agent-sdk/cost-tracking
17. Anthropic - Usage and Cost API - https://platform.claude.com/docs/en/manage-claude/usage-cost-api
18. Anthropic - Claude Code Analytics API - https://platform.claude.com/docs/en/manage-claude/claude-code-analytics-api
19. OpenAI - OpenAI Agents SDK: Usage - https://openai.github.io/openai-agents-python/usage/
20. OpenAI - How to use the Usage API and Cost API - https://developers.openai.com/cookbook/examples/completions_usage_api
21. GitHub - AntigravityQuotaWatcher - https://github.com/wusimpl/AntigravityQuotaWatcher
22. GitHub - antigravity-claude-proxy - https://github.com/badrisnarayanan/antigravity-claude-proxy
23. GitHub - crabwalk - https://github.com/luccast/crabwalk