AI Agent Monitoring - spinningideas/resources GitHub Wiki

AI Agent Monitoring: Local & Cloud Observability Guide

A structured reference for monitoring AI agents that run locally, in the cloud, or across multiple providers. It covers telemetry standards, cost tracking, execution tracing, and provider-specific observability stacks.

Scope: The focus is on practical observability—what to collect, how to collect it, and where to analyze it—whether you are self-hosting on a laptop or running managed agents on AWS, Azure, or Google Cloud.

1. Why AI agent monitoring is different

Traditional application monitoring is built around deterministic signals: an API returns 200 or 500, a server responds or it does not. AI agents are different because the LLM dynamically directs its own process and tool usage [1]. A request can succeed technically yet still be wrong: the agent may hallucinate, call the wrong tool, retrieve the wrong document, or produce a plausible but incorrect answer. That means observability must capture:

Reasoning steps – which model calls, tool calls, and retrievals happened in what order.
Quality signals – whether the output was correct, safe, and on-task.
Cost signals – token usage, estimated spend, and per-agent/run attribution.
Security signals – permission waits, blocked tools, policy interceptions, and sensitive data exposure.

Google Cloud summarizes the core benefit: because agents are non-deterministic and complex, observability is essential for understanding, debugging, evaluating, and improving their performance, safety, and reliability [3].

2. What to monitor

2.1 Telemetry signals

Signal	What it captures	Typical use
Traces	End-to-end execution path with spans for each step (LLM call, tool call, retrieval) [1][11].	Debug a specific run, find where time is spent, identify failures.
Metrics	Token counts, latency, request counts, error rates, costs, session counts [3][8].	Dashboards, alerting, budget tracking, fleet health.
Logs	Raw events, errors, prompt/response content (when safe) [3][4].	Deep-dive troubleshooting, audit, security review.
Evaluations	Quality scores, hallucination rates, safety metrics, tool-use correctness [4][7].	Continuous quality monitoring, regressions, compare versions.
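
To make the trace row concrete, here is a minimal sketch of a run represented as a list of spans, with two helpers that answer the typical debugging questions ("where was time spent?" and "what failed?"). The `Span` class, its field names, and the sample run are hypothetical illustrations, not types from the OpenTelemetry SDK or any backend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One step of an agent run (hypothetical shape, not an OTel SDK type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the run
    kind: str                 # e.g. "agent", "llm", "tool", "retrieval"
    name: str
    start_ms: int
    end_ms: int
    error: bool = False

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_step(spans: list) -> Span:
    """Return the non-root span with the longest duration."""
    children = [s for s in spans if s.parent_id is not None]
    return max(children, key=lambda s: s.duration_ms)


def failed_steps(spans: list) -> list:
    """Return the names of all spans that recorded an error."""
    return [s.name for s in spans if s.error]


# Made-up run: one root span with three child steps.
run = [
    Span("1", None, "agent", "answer_question", 0, 2300),
    Span("2", "1", "retrieval", "search_docs", 10, 410),
    Span("3", "1", "llm", "draft_answer", 420, 1900),
    Span("4", "1", "tool", "send_email", 1910, 2290, error=True),
]

print(slowest_step(run).name)
print(failed_steps(run))
```

A real backend stores the same information (IDs, parent links, timestamps, status) and renders it as a waterfall or graph; the point of the sketch is only that per-step timing and failure questions reduce to simple queries over spans.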

2.2 Key metrics to track

Tokens and cost – input, output, cached, and reasoning tokens; estimated or actual USD cost per run/agent [12][16][19].
Latency – time to first token, total duration, per-step latency [4][5].
Error rates – tool failures, API errors, rate-limit hits, timeout loops [8][9].
Tool usage – call counts, success/failure rates, latency per tool, frequency of “no tool called” [4][3].
Session metrics – number of sessions, turns per session, completion rates [8].
Quality metrics – response quality, hallucination rate, safety, user/tool acceptance rates [4][7][18].
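
As a rough illustration of the token, cost, and error-rate metrics above, the following sketch estimates the cost of one run from its token counts and computes a tool error rate. The price table, the model name `example-model`, and the `RunUsage` fields are assumptions made for the example; real prices and token categories come from your provider.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "cached": 0.30, "output": 15.00},
}


@dataclass
class RunUsage:
    model: str
    input_tokens: int   # uncached input tokens
    cached_tokens: int  # input tokens served from a prompt cache
    output_tokens: int  # assumed here to include reasoning tokens


def estimate_cost_usd(usage: RunUsage) -> float:
    """Estimate the cost of one run from its token counts."""
    price = PRICES_PER_MTOK[usage.model]
    total = (
        usage.input_tokens * price["input"]
        + usage.cached_tokens * price["cached"]
        + usage.output_tokens * price["output"]
    )
    return total / 1_000_000


def error_rate(outcomes: list) -> float:
    """Fraction of failed calls; each outcome is True on success."""
    if not outcomes:
        return 0.0
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


usage = RunUsage("example-model", input_tokens=1000, cached_tokens=4000, output_tokens=500)
print(round(estimate_cost_usd(usage), 6))
print(error_rate([True, True, False, True]))
```

An estimate like this is useful for live dashboards, but it can drift from the invoice when prices or token accounting change, so it is best treated as an approximation rather than billing truth.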

3. Open standards: OpenTelemetry and GenAI semantics

The most reliable way to avoid vendor lock-in is to instrument agents with OpenTelemetry and the emerging GenAI semantic conventions. The OpenTelemetry GenAI SIG is defining standards for LLMs, vector databases, and AI agents so frameworks can report consistent traces, metrics, and logs [1].

3.1 Two instrumentation patterns

Baked-in instrumentation – the framework emits telemetry natively (e.g., CrewAI). Pros: zero setup for users. Cons: possible dependency bloat, risk of version lock-in, slower convention updates [1].
External instrumentation – separate OpenTelemetry packages (e.g., Traceloop, Langtrace, OpenTelemetry instrumentation-genai) are added to the agent. Pros: decouples observability from the framework, leverages community maintenance, easier to mix cloud providers. Cons: risk of package fragmentation if versions drift [1].

Regardless of the pattern, the telemetry should follow the same GenAI semantic conventions (gen_ai.* attributes) for model names, token counts, finish reasons, and so on [2][7].
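
A minimal sketch of what "following the conventions" can look like in practice: a helper that builds a `gen_ai.*` attribute dictionary for one model call. The attribute names reflect the GenAI semantic conventions as commonly published, but the conventions are still evolving (for example, older versions used `gen_ai.system` where newer ones use `gen_ai.provider.name`), so the names should be checked against the version of the specification you target [2]. The helper itself and the sample values are hypothetical.

```python
def genai_span_attributes(
    provider: str,
    operation: str,
    request_model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: list,
) -> dict:
    """Build gen_ai.* attributes for a single model call.

    Attribute names are assumed from the GenAI semantic conventions;
    verify them against the convention version you target.
    """
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


attrs = genai_span_attributes(
    provider="example-provider",
    operation="chat",
    request_model="example-model",
    input_tokens=1200,
    output_tokens=300,
    finish_reasons=["stop"],
)
print(attrs)
```

With the OpenTelemetry Python API, a dictionary like this could be attached to an active span via `span.set_attributes(attrs)`; keeping the mapping in one function makes it easier to update when the conventions change.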

3.2 Why standards matter for agents

Multi-agent topologies can be mapped across tools and services.
Cross-cloud dashboards can aggregate telemetry from AWS, Azure, and Google Cloud without re-instrumenting.
Cost attribution can be broken down by provider, model, agent, user, and run using the same attribute names.
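
The cost-attribution point can be sketched as a simple group-by over telemetry records that share attribute names. In this illustration the record layout, the `agent` and `cost_usd` keys, and all values are hypothetical; only the idea of grouping on consistent attribute names is taken from the text above.

```python
from collections import defaultdict


def cost_by(records: list, *keys: str) -> dict:
    """Sum cost_usd over records, grouped by the given attribute names."""
    totals = defaultdict(float)
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += record["cost_usd"]
    return dict(totals)


# Made-up records from two providers that use the same attribute names.
records = [
    {"gen_ai.provider.name": "provider-a", "gen_ai.request.model": "model-x",
     "agent": "triage", "cost_usd": 0.25},
    {"gen_ai.provider.name": "provider-a", "gen_ai.request.model": "model-x",
     "agent": "research", "cost_usd": 0.5},
    {"gen_ai.provider.name": "provider-b", "gen_ai.request.model": "model-y",
     "agent": "triage", "cost_usd": 0.125},
]

print(cost_by(records, "gen_ai.provider.name"))
print(cost_by(records, "agent"))
```

Because every record uses the same keys, the same function answers "cost per provider" and "cost per agent" without any provider-specific mapping code.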

4. Monitoring local agents

Local agents include scripts on your laptop, CLI tools such as Claude Code, local IDE extensions, or self-hosted frameworks running in Docker or Kubernetes.

4.1 SDK-level tracking

SDK / Tool	Built-in cost/usage tracking	OpenTelemetry export
OpenAI Agents SDK	Yes. `Runner.run(...)` returns `usage` with `requests`, `input_tokens`, `output_tokens`, `cached_tokens`, `reasoning_tokens`, and per-request entries in `request_usage_entries` [19].	Limited; can use hooks to send usage to a backend.
Anthropic Claude Agent SDK	Yes. Per-step token usage, per-model cost, and `total_cost_usd` on each `query()` result [16].	Yes. Native OTLP export of traces, metrics, and log events; spans include `claude_code.interaction`, `claude_code.llm_request`, and `claude_code.tool` [15].
Claude Code CLI	Via `total_cost_usd` in the result stream [16].	Yes. Emits OTLP metrics, logs, and traces (traces require `CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1`) [15].
Microsoft Foundry Toolkit for VS Code	Through the agent SDK.	Yes. Local OTLP-compatible collector for tracing during development [7].
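
For SDKs that expose usage but offer limited OpenTelemetry export, such as the OpenAI Agents SDK row above, one option is to flatten the usage object into a plain record and send it to a backend of your choice. The sketch below does not import the SDK: it reads a subset of the field names listed in the table from any object that has them, and uses a `SimpleNamespace` as a stand-in for a real usage object. The `usage_to_record` helper and the `run_id` field are hypothetical.

```python
from types import SimpleNamespace

# Subset of the usage field names listed in the table above.
USAGE_FIELDS = ("requests", "input_tokens", "output_tokens", "total_tokens")


def usage_to_record(usage, run_id: str) -> dict:
    """Flatten a usage-like object into a dict suitable for logging."""
    record = {"run_id": run_id}
    for name in USAGE_FIELDS:
        # Missing fields default to 0 so the record shape stays stable.
        record[name] = getattr(usage, name, 0)
    return record


# Stand-in for the usage object of a finished run.
fake_usage = SimpleNamespace(
    requests=2, input_tokens=1200, output_tokens=300, total_tokens=1500
)
record = usage_to_record(fake_usage, run_id="run-001")
print(record)
```

In a real agent, the same helper could be called from a run hook or after the run completes, and the resulting dictionary written to a log line or metrics backend.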

4.2 Self-hosted observability backends

For local or on-premise work, open-source platforms can receive OTLP data and provide a full UI without sending data to a third party.

Arize Phoenix – open-source AI engineering platform. Free to self-host with no feature limits. Supports Docker, Kubernetes, Helm, and cloud templates. Provides tracing, evaluations, datasets, experiments, prompt management, and is built on OpenTelemetry/OpenInference [13][14]. Data can stay entirely in your infrastructure [13].
Langfuse – open-source LLM engineering platform. Can be self-hosted or used as a managed service. Captures traces, sessions, environments, tags, and distributed trace IDs. Supports OpenAI, Anthropic, LangChain, LlamaIndex, LiteLLM, and more [11]. It also tracks usage and cost by generation, supports custom model definitions and pricing tiers, and exposes aggregated metrics via a Metrics API [12].
OpenTelemetry Collector + Jaeger/Grafana/Prometheus – a vendor-neutral stack. You instrument once with OTLP, then route traces, metrics, and logs to the backend of your choice [1].
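
For the vendor-neutral stack, pointing a local agent at a collector is often a matter of setting the standard OpenTelemetry environment variables before the agent process starts. The sketch below builds such an environment; it assumes a collector listening on the default OTLP/HTTP port 4318 on localhost, and the service name and environment values are placeholders.

```python
import os


def otlp_env(
    service_name: str,
    environment: str,
    endpoint: str = "http://localhost:4318",
    protocol: str = "http/protobuf",
) -> dict:
    """Return standard OpenTelemetry environment variables for OTLP export."""
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_RESOURCE_ATTRIBUTES": f"deployment.environment={environment}",
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
    }


# Merge with the current environment before launching the agent process,
# e.g. subprocess.run(["python", "agent.py"], env=child_env).
child_env = {**os.environ, **otlp_env("local-agent", "dev")}
print(child_env["OTEL_EXPORTER_OTLP_ENDPOINT"])
```

Whether an agent honors these variables depends on how it is instrumented; SDKs that configure their exporter from the environment pick them up, while others may need an explicit exporter setup in code.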

4.3 Local cost and quota helpers

Three community tools target local monitoring, mostly around Google's Antigravity / Cloud Code quotas and the OpenClaw ecosystem. They are useful examples of local, community-built monitors, but they are not official vendor tools:

Antigravity Quota Watcher – VS Code extension that shows live AI model quota usage in the status bar, with a dashboard, warnings, and automatic port detection [21].
Antigravity Claude Proxy – an unofficial proxy that exposes an Anthropic-compatible API backed by Google Antigravity Cloud Code, so Claude Code CLI can use it. The repository explicitly warns that this may violate Google's Terms of Service and can result in account bans [22].
Crabwalk – real-time companion monitor for OpenClaw agents, showing live activity graphs across WhatsApp, Telegram, Discord, and Slack [23].

5. Monitoring cloud-managed agents

5.1 AWS – Amazon Bedrock AgentCore

AWS provides a managed observability layer for Bedrock AgentCore through Amazon CloudWatch.

GenAI observability dashboard in CloudWatch shows trace visualizations, span metrics, error breakdowns, and session views [8][10].
Automatic instrumentation for agents deployed on the AgentCore Runtime using OpenTelemetry; no additional OTEL libraries are needed [9].
Non-hosted agents can be instrumented with the AWS Distro for OpenTelemetry (ADOT) SDK to emit metrics, spans, and logs to CloudWatch [8][9].
Built-in metrics include session count, latency, duration, token usage, and error rates [8].
Traces and spans are stored in CloudWatch Logs; you can view them via CloudWatch Transaction Search [8][10].

5.2 Microsoft Azure – AI Foundry and Application Insights

Azure uses Application Insights as the central store for agent telemetry and follows the OpenTelemetry GenAI semantic conventions [6][7].

Agent details view in Application Insights gives a unified monitoring experience for Microsoft Foundry agents, Copilot Studio agents, and third-party agents [6].
Server-side tracing is available for prompt agents, host agents, and workflows with no code changes once Application Insights is connected [7].
Client-side tracing can be added with the Azure SDK and OpenTelemetry SDK for custom application logic [7].
Foundry Agent Monitoring Dashboard tracks token usage, latency, success rates, evaluation outcomes, and supports continuous evaluation and alerts [7].
Local development is supported by the Microsoft Foundry Toolkit for VS Code, which can send traces to a local OTLP collector [7].
Fleet health can be monitored across multiple agents from the Foundry Control Plane [7].

5.3 Google Cloud – Gemini Enterprise Agent Platform / Vertex AI

Google Cloud recommends building agents with the Agent Development Kit (ADK) because it natively emits OpenTelemetry telemetry aligned with the GenAI semantic conventions [3].

Agent Runtime automatically collects built-in metrics in Cloud Monitoring without extra setup [5]. Metrics include request count, request latencies, and container CPU/memory allocation [5].
Observability dashboards in the Gemini Enterprise Agent Platform include Overview, Evaluation, Models, Tools, Usage, and Logs views [4].
Online monitors continuously assess quality, safety, hallucination rates, and tool-use quality [4].
Cloud Trace stores step-by-step execution traces with directed acyclic graphs of spans and inputs/outputs [4].
Topology views map multi-agent dependencies and traffic flows [4].
Model Armor natively emits telemetry for policy interceptions [4].

5.4 Anthropic – Claude API and Claude Code

Anthropic provides both API-level usage reporting and SDK-level telemetry export.

Usage & Cost Admin API gives granular, historical token and cost data. It supports time buckets (1m, 1h, 1d); filtering and grouping by API key, workspace, model, service tier, and context window; and reporting of server-side tool usage [17].
Cost API returns daily cost breakdowns in USD [17].
Claude Code Analytics API provides daily aggregated usage and productivity metrics for Claude Code (sessions, lines of code, commits, pull requests, tool acceptance/rejection rates, per-model tokens and costs) [18].
Agent SDK OpenTelemetry can export traces, metrics, and log events to any OTLP backend such as Datadog, Grafana, Honeycomb, or Langfuse [15]. Anthropic also lists partner integrations: CloudZero, Datadog, Grafana Cloud, Honeycomb, and Vantage [17].
Real-time cost tracking is available per query() call via total_cost_usd and per-model usage [16].
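
One way to use the per-query cost figure is to accumulate it against a session budget. The sketch below assumes only that each result object carries a `total_cost_usd` attribute, as described above; it does not import the Agent SDK, and the `SessionBudget` class and the sample results are hypothetical.

```python
from types import SimpleNamespace


class SessionBudget:
    """Accumulate per-query costs and report when a limit is exceeded."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, result) -> bool:
        """Add the result's cost; return True while within budget."""
        cost = getattr(result, "total_cost_usd", None)
        self.spent_usd += cost or 0.0  # treat a missing cost as zero
        return self.spent_usd <= self.limit_usd


budget = SessionBudget(limit_usd=0.6)
first_ok = budget.record(SimpleNamespace(total_cost_usd=0.25))
second_ok = budget.record(SimpleNamespace(total_cost_usd=0.5))
print(first_ok, second_ok, budget.spent_usd)
```

A caller could stop issuing further queries, or raise an alert, as soon as `record` returns `False`.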

5.5 OpenAI – Agents SDK and organization usage APIs

OpenAI gives you both per-run usage and organization-level billing APIs.

OpenAI Agents SDK automatically tracks requests, input_tokens, output_tokens, total_tokens, cached_tokens, reasoning_tokens, and a per-request list request_usage_entries for each Runner.run(...) [19]. Usage can be read from result.context_wrapper.usage or from RunHooks [19].
Usage API (/v1/organization/usage/completions) returns token usage bucketed by minute, hour, or day, and can be grouped by project, user, API key, model, and batch status [20].
Costs API (/v1/organization/costs) returns daily cost buckets [20].
Admin API key is required for the Usage and Costs APIs [20].
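
The following sketch shows how a request to the Usage API might be assembled. It only builds the URL and headers and makes no network call. The endpoint path comes from the text above; the base URL, the query parameter names (`start_time`, `bucket_width`, `group_by`), the group-by values, and the repeated-key encoding of `group_by` are assumptions that should be checked against the OpenAI documentation [20].

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openai.com"  # assumed base URL
USAGE_PATH = "/v1/organization/usage/completions"


def usage_request(
    admin_key: str,
    start_time: int,
    bucket_width: str = "1d",
    group_by: tuple = ("model",),
) -> tuple:
    """Build the URL and headers for a usage query (parameter names assumed)."""
    params = [("start_time", start_time), ("bucket_width", bucket_width)]
    # Repeat the key once per grouping dimension.
    params += [("group_by", dimension) for dimension in group_by]
    url = f"{BASE_URL}{USAGE_PATH}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {admin_key}"}
    return url, headers


url, headers = usage_request(
    admin_key="sk-admin-placeholder",  # placeholder, not a real key
    start_time=1700000000,             # Unix timestamp in seconds
    group_by=("model", "project_id"),
)
print(url)
```

The returned pair can be passed to any HTTP client. Keeping the admin key out of source code (for example, reading it from an environment variable) is advisable because it grants access to organization-wide billing data.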

6. Cross-platform observability platforms

The following platforms can ingest telemetry from multiple providers and frameworks, making them useful for hybrid or multi-cloud setups.

Platform	Type	Key capabilities	Source
Langfuse	Open-source + cloud	Traces, sessions, tags, cost/token tracking, custom model definitions, metrics API, OTLP support [11][12].	[11][12]
Arize Phoenix	Open-source + cloud	Self-hostable, free, no feature limits; tracing, evaluations, datasets, prompt management, OpenTelemetry [13][14].	[13][14]
LangSmith	Cloud + LangChain	Traces, evaluation, prompt management for LangChain/LangGraph agents.	Vendor docs (not cited here)
Braintrust	Cloud	Evals, logging, and experiments for AI agents.	Vendor docs (not cited here)
AgentOps	Cloud	Session tracking, cost, and agent reliability.	Vendor docs (not cited here)
Portkey	Gateway	LLM gateway with observability, routing, and cost control.	Vendor docs (not cited here)
Datadog / Honeycomb / Grafana Cloud	Observability backends	Claude API integrations and OTLP-based AI observability [17].	[17]

For a vendor-neutral stack, combine OpenTelemetry instrumentation with a backend from the list above.

7. Capability comparison by context

Capability	Local self-hosted (Phoenix / Langfuse)	AWS Bedrock AgentCore	Azure AI Foundry	Google Cloud Agent Platform	Anthropic Claude SDK	OpenAI Agents SDK
Execution tracing	Full span-level tracing [13][11]	CloudWatch traces/spans [8][10]	Application Insights traces [6][7]	Cloud Trace + DAG view [4]	OTLP traces [15]	Per-run usage, limited native traces [19]
Metrics & dashboards	Custom + metrics API [12]	CloudWatch GenAI dashboard [8]	Agent Monitoring Dashboard + Grafana [6][7]	Cloud Monitoring dashboards [4][5]	OTLP metrics [15]	Usage object per run [19]
Cost tracking	Per-generation/token cost [12]	Token usage + CloudWatch [8]	Token usage & cost [6][7]	Token usage + cost [4]	`total_cost_usd` + Usage API [16][17]	Usage API + Costs API [19][20]
Auto-instrumentation	SDK integration required	AgentCore runtime auto-OTEL [9]	Server-side for prompt/host agents [7]	ADK + Agent Runtime auto [3][5]	Env-variable OTLP [15]	Built-in usage tracking [19]
Evaluation / quality	Evals, datasets, experiments [13]	CloudWatch errors + custom metrics [8]	Continuous evaluation + alerts [7]	Online monitors (quality, safety, hallucination) [4]	Partner evals / custom via traces [17]	Hooks + custom evals [19]
Data residency / privacy	Self-hostable, air-gapped [13]	CloudWatch in your AWS account	Application Insights in your Azure subscription	Google Cloud Observability	Configurable, prompt logging opt-in [15]	Admin key access to billing data [20]

8. Best practices

Instrument with OpenTelemetry – Use the GenAI semantic conventions (gen_ai.*) so telemetry is portable across backends [1][2].
Track both real-time telemetry and billing data – Export OTLP signals for live debugging, and reconcile monthly spend against the provider's billing API (Anthropic Usage & Cost API, OpenAI Costs API) [17][20].
Attribute every cost to a run, agent, and user – Use service.name, deployment.environment, enduser.id, and custom tags for chargeback and anomaly detection [15][11].
Capture the full execution graph – Trace LLM calls, tool calls, retrievals, and handoffs so you can see why an agent produced an output, not just that it returned one [11][13].
Monitor quality, not just uptime – Add LLM-as-judge evals, hallucination checks, safety metrics, and tool-use correctness scores [4][7][13].
Protect sensitive data – Redact prompts, tool arguments, and PII before exporting; enable prompt logging only when needed and apply the same access controls as production logs [15][7].
Set budget and anomaly alerts – Alert on cost spikes, token surges, rate-limit errors, and latency regressions before they become incidents [5][8].
Use self-hosted backends for sensitive workloads – Arize Phoenix and Langfuse can run entirely on your infrastructure to keep traces and prompts in-house [13][11].
Evaluate locally before deploying – Use the Microsoft Foundry Toolkit, Arize Phoenix, or Langfuse locally to iterate on prompts, evals, and traces before pushing to production [7][13].
Keep instrumentation up to date – Agent frameworks and semantic conventions evolve quickly; review instrumentation packages and convention versions regularly [1].
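
As an illustration of the "protect sensitive data" practice, the sketch below filters span attributes before export: content-bearing keys are dropped unless prompt logging is explicitly enabled, and email addresses in the remaining string values are masked. The key names in `CONTENT_KEYS`, the email pattern, and the placeholder text are hypothetical and deliberately simple; production redaction usually needs broader PII detection.

```python
import re

# Simple email pattern for illustration; real PII detection needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical attribute keys that carry prompt or tool-argument content.
CONTENT_KEYS = {"prompt", "completion", "tool.arguments"}


def redact_attributes(attrs: dict, log_prompts: bool = False) -> dict:
    """Drop content keys unless enabled, and mask emails in string values."""
    cleaned = {}
    for key, value in attrs.items():
        if key in CONTENT_KEYS and not log_prompts:
            continue  # prompt logging is opt-in
        if isinstance(value, str):
            value = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        cleaned[key] = value
    return cleaned


attrs = {
    "gen_ai.request.model": "example-model",
    "prompt": "Email alice@example.com the report",
    "note": "contact bob@example.org",
}
print(redact_attributes(attrs))
print(redact_attributes(attrs, log_prompts=True))
```

Running redaction in the exporting process, before data leaves the machine, means the backend never receives the raw values, which is simpler to reason about than scrubbing after ingestion.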

9. References

[1] OpenTelemetry – AI Agent Observability: Evolving Standards and Best Practices (2025) – https://opentelemetry.io/blog/2025/ai-agent-observability/
[2] OpenTelemetry – Semantic Conventions for Generative AI – https://opentelemetry.io/docs/specs/semconv/gen-ai/
[3] Google Cloud – Agent observability – https://docs.cloud.google.com/stackdriver/docs/observability/agent-observability
[4] Google Cloud – Gemini Enterprise Agent Platform: Observability overview – https://docs.cloud.google.com/gemini-enterprise-agent-platform/optimize/observability/overview
[5] Google Cloud – Gemini Enterprise Agent Platform: Set up monitoring – https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/runtime/monitoring
[6] Microsoft Learn – Monitor AI agents with Application Insights – https://learn.microsoft.com/en-us/azure/azure-monitor/app/agents-view
[7] Microsoft Learn – Set up tracing for AI agents in Microsoft Foundry – https://learn.microsoft.com/en-us/azure/foundry/observability/how-to/trace-agent-setup
[8] AWS – Observe your agent applications on Amazon Bedrock AgentCore Observability – https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability.html
[9] AWS – Get started with AgentCore Observability – https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-get-started.html
[10] AWS – Amazon Bedrock AgentCore in CloudWatch – https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AgentCore-Agents.html
[11] Langfuse – LLM Observability & Application Tracing – https://langfuse.com/docs/observability/overview
[12] Langfuse – Token & Cost Tracking – https://langfuse.com/docs/observability/features/token-and-cost-tracking
[13] Arize – Phoenix: Self-Hosting – https://arize.com/docs/phoenix/self-hosting
[14] Arize – Phoenix: Get Started Tracing – https://arize.com/docs/phoenix/get-started/get-started-tracing
[15] Anthropic – Claude Agent SDK: Observability with OpenTelemetry – https://code.claude.com/docs/en/agent-sdk/observability
[16] Anthropic – Claude Agent SDK: Track cost and usage – https://code.claude.com/docs/en/agent-sdk/cost-tracking
[17] Anthropic – Usage and Cost API – https://platform.claude.com/docs/en/manage-claude/usage-cost-api
[18] Anthropic – Claude Code Analytics API – https://platform.claude.com/docs/en/manage-claude/claude-code-analytics-api
[19] OpenAI – OpenAI Agents SDK: Usage – https://openai.github.io/openai-agents-python/usage/
[20] OpenAI – How to use the Usage API and Cost API – https://developers.openai.com/cookbook/examples/completions_usage_api
[21] GitHub – AntigravityQuotaWatcher – https://github.com/wusimpl/AntigravityQuotaWatcher
[22] GitHub – antigravity-claude-proxy – https://github.com/badrisnarayanan/antigravity-claude-proxy
[23] GitHub – crabwalk – https://github.com/luccast/crabwalk