Your database query takes 150ms. You have an alert for it. Cache hit rate is 87%. You know what's normal.
Your agent thinks for 45 seconds. Five tool calls. Three retries. Costs $0.23. You have no idea if that's good or bad.
Traditional APM measures the wrong things. Response time tells you how long the agent deliberated, not whether the deliberation was sound. Error rates don't distinguish between a tool failure and a wrong decision.
You need different metrics.
What Traditional APM Misses
APM measures infrastructure:
- Latency (how long it took)
- Throughput (how many requests per second)
- Error rate (what percentage failed)
- Resource usage (CPU, memory, disk)
These work for stateless request-response systems. They don't work for agents.
An agent that takes 120 seconds to reason and make four tool calls isn't automatically worse than one that completes in 2 seconds. The slow agent might be reasoning carefully through a complex decision. The fast agent might be hallucinating confidently.
Error rates are worse for agent systems. If an agent tries to call a tool it doesn't have permission to use, and the system blocks it with a 403, is that an error? By traditional APM standards, yes. By operational standards, no — the system worked exactly as designed.
Agent systems have operational concerns that traditional APM doesn't measure:
- Token consumption: Each token costs money. Is the agent reasoning for 5 seconds (cheap) or 50 seconds (expensive)? Did it suddenly start consuming more tokens?
- Decision quality: Did it call the right tool, or did it confidently call the wrong one? Are its decisions getting worse?
- Tool accuracy: Which tools does the agent trust and use frequently? Which ones does it repeatedly fail to use correctly?
- Governance effectiveness: How many unauthorized calls did the system block? Is governance actually working, or is it just theater?
- Cost attribution: What's the actual dollar cost of this execution? Which agents are expensive? Which are cheap? Is cost growing?
These aren't infrastructure metrics. They're observability metrics. You need them to know if your agent is actually working correctly.
Token Consumption and Cost
Every execution consumes tokens. Input tokens, output tokens, sometimes a premium for fine-tuned models. You're paying for computation that traditional metrics ignore.
Track these:
Cost per execution
Agent: data-processor
Total executions: 1,247
Total cost: $87.53
Cost per execution: $0.070
95th percentile: $0.156
Normal execution costs $0.070. Some cost $0.15 (likely complex cases). If cost jumps to $0.30 and stays there, something changed. Agent reasoning longer. Context window expanded. Context quality degraded. You need to detect this and investigate before it becomes a $10,000/day bill.
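As an illustration, a sustained jump like this is easy to catch with a simple check over recent per-execution costs. A minimal sketch (the factor-of-two threshold is an assumption for illustration, not an Orloj default):

```python
from statistics import median

def cost_spike(recent_costs: list[float], baseline: float, factor: float = 2.0) -> bool:
    """Flag when the median cost of recent executions exceeds
    a multiple of the established baseline."""
    return median(recent_costs) > baseline * factor

# Baseline cost per execution is $0.070; recent executions hover near $0.30.
assert cost_spike([0.29, 0.31, 0.28, 0.33, 0.30], baseline=0.070)  # sustained jump
assert not cost_spike([0.07, 0.08, 0.06], baseline=0.070)          # normal
```

Using the median rather than the mean keeps one expensive outlier from paging you; only a sustained shift trips the check.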
Cost by outcome
Successful: $0.061
Tool error (retried): $0.089
Timeout: $0.124
Authorization violation: $0.008
Success is cheapest. Retries cost more. Timeouts cost most (agent spent time thinking). Authorization violations are cheap (system caught it before wasting tokens).
Token efficiency
Input tokens per task: 2,400 avg
Output tokens: 180 avg
Input-to-output: 13.3:1
High input ratios mean you're feeding context (documents, database queries, conversation history). Low ratios mean the agent is thinking hard (generating long reasoning chains).
Neither is inherently bad. But the ratio tells you what work the agent is actually doing. If your ratio shifts from 10:1 to 20:1 without explanation, something changed. Maybe you added more documents to the context. Maybe the agent is reasoning longer. You need to know which.
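One way to catch that kind of ratio shift, sketched in Python (the 40% drift tolerance is an illustrative assumption):

```python
def io_ratio(input_tokens: int, output_tokens: int) -> float:
    """Average input-to-output token ratio."""
    return input_tokens / max(output_tokens, 1)

def ratio_shifted(current: float, baseline: float, tolerance: float = 0.4) -> bool:
    """Flag a ratio that has drifted more than `tolerance` (40%) from baseline."""
    return abs(current - baseline) / baseline > tolerance

baseline = io_ratio(2400, 180)                        # ~13.3:1, as in the numbers above
assert not ratio_shifted(io_ratio(2500, 190), baseline)
assert ratio_shifted(io_ratio(4000, 200), baseline)   # ~20:1 — investigate
```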
Cost velocity
Daily: $87
Weekly: $612
Trend: +3.2% week-over-week
30-day forecast: $2,610
Cost growing 3% weekly while load is flat? Red flag. Something is changing. Agent reasoning longer. More edge cases hitting the agent. Model temperature set too high, so it's exploring instead of committing. You need to know why. If the trend holds, you're looking at a bill north of $2,600 a month, and the growth compounds every week.
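The compounding is worth making concrete. A quick sketch using the numbers above:

```python
def forecast_weekly_spend(current: float, wow_growth: float, weeks: int) -> float:
    """Project weekly spend forward, assuming week-over-week growth compounds."""
    return current * (1 + wow_growth) ** weeks

this_week = 612.0   # weekly spend from the numbers above
growth = 0.032      # +3.2% week-over-week

# In one quarter (13 weeks), 3.2% weekly growth is roughly 50% more spend.
quarter_out = forecast_weekly_spend(this_week, growth, weeks=13)
assert quarter_out > this_week * 1.4
```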
In Orloj, every execution reports:
execution_id: exe-4729
status: success
latency_ms: 3421
token_usage:
  input_tokens: 2847
  output_tokens: 341
cost_usd: 0.089
tool_calls:
  - name: fetch_customer_data
    success: true
    latency_ms: 145
Query by time, by agent, by outcome. Build alerts on what matters.
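For example, with execution records shaped like the report above, cost by outcome is a few lines of aggregation (the records here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical execution records, shaped like the report above.
executions = [
    {"status": "success", "cost_usd": 0.061},
    {"status": "success", "cost_usd": 0.070},
    {"status": "timeout", "cost_usd": 0.124},
]

def cost_by_outcome(records: list[dict]) -> dict[str, float]:
    """Average cost per execution, grouped by outcome."""
    totals, counts = defaultdict(float), defaultdict(int)
    for r in records:
        totals[r["status"]] += r["cost_usd"]
        counts[r["status"]] += 1
    return {status: totals[status] / counts[status] for status in totals}

averages = cost_by_outcome(executions)
assert averages["timeout"] > averages["success"]   # timeouts burn the most tokens
```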
Decision Quality Proxies
You can't directly measure if the decision was right. "Right" depends on context — the user's actual intent, the business outcome, the long-term consequences — that you don't have.
But you can measure proxies that correlate strongly with decision quality:
Tool call success rate
Tool: fetch_customer_data
Success: 97.3%
Typical failure: invalid customer_id
Success latency: 145ms median
Failure latency: 187ms median (includes retries)
If an agent calls the same tool and it fails 30% of the time, something is wrong. Either the agent is calling it with invalid arguments, or the tool is unreliable, or there's a race condition.
High failure rate on a specific tool tells you that tool is a bottleneck. It's limiting the reliability of every agent that depends on it. Consider whether the tool needs fixing, or whether the agent needs to handle failures more gracefully.
Retry patterns
Executions requiring retry: 12.3%
Primary failure: tool_timeout
Secondary: transient_network_error
Avg retries: 2.1
Success after retry: 89%
Some retries are normal. Networks fail. But if 30% of executions retry, the token budget is unsustainable.
Retries on permanent errors (auth failure, malformed input)? Agent doesn't understand the tool's constraints.
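A retry policy that respects this distinction might look like the following sketch (the error names and retry budget are illustrative):

```python
# Transient errors are worth retrying; permanent errors will never succeed.
TRANSIENT = {"tool_timeout", "transient_network_error"}
PERMANENT = {"auth_failure", "malformed_input", "tool_not_found"}

def should_retry(error: str, attempt: int, max_retries: int = 3) -> bool:
    """Retry only transient failures; retrying a permanent error
    burns tokens without ever succeeding."""
    return error in TRANSIENT and attempt < max_retries

assert should_retry("tool_timeout", attempt=1)
assert not should_retry("auth_failure", attempt=1)   # permanent: fail fast
assert not should_retry("tool_timeout", attempt=3)   # retry budget exhausted
```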
Tool call distribution
fetch_customer_data: 2.1 calls/exec, 97.3% success
validate_customer_data: 1.8 calls/exec, 99.1% success
write_to_warehouse: 0.3 calls/exec, 95.8% success
Frequent tools, rare tools, expensive tools. This shows what work the agent is actually doing.
Distribution shifts dramatically? Red flag. Agent suddenly stops calling validate_customer_data. Model changed. Context changed. Investigate.
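A drift check over call rates can flag this automatically. A sketch using the tool names above, with an assumed 50% floor:

```python
def call_rates(call_counts: dict[str, int], executions: int) -> dict[str, float]:
    """Calls per execution, per tool."""
    return {tool: n / executions for tool, n in call_counts.items()}

def dropped_tools(baseline: dict[str, float], current: dict[str, float],
                  floor: float = 0.5) -> list[str]:
    """Tools whose call rate fell below `floor` (50%) of their baseline rate."""
    return [tool for tool, rate in baseline.items()
            if current.get(tool, 0.0) < rate * floor]

before = call_rates({"fetch_customer_data": 2100, "validate_customer_data": 1800}, 1000)
after = call_rates({"fetch_customer_data": 2050, "validate_customer_data": 200}, 1000)
assert dropped_tools(before, after) == ["validate_customer_data"]
```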
Error source analysis
Total attempts: 3,251
Successful: 3,122 (96.0%)
Authorization denied: 47 (1.4%)
Tool not found: 8 (0.2%)
Malformed input: 34 (1.0%)
Timeout: 40 (1.2%)
Authorization denials tell you governance is working. The agent tried to do something it shouldn't have, and the system blocked it. That's success, not failure. Your policies are functioning.
Tool not found errors suggest the agent is confused about what tools are available. Maybe the tool was deprecated and the agent wasn't updated. Maybe it's calling a tool from an old version of the catalog.
Malformed input suggests the agent doesn't understand the tool's input schema. Maybe the schema changed and the agent's knowledge is stale. Maybe the LLM is hallucinating fields that don't exist.
Timeouts suggest either the tool is slow, or your timeout is too aggressive. A 5-second timeout on a tool that takes 3 seconds 95% of the time but 8 seconds 1% of the time is going to fail on edge cases.
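You can make that concrete: given a latency sample and a timeout, the fraction of calls that will always fail is just the tail above the cutoff.

```python
def timeout_failure_rate(latencies_s: list[float], timeout_s: float) -> float:
    """Fraction of observed calls that exceed the timeout."""
    return sum(1 for s in latencies_s if s > timeout_s) / len(latencies_s)

# A tool that takes ~3s most of the time but 8s on 1% of calls,
# behind a 5-second timeout: the 1% tail always fails.
samples = [3.0] * 99 + [8.0]
assert timeout_failure_rate(samples, 5.0) == 0.01
```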
Governance Violation Rates
Orloj enforces authorization at execution. Policies from your manifest are checked before every tool call.
Track these:
Policies checked
Total tool calls: 3,251
Policies evaluated: 9,753
Policies enforced: 127
Enforcement rate: 3.9%
Not every call triggers every policy. Only calls matching policy conditions are actually evaluated. A policy that says "enforce if cost > $1000" won't evaluate on a $50 call.
Track how often policies actually fire. If you have 20 policies and only 1 ever fires, either your policies are well-scoped (they're only needed in rare cases), or they're ineffective (the conditions never match). You need to understand which.
Violations caught
Authorization denied: 47
Quota exceeded: 12
Rate limit hit: 3
Cost threshold exceeded: 1
System caught 63 dangerous actions before they reached a tool. Governance layer is active and working.
This is important: if this number is zero, either your governance is too permissive, or your agents are perfectly behaved (unlikely). Most likely, you're not catching things you should be.
Policy effectiveness
data_deletion_requires_approval
- Triggered: 23
- Approvals granted: 19
- Approvals denied: 4
- Approval rate: 82.6%
Policy is working. Agent attempts deletions, policy catches them, approvers say yes 82% of the time.
High denial rate (50%+)? Your agent misjudges when deletion is safe, or your policy is too strict.
Bypass attempts
Exceeded quota by 15%: 1
Called deprecated tool: 3
Called tool outside hours: 7
Agents try to circumvent governance. Reformat requests to bypass quota. Call old tool versions. Track these.
Latency Distribution
Agent latency is multimodal. Some executions finish in 500ms. Others spend 30 seconds thinking.
Don't use mean. Use percentiles.
P50: 2.3s
P95: 8.7s
P99: 31.2s
Max: 87.6s
Median is 2.3s. 99th is 31s. That's 13x.
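If you're computing these yourself, a nearest-rank percentile over raw latency samples avoids the trap of averaging (the sample values here are illustrative):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: no interpolation, no surprises."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

latencies = [0.5, 1.2, 2.3, 2.4, 3.1, 4.0, 8.7, 9.1, 31.2, 87.6]
p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
assert p50 < p95 <= max(latencies)   # the tail dwarfs the median
```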
Break latency by phase:
Initial thinking phase
P50: 1.2s
P95: 4.1s
Agent reasoning about the request before making any tool calls. Simple requests are fast. Complex requests requiring careful planning take longer.
If thinking latency jumps suddenly, either requests are more complex, or the model is slower (maybe prompt changed, maybe overloaded).
Tool calling loop
Iterations: 1.1 avg
P50: 0.8s per iteration
P95: 2.1s per iteration
Does the agent iterate once and commit, or loop multiple times? Looping 5+ times per request either means it's being thorough, or it's thrashing and can't decide.
Tool execution (the tools themselves)
fetch_customer_data: 145ms median
write_to_warehouse: 800ms median
Slow tools dominate overall latency. If write_to_warehouse takes 800ms, the agent can't be faster than 800ms on that path. Know your tool speeds.
Agent decision latency (model only)
P50: 0.1s
P95: 0.2s
How long for the LLM to produce a response. Pure model performance.
Alert on pattern shifts, not just absolute thresholds. If P95 jumps from 8s to 15s and stays there, something structural changed. Investigate before it gets worse.
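One way to distinguish a structural shift from a one-off spike is to require every sample in a recent window to exceed the baseline, as in this sketch (the 1.5x factor is an assumption):

```python
def sustained_shift(recent_p95s: list[float], baseline: float,
                    factor: float = 1.5) -> bool:
    """Alert only when every recent P95 sample stays above baseline * factor --
    a structural change, not a one-off spike."""
    return all(v > baseline * factor for v in recent_p95s)

baseline_p95 = 8.7
assert sustained_shift([15.1, 14.8, 15.6, 16.0], baseline_p95)   # jumped and stayed
assert not sustained_shift([15.1, 8.9, 9.2, 8.5], baseline_p95)  # one-off spike
```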
Implementing Observability in Orloj
Orloj is built with observability as a core concern, not an afterthought. Every execution emits a structured event stream. Every significant event gets logged: agent start, tool call, governance check, error, retry, agent finish.
A typical execution event looks like:
execution_id: exe-4729
timestamp: 2026-05-10T14:23:44Z
event_type: tool_call
agent: data-processor
tool: fetch_customer_data
status: success
latency_ms: 145
input_tokens: 284
output_tokens: 47
cost_usd: 0.0089
governance_result: allowed
governance_policies_evaluated:
  - data_access_quota: allowed
  - role_based_access: allowed
From these structured events, you can build powerful observability patterns:
Time series metrics: Cost per minute, tool success rate, authorization denial rate. Feed these into your existing monitoring stack to get historical trends.
Correlation analysis: When a tool fails, do agents downstream also fail? Which tool failures correlate with agent failures? Are certain tools unreliable?
Trend detection: Is token consumption growing? Is governance effectiveness declining? Are certain agents becoming more expensive over time? Catch these trends before they become crises.
Alerting: Cost exceeds your monthly budget? Error rate spikes 5x? A critical tool becomes unavailable? Alert immediately.
Orloj also exposes an HTTP metrics API so you can query programmatically:
# Cost for agent data-processor, last 7 days
curl https://orloj.example.com/api/v1/metrics/cost \
-d '{"agent": "data-processor", "since": "7d"}'
# Tool success rates, last 24 hours
curl https://orloj.example.com/api/v1/metrics/tools \
-d '{"metric": "success_rate", "since": "24h"}'
# Governance violations, grouped by policy
curl https://orloj.example.com/api/v1/metrics/governance \
-d '{"group_by": "policy"}'
Pipe these into your existing stack (Datadog, Prometheus, Grafana, CloudWatch). Build dashboards. Set up alerts. Integrate with your incident response workflow.
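For instance, rows returned by the metrics API (the response shape here is an assumption) can be rendered into Prometheus exposition format for scraping:

```python
def to_prometheus(metric: str, rows: list[dict]) -> str:
    """Render metric rows in Prometheus exposition format.
    Every key except 'value' becomes a label."""
    lines = []
    for row in rows:
        labels = ",".join(f'{k}="{v}"' for k, v in row.items() if k != "value")
        lines.append(f"{metric}{{{labels}}} {row['value']}")
    return "\n".join(lines)

rows = [{"agent": "data-processor", "value": 87.53}]
assert to_prometheus("orloj_cost_usd_total", rows) == \
    'orloj_cost_usd_total{agent="data-processor"} 87.53'
```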
What Good Observability Looks Like
Cost visibility — Know what every execution costs, what you spend per day, where the cost goes.
Decision quality signals — Tool success rates, retry patterns, error distribution tell you if the agent is reasoning well.
Governance confidence — See how many unauthorized actions the system blocks, whether policies work.
Latency context — Understand where time is spent: thinking, tool calling, execution.
Trend tracking — See behavior changes (cost growing, tool reliability declining, violations increasing). Alert on it.
Runbook data — When something breaks, observability tells you what: which tools failed, which policies fired, what the agent tried.
Without this, you're blind. You'll notice the problem when the bill arrives, not when your agent stops working.
Get started with Orloj:
- GitHub: github.com/OrlojHQ/orloj
- Docs: orloj.dev/docs
- Discord community: discord.gg/a6bJmPwGS