
Testing Agent Systems: Why Unit Tests Aren't Enough

Jon Mandraki

You've built an agent. It works in your notebook. You deploy it to production. Two hours later, on-call pages you. The agent called a tool it shouldn't have access to, or retried a destructive operation that was already committed.

Your unit tests passed. All of them. The problem: unit tests can't catch the bugs that matter in agent systems.

Why Unit Tests Break Down for Agents

Traditional unit testing assumes determinism. Same input, same output. Mock the dependencies, assert the return value, ship it.

Agents don't work that way. An agent's decision (which tool to call, with what arguments, in what order) depends on multiple sources of non-determinism:

  1. Model weights and temperature settings — Different configurations produce different outputs from the same input.
  2. Retrieved context — The context you feed the agent might differ on retry. Search results change. Database queries return different rows.
  3. Available tools and their state — If a tool is unavailable, the agent has to make different decisions.
  4. Random seed affecting sampling — The model uses sampling to generate tokens. Different seeds produce different outputs.

Same invocation, same inputs, completely different tool calls. Unit tests assume determinism. Agents don't provide it.

A passing test suite tells you nothing about whether the agent respects governance at runtime, retries safely, or recovers from tool failures without corrupting state. It tells you the agent's code compiles and runs. It tells you almost nothing about the agent's behavior in production.

You need fundamentally different tools. Tools that work with non-determinism instead of against it.

Property-Based Testing for Non-Deterministic Systems

Property-based testing stops caring what the agent decides. It verifies that regardless of the decision, certain properties hold true.

For agents, start with these:

  • Authorization: Every tool call is authorized. Unauthorized calls fail closed.
  • Idempotency: The same request twice doesn't corrupt state or create duplicates.
  • Isolation: One agent's failure doesn't cascade to another.
  • Governance: Forbidden tool calls fail. They don't succeed silently.

In Orloj, you express these properties through your manifest. Here's a simplified example:

apiVersion: orloj.io/v1
kind: Agent
metadata:
  name: data-agent
spec:
  model: claude-opus
  tools:
    - name: read_database
      authorization:
        role: viewer
    - name: write_database
      authorization:
        role: admin
    - name: delete_database
      authorization:
        role: owner

Now test the property: given this manifest, an agent with 'viewer' role never executes delete_database, even if the model explicitly requests it. This should hold true regardless of the agent's training, the current temperature setting, or the context being processed.

Test what the system prevents, not what the agent decides. This scales because you're not trying to predict agent behavior. You're verifying system boundaries.

Pair property testing with fuzzing. Generate random tool configurations, random permission matrices, random instructions. Run 10,000 scenarios. When a core property breaks (an unauthorized call slips through), you've found a boundary bug, not a reasoning bug. Boundary bugs are security issues and should block a release entirely. Reasoning bugs are annoying but recoverable.
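A minimal sketch of this style of test in plain Python. The `authorize` function, manifest dictionary, and role ranking are assumptions standing in for the real enforcement layer; the property being fuzzed mirrors the manifest above:

```python
import random

# Hypothetical mirror of the manifest above: each tool declares
# the minimum role allowed to call it.
MANIFEST = {
    "read_database": "viewer",
    "write_database": "admin",
    "delete_database": "owner",
}
ROLE_RANK = {"viewer": 0, "admin": 1, "owner": 2}

def authorize(role, tool):
    """Fail closed: unknown tools and unknown roles are denied."""
    required = MANIFEST.get(tool)
    if required is None or role not in ROLE_RANK:
        return False
    return ROLE_RANK[role] >= ROLE_RANK[required]

def fuzz_authorization(runs=10_000):
    roles = list(ROLE_RANK) + ["intern", ""]         # include junk roles
    tools = list(MANIFEST) + ["drop_all_tables", ""]  # include junk tools
    for _ in range(runs):
        role, tool = random.choice(roles), random.choice(tools)
        allowed = authorize(role, tool)
        # The property: only 'owner' ever reaches delete_database,
        # and anything outside the manifest is always denied.
        if tool == "delete_database":
            assert allowed == (role == "owner")
        if tool not in MANIFEST:
            assert not allowed

fuzz_authorization()
print("10,000 fuzzed scenarios: authorization boundary held")
```

The test never predicts which tool a fuzzed agent picks; it only asserts that the boundary holds for every pick.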

Contract Testing for Tool APIs

Your agent's tools are dependencies. They have contracts: accept these inputs, return these outputs, may have side effects. They fail in predictable ways.

Real integration tests call the real tool. You get genuine behavior but at high cost and risk. Every test run hits production infrastructure, generates real side effects, burns real money.

Fake tools that don't match the real thing are worse than useless. The agent will behave differently in test than in production. You'll ship a broken system that worked great in staging.

Contract testing sits in the middle. You define the contract once, then verify both sides of it:

tools:
  - name: create_user
    input:
      type: object
      properties:
        email:
          type: string
          format: email
        username:
          type: string
          minLength: 3
      required:
        - email
        - username
    output:
      type: object
      properties:
        user_id:
          type: string
        created_at:
          type: string
          format: date-time
      required:
        - user_id
        - created_at
    error_scenarios:
      - input: {email: "invalid"}
        expected_error: "InvalidEmailError"
      - input: {username: "ab"}
        expected_error: "UsernameTooShortError"

Your agent tests verify:

  • When the agent calls create_user, it sends valid email and username (not null, not empty).
  • The agent reads user_id from the response (not user_uuid, which is from an old version).
  • The agent retries on temporary errors (500s, timeouts), fails closed on validation errors (400s).
  • The agent handles optional response fields gracefully instead of assuming they are always present.

Your tool tests verify:

  • The real tool matches the contract in all declared scenarios.
  • Real error conditions match the declared error scenarios.
  • Side effects are idempotent (calling it twice with the same input is safe) or fail atomically (either fully succeeds or fully fails, no half-state).

Contract violations surface immediately in CI. No late-night production surprises from mismatched assumptions.
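A sketch of the tool-side check in stdlib-only Python, with no schema library. The `create_user_stub` function, exception names, and regex are assumptions; a real test would point `check_contract` at the deployed tool:

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

class InvalidEmailError(Exception): pass
class UsernameTooShortError(Exception): pass

def create_user_stub(email, username):
    """Stand-in for the real create_user tool."""
    if not EMAIL_RE.match(email):
        raise InvalidEmailError(email)
    if len(username) < 3:
        raise UsernameTooShortError(username)
    return {"user_id": "U-123", "created_at": "2026-04-28T14:23:44Z"}

def check_contract(tool):
    # Happy path: output must carry the contract's required fields.
    out = tool("alice@example.com", "alice")
    assert isinstance(out["user_id"], str)
    assert isinstance(out["created_at"], str)
    # Declared error scenarios must raise the declared errors.
    try:
        tool("invalid", "alice")
        raise AssertionError("invalid email accepted")
    except InvalidEmailError:
        pass
    try:
        tool("alice@example.com", "ab")
        raise AssertionError("short username accepted")
    except UsernameTooShortError:
        pass

check_contract(create_user_stub)
print("contract holds for all declared scenarios")
```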

Scenario Replay for Multi-Step Workflows

Agents work in steps. Fetch data, process it, write results, send confirmation. If step 2 fails mid-execution, can the agent safely retry from step 1? Or will it duplicate work? Or leave state inconsistent?

Scenario replay captures a real execution trace and replays it deterministically. You're not simulating what the agent might do. You're replaying what it actually did.

When an agent executes a workflow in production, Orloj logs each step in order:

timestamp: 2026-04-28T14:23:44Z
step: 1
action: tool_call
tool: fetch_customer_data
input: {customer_id: "C-42"}
output: {data: {...}, latency_ms: 145}
---
timestamp: 2026-04-28T14:23:45Z
step: 2
action: tool_call
tool: validate_customer_data
input: {data: {...}}
output: {valid: true}
---
timestamp: 2026-04-28T14:23:46Z
step: 3
action: tool_call
tool: write_to_warehouse
input: {data: {...}}
error: timeout_after_5s

Replay this scenario 100 times with different random seeds. Each replay should do one of three things:

  1. Succeed completely (all steps execute, same as production)
  2. Fail at exactly the same place (e.g., step 3, timeout)
  3. Fail at a different place, but only if the system is genuinely non-deterministic at that point

If some replays succeed and some fail at the same step with the same inputs, you've got a race condition. The system isn't safe to retry.
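A sketch of the replay-and-classify loop. The trace, the `flaky_executor` stub, and the 30% failure rate are assumptions; in practice the executor would be your real tool layer driven by the recorded inputs:

```python
import random
from collections import Counter

# The captured trace above, reduced to (tool, input) pairs.
TRACE = [
    ("fetch_customer_data", {"customer_id": "C-42"}),
    ("validate_customer_data", {}),
    ("write_to_warehouse", {}),
]

def flaky_executor(tool, args, rng):
    """Stub executor: the warehouse write times out 30% of the time."""
    if tool == "write_to_warehouse" and rng.random() < 0.3:
        raise TimeoutError("timeout_after_5s")

def replay(seed):
    rng = random.Random(seed)
    for step, (tool, args) in enumerate(TRACE, start=1):
        try:
            flaky_executor(tool, args, rng)
        except TimeoutError:
            return f"failed_at_step_{step}"
    return "succeeded"

# 100 replays with different seeds, classified by outcome.
outcomes = Counter(replay(seed) for seed in range(100))
print(dict(outcomes))
# Mixed outcomes at the same step with the same inputs are exactly
# the race-condition signal: the system is not safe to retry blindly.
```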

Scenario replay is critical for post-mortems. Agent made a bad call at 3am. Extract the execution trace, replay it in staging with your proposed fix, verify that the fix actually prevents the failure. This gives you confidence before deploying at 4am.

Chaos Testing for Agent Resilience

Your agent runs in a distributed system. Networks fail. Services time out. Databases go down during backups. Tools return 500s at 11pm. Circuit breakers trip. Credentials expire mid-execution.

Chaos testing deliberately introduces these failures and verifies the agent handles them gracefully. Not that it recovers perfectly, but that it fails predictably and safely.

Common scenarios to test:

  • Tool timeout: Call takes longer than the timeout. Does the agent time out and retry, or time out and fail closed?
  • Tool error: Tool returns 500. Does the agent know whether it's retryable (will work if we try again) or permanent (invalid request)?
  • Partial state: Tool succeeds at a write but fails to send the acknowledgment. Retry and create a duplicate, or recover idempotently?
  • Model unavailable: Your LLM provider is down. Agent fails fast (good) or hangs for 10 minutes trying (bad)?
  • Credential rotation: Auth token expires mid-execution. Does the agent refresh, or does it fail?

In Orloj, you declare resilience policies in the manifest:

apiVersion: orloj.io/v1
kind: Agent
metadata:
  name: data-processor
spec:
  retry:
    initial_delay_ms: 100
    max_delay_ms: 30000
    max_attempts: 5
    backoff: exponential_with_jitter
  tool_timeout_ms: 5000
  model_timeout_ms: 10000
  circuit_breaker:
    failure_threshold: 5
    success_threshold: 2
    timeout_ms: 60000
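The delay schedule this retry block implies can be computed directly. A sketch, assuming `exponential_with_jitter` means full jitter over a capped exponential delay (Orloj's exact jitter strategy may differ):

```python
import random

def backoff_delays(initial_ms=100, max_ms=30_000, attempts=5, seed=None):
    """Exponential backoff with full jitter: retry n waits a random
    amount up to min(initial_ms * 2**n, max_ms)."""
    rng = random.Random(seed)
    delays = []
    for n in range(attempts):
        cap = min(initial_ms * 2 ** n, max_ms)
        delays.append(rng.uniform(0, cap))
    return delays

# Per-attempt caps: 100, 200, 400, 800, 1600 ms -- the 30-second
# ceiling only matters once attempts grow past the configured five.
print([round(d) for d in backoff_delays(seed=1)])
```

Jitter matters here: without it, a fleet of agents retrying a recovering tool all hit it at the same instant.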

Your chaos tests verify:

  • Tool timeout triggers retry with exponential backoff, each delay capped at 30 seconds and the run abandoned after 5 attempts (not minutes of hammering a dead service).
  • Tool unhealthy (5 consecutive failures): circuit breaker opens immediately, subsequent calls fail fast (not timing out).
  • Model unavailable: agent stops trying after 10 seconds, not after 10 minutes of retries.
  • Partial state: if a tool's write succeeds but its acknowledgment times out, the agent detects this and doesn't blindly retry (avoiding duplicates).

Chaos testing catches what property tests and scenario replay can't. Those exercise expected paths; chaos tests the uncommon but devastating ones. It's the difference between "theoretically sound" and "survives 3am production incidents when everything is breaking at once."

Integration Testing with Governance

Orloj enforces governance at execution time. Your manifests declare policies. But do they actually prevent bad behavior?

Integration tests verify the boundary between agent intent and operational constraints:

apiVersion: orloj.io/v1
kind: AgentPolicy
metadata:
  name: data-deletion-requires-approval
spec:
  agents:
    - data-processor
  tools:
    - delete_database
  condition: "cost > 1000 or destructive_operation == true"
  action: require_approval
  approval_roles:
    - dba
    - platform-owner

Test:

  1. Configure agent with minimal permissions.
  2. Instruct it to delete a large dataset.
  3. Verify deletion does NOT execute; instead triggers approval.
  4. Simulate approver approving.
  5. Verify deletion now executes.

This catches governance misconfigurations that unit tests can't reach.
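The five steps above can be sketched against a toy policy engine. Everything here (`PolicyEngine`, `ApprovalRequired`, the role set) is a stand-in for the real enforcement and approval APIs:

```python
class ApprovalRequired(Exception):
    pass

class PolicyEngine:
    """Toy model of the policy above: destructive delete_database
    calls require approval from a dba or platform-owner."""
    def __init__(self):
        self.approved = set()

    def check(self, tool, request_id, destructive):
        if tool == "delete_database" and destructive:
            if request_id not in self.approved:
                raise ApprovalRequired(request_id)

    def approve(self, request_id, approver_role):
        if approver_role in {"dba", "platform-owner"}:
            self.approved.add(request_id)

engine = PolicyEngine()
executed = []

def run_deletion(request_id):
    try:
        engine.check("delete_database", request_id, destructive=True)
    except ApprovalRequired:
        return False          # blocked, pending approval
    executed.append(request_id)
    return True

assert run_deletion("req-1") is False      # step 3: blocked
assert executed == []                      # nothing actually ran
engine.approve("req-1", "dba")             # step 4: approver signs off
assert run_deletion("req-1") is True       # step 5: now executes
print("deletion blocked until a dba approved it")
```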

A Real Test Suite

Structure your test pyramid:

Unit tests — Business logic, mocked tools. They matter. They just can't catch agent-specific bugs.

Property tests — Authorization boundaries hold across all agent decisions. Fuzz configurations. Unauthorized calls always fail.

Contract tests — Agent usage matches tool contracts. Error handling works.

Scenario tests — Captured real executions. Deterministic replay. Verify determinism where it should exist.

Chaos tests — Break things deliberately. Verify graceful failure, recovery, timeouts, retries.

Integration tests — End-to-end workflows. Real tools. Real governance. Policies actually prevent bad actions.

All six layers passing: you have confidence.

Orloj's Testing Support

Orloj is designed to be testable. The system exposes structured execution logs and APIs that make each layer of testing straightforward.

Manifest-driven defaults: Retry policy, timeouts, circuit breakers are declared in YAML, not scattered across application code. This means you can change resilience behavior without redeploying the agent itself. Tests can vary these parameters and verify the agent's behavior under different conditions.

Deterministic replay: Every execution is logged completely. You can replay an execution from production in staging with deterministic results. Same inputs, same random seed, same execution trace. This is how you validate fixes before deploying them.

Policy visualization: You can query exactly what policies applied to a given execution, and why a specific tool call succeeded or failed. This is invaluable for debugging governance issues. You can ask, "Why did this tool call get blocked?" and get a precise answer.

Observability hooks: You can assert on metrics in your tests. "For this input, token consumption should be between 500 and 3000." "Latency should not exceed 10 seconds." Build tests that measure the operational characteristics of your agent, not just whether it produces a correct output.
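What such an assertion might look like, with a hypothetical metrics record standing in for whatever your observability hooks return:

```python
# Hypothetical shape; substitute your actual metrics API.
metrics = {"tokens_used": 1842, "latency_ms": 4200, "tool_calls": 3}

assert 500 <= metrics["tokens_used"] <= 3000, "token budget violated"
assert metrics["latency_ms"] <= 10_000, "latency budget violated"
print("operational assertions passed")
```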

Testing an agent system isn't fundamentally different from testing any distributed system. The agents are the primary source of non-determinism, not the infrastructure. Build your test suite accordingly, and you'll have confidence that your agent survives contact with production.

