Your data-processing agent hits a model timeout at step seven of a twelve-step workflow. The wrapper script fails silently. Nobody gets paged. By morning, there are 47 stuck jobs in a database table and no audit trail of what actually happened. That's when someone calls you.
Most teams running agents in production either hope they don't break or duct-tape monitoring onto code that wasn't designed for failure.
Why Agent Failures Are Different
Standard services fail predictably. An API times out, returns a 500, and retry logic kicks in. Requests are idempotent, so retrying is safe.
Agent workflows are messier.
An agent doesn't just fetch and return. It reasons, picks tools, uses them, evaluates results, and loops. The work is stateful. If a tool call succeeds but the agent never sees the response, you've done half a job with no clean way to resume. Retry the whole workflow and you double-process records, trigger duplicate payments, or call APIs twice.
The tools themselves have side effects. Your agent calls Slack to post a message. That succeeds. Then the agent crashes before saving the run record. You retry and post the same message again. Now you have duplicate Slack notifications.
Retry logic itself is expensive. Retrying an agent workflow means re-running compute and re-calling models, paying again for tokens you've already spent once. With services, you retry the cheap part. With agents, retries become your biggest operational cost.
What Happens Today
Most production setups treat agent failures like server errors.
The agent runs in a process or container. If it crashes or hangs, nothing happens automatically. You've set up monitoring that alerts you when runs haven't completed in five hours. You wake up at 3am, SSH into a box, check logs, manually restart the agent on stuck items, and figure out which ones processed.
If you're more sophisticated, you've got a message queue. The agent pulls a task, processes it, and marks it complete. If the agent dies mid-process, the task goes back on the queue. When the agent comes back up, it retries.
But that's a hard retry. Full reset. The agent starts over. If it died at step seven, it runs steps one through six again, wasting tokens and API calls. If those steps had unrecorded side effects, you get duplicates.
You could save checkpoints. Serialize agent state, current task, tool outputs. Restore from checkpoint and continue from step seven. The retry is cheaper.
But checkpointing is fragile. What do you checkpoint? Just outputs? Replay tool calls or trust the cached result? What if the tool isn't idempotent? What if external state changed between the crash and retry? The checkpoint is stale.
Now you're writing custom recovery logic per workflow. That becomes its own source of bugs.
How Orloj Handles It
Orloj assumes agent workflows will fail and be retried, sometimes on different workers. Reliability mechanics are built into the runtime, not bolted on.
When a worker picks up a task, it acquires a time-bounded lease (default 30 seconds). If the worker crashes or hangs, it doesn't renew the lease. The lease expires and another worker claims the task.
You don't detect crashes or ping workers. The clock handles it.
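The lease mechanic fits in a few lines. This is an illustrative model of the behavior described above, not Orloj's API: the task fields (`owner`, `lease_expires_at`) and helper names are assumptions.

```python
import time

LEASE_SECONDS = 30  # matches the default lease length described above

def try_claim(task, worker_id, now=None):
    """Claim a task if it's unowned or its lease has expired."""
    now = time.time() if now is None else now
    if task.get("owner") is None or now >= task["lease_expires_at"]:
        task["owner"] = worker_id
        task["lease_expires_at"] = now + LEASE_SECONDS
        return True
    return False

def renew(task, worker_id, now=None):
    """A healthy worker renews before expiry; a crashed one simply can't."""
    now = time.time() if now is None else now
    if task.get("owner") == worker_id and now < task["lease_expires_at"]:
        task["lease_expires_at"] = now + LEASE_SECONDS
        return True
    return False
```

Note what's missing: there is no crash detector. A worker that dies just stops calling `renew`, and any other worker's next `try_claim` succeeds once the clock passes the expiry.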
When a task fails, Orloj doesn't immediately give up. It applies capped exponential backoff with jitter: 1 second, 2 seconds, 4 seconds, up to a 5-minute cap. Jitter prevents thundering herds where all workers retry simultaneously.
But here's the key difference: agents in Orloj run in a sandbox where tool calls are tracked and idempotency is enforced. Every tool call gets a deterministic ID based on the agent's reasoning and parameters. If the same call executes twice, the second returns the cached result.
Your agent calls Slack to post a message. The call succeeds. The agent records the message ID. The agent crashes before persisting. On retry, the agent replays from the checkpoint. When it tries to post to Slack again, Orloj intercepts the call. It sees the same tool + parameters = same ID. It returns the cached result without invoking Slack a second time.
Idempotency by default.
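The mechanism can be modeled as a content-addressed cache sitting in front of tool execution. This sketch is an assumption about the shape, not Orloj's implementation: the post says real IDs also incorporate the agent's reasoning, and a real cache would be persisted rather than in-memory.

```python
import hashlib
import json

_results = {}  # call_id -> cached tool result (stands in for persisted state)

def call_id(tool, params):
    """Deterministic ID: same tool + same parameters -> same ID."""
    payload = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def execute(tool, params, run_tool):
    """Run a tool call at most once; replays return the cached result."""
    cid = call_id(tool, params)
    if cid in _results:
        return _results[cid]   # replayed call: side effect never fires twice
    result = run_tool(tool, params)
    _results[cid] = result
    return result
```

In the Slack scenario above, the retried `execute("slack.post", ...)` hashes to the ID recorded before the crash, so the interceptor hands back the stored message ID instead of posting again.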
Workflows still fail sometimes. Tool calls error. Models are unreachable. Retry budgets get exhausted. What then?
Failed tasks move to a dead-letter queue. They don't retry infinitely. You can see them, diagnose them, inspect the failure state, examine tool outputs, and check logs. Fix the problem and manually rerun, or discard it.
The dead-letter queue is observable. You know about stuck tasks before 3am. You have the context to debug them when you wake up.
A Concrete Example
Let's walk through a scenario.
You've got an agent that processes uploaded documents. It extracts text, classifies the document type, extracts structured fields based on the type, and stores the result in a database.
The workflow looks like:
1. Read file from S3
2. Run OCR (external service, sometimes slow)
3. Classify the document
4. Extract fields (via another LLM call)
5. Validate against schema
6. Write to database
In a typical setup, the agent runs in a Celery task. Steps 1-5 work. At step 6, the database is unavailable. The Celery task fails and retries. It re-reads from S3, re-runs OCR, re-classifies, re-extracts. The database comes back. The second attempt succeeds. But you've wasted tokens and hit the OCR service twice.
If you're unlucky, you've called OCR twice for the same document and have two database rows. If your database write is idempotent by design, you're safe. But that requires designing for it upfront.
In Orloj, the workflow executes the same steps. At step 6, the database write fails. The agent records the error. The task retries according to backoff.
Orloj has already recorded results from steps 1-5. When the retry happens, the agent resumes from step 6. If you retry the full workflow anyway, Orloj issues the same OCR call with the same deterministic ID. The OCR service doesn't get called twice. You get the cached result.
The agent code doesn't need to know about this. Idempotency happens at the tool-execution layer.
When the database comes back and the write succeeds, the task completes. You get a single database row with a clean audit trail.
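The resume behavior in this example reduces to recording each step's result under a stable name and skipping any step that already has one. A minimal sketch; the `record` dict stands in for Orloj's persisted run state, which is an assumption on my part:

```python
def run_workflow(steps, record):
    """steps: ordered list of (name, fn); record: results from prior attempts.

    A step that raises leaves the record intact, so the next retry
    resumes at the first step without a recorded result.
    """
    for name, fn in steps:
        if name in record:
            continue           # completed on an earlier attempt: skip
        record[name] = fn()    # an exception here fails the task as a unit
    return record
```

On the first attempt, steps 1 through 5 land in `record` and step 6 raises. The retry walks the same list, skips the five recorded steps, and re-runs only the database write.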
The Retry Question
Once you start retrying, you have to answer questions about your system's semantics.
Is your agent idempotent? You think yes because the steps look idempotent. But you haven't guaranteed it. You've assumed it.
What about your tools? If a tool sends an email or updates a record, and you retry the tool call, does it happen twice?
What about database writes? If you store the agent's output and the write fails then succeeds on retry, do you have two rows?
Traditional systems let you answer these questions at design time. With agents, the answers get tested at runtime, in production, when customer data is at stake.
Orloj enforces the answer through the runtime. Tool calls are idempotent by default. Tasks are transactional: they complete or fail as a unit. Partial progress is invisible until the task fully succeeds. Retries are safe.
You don't have to think about whether your agent is safe to retry. It's a property of the execution model.
Why This Matters at Scale
Single agent, single task — failures are easy to handle. You notice, restart, move on.
At scale (fifty concurrent agents, thousands of daily tasks), failures are normal. You stop noticing individual failures and start seeing patterns. Tasks that consistently fail on certain inputs. Workflows that deadlock under load. Retry cascades that amplify themselves.
The difference between a system that handles this gracefully and one that doesn't is the difference between an on-call load one engineer can carry and one that needs two. Between a 99.9% uptime SLO you can defend and one that gets waived at incident review.
Lease-based ownership eliminates monitoring overhead. Exponential backoff with jitter prevents thundering herds. Idempotency tracking keeps retries cheap. Dead-letter queues keep you from flying blind.
These are properties of a system built for scale from day one, not reliability bolted on later.
Orloj doesn't prevent failures. Agents still fail. But they fail predictably, they're visible, they can be retried safely, and you can sleep.