The Problem
Agents work in notebooks. Ten test requests succeed. Deploy at a thousand requests and something breaks, and you can't tell what or why. Agents are different from API servers. A server responds or it doesn't. Agents fail halfway, orphan data, cascade failures, retry forever, or hang waiting for resources. The operational surface explodes. You can't kubectl exec into a stuck agent. You can't restart and hope nothing gets duplicated. You need systems built for these failure modes.
Task Ownership and Worker Death
A worker picks up a task and dies before finishing. This happens constantly: OOM kills, segfaults, network partitions. Unless the lock is explicitly released, the task stays locked forever. Orloj uses lease-based task ownership instead. Workers claim tasks with time-bounded leases, like auto-releasing locks: renew before expiry, or the lease lapses and another worker picks the task up.
apiVersion: orloj.dev/v1
kind: AgentSystem
metadata:
  name: email-processor
spec:
  agents:
    - name: classifier
      model: gpt-4
      tools:
        - email.fetch
        - email.classify
  execution:
    leaseTimeout: 10s
    maxRetries: 3
The worker holds the lease for 10 seconds. No renewal? Another worker claims it. No deadlock. But leases are only half the solution. A worker might die after partial execution: fetch the email, start processing, crash, leave it corrupted. The next worker processes garbage.
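The claim-and-renew cycle can be sketched in a few lines. This is an illustrative model of lease semantics, not Orloj's actual API; the `TaskStore` class and its methods are hypothetical.

```python
import time

LEASE_TIMEOUT = 10.0  # seconds, matching leaseTimeout: 10s above

class TaskStore:
    """Hypothetical in-memory model of lease-based task ownership."""

    def __init__(self):
        self.leases = {}  # task_id -> (worker_id, lease_expires_at)

    def claim(self, task_id, worker_id, now=None):
        # A claim succeeds only if no live lease exists for the task.
        now = time.monotonic() if now is None else now
        holder = self.leases.get(task_id)
        if holder and holder[1] > now:
            return False  # another worker still holds a live lease
        self.leases[task_id] = (worker_id, now + LEASE_TIMEOUT)
        return True

    def renew(self, task_id, worker_id, now=None):
        # Renewal fails if the lease lapsed or was claimed by someone else:
        # the worker must stop touching the task.
        now = time.monotonic() if now is None else now
        holder = self.leases.get(task_id)
        if not holder or holder[0] != worker_id or holder[1] <= now:
            return False
        self.leases[task_id] = (worker_id, now + LEASE_TIMEOUT)
        return True

store = TaskStore()
assert store.claim("t1", "worker-a", now=0.0)      # first claim wins
assert not store.claim("t1", "worker-b", now=5.0)  # lease still live
assert store.claim("t1", "worker-b", now=11.0)     # lease lapsed: reclaimed
```

The key property: nothing ever explicitly unlocks a dead worker's task. Expiry does it for free.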
Idempotency: The Actual Solution
You can't prevent failures; make them safe to retry. Idempotent execution means running twice gives the same result. A worker crashes, the next worker retries, and the task converges without duplication or corruption. This means agent actions (tool calls, model invocations, database writes) need idempotency keys. Orloj tracks completed actions so you can reason about partial executions.
spec:
  agents:
    - name: payment-processor
      model: gpt-4-turbo
      tools:
        - stripe.charge
        - ledger.record
  execution:
    idempotencyTracking: true
Orloj logs every tool call and result. Worker crashes after stripe.charge but before ledger.record? The next worker sees stripe.charge succeeded and skips it; ledger.record runs. The payment is recorded once. This requires deterministic tools and idempotency keys. Most systems support this. If yours don't, add it before production.
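The mechanics are easy to sketch. This is a minimal illustration of idempotency-key tracking, assuming a hypothetical in-memory `completed` log; the real thing would persist the log durably.

```python
completed = {}  # idempotency_key -> cached result (persisted in a real system)

def run_once(key, tool, *args):
    """Execute a tool call at most once per idempotency key.

    A replay with the same key returns the cached result instead of
    re-executing the side effect.
    """
    if key in completed:
        return completed[key]
    result = tool(*args)
    completed[key] = result
    return result

calls = []

def stripe_charge(amount):
    # Stand-in for the stripe.charge tool; records each real execution.
    calls.append(amount)
    return {"charged": amount}

# First worker executes the charge, then crashes before ledger.record.
run_once("task-42:stripe.charge", stripe_charge, 100)
# The retrying worker replays the whole task; the charge is skipped.
result = run_once("task-42:stripe.charge", stripe_charge, 100)
assert result == {"charged": 100}
assert calls == [100]  # the side effect ran exactly once
```

The key is derived from the task and step, so a retry of the same step maps to the same cache entry while a genuinely new payment gets a fresh key.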
Cascading Failures and Isolation
Agent A calls slow tool X and waits. B and C also wait. Every worker thread blocks on X. The system grinds to a halt. One slow dependency brings down everything. Isolation prevents this. Orloj runs tools in sandboxed containers or WASM with resource limits and timeouts. A slow tool times out, fails fast, and doesn't block workers.
spec:
  agents:
    - name: researcher
      tools:
        - web.search:
            timeout: 5s
            retries: 2
            isolation: container
        - email.send:
            timeout: 30s
            isolation: wasm
web.search gets a 5-second timeout: fail fast, retry twice. If both retries fail, the task goes to the dead-letter queue. email.send gets 30 seconds in a WASM sandbox. Malicious or buggy tools can't consume unlimited resources. Timeouts are critical. Without them, you're betting dependencies never hang. They always do.
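The fail-fast behavior, minus the sandboxing, can be sketched with a thread and a deadline. This is an illustration of the timeout principle only; real isolation needs a container or WASM runtime, since a timed-out Python thread keeps running in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout(tool, timeout, *args):
    # Run the tool in a worker thread and stop waiting after `timeout` seconds.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tool, *args)
        return future.result(timeout=timeout)

def slow_search(query):
    time.sleep(0.5)  # simulates a hung dependency
    return ["result"]

try:
    call_with_timeout(slow_search, 0.05, "agent reliability")
    outcome = "completed"
except FutureTimeout:
    # The caller is freed to fail the task instead of blocking forever.
    outcome = "timed out"

assert outcome == "timed out"
assert call_with_timeout(slow_search, 2.0, "agent reliability") == ["result"]
```

Note the limitation baked into this sketch: the timeout frees the caller, not the worker thread. That gap is exactly why the config above pushes tools into containers or WASM, where the runtime can actually kill them.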
Dead-Letter Handling
A task fails three times. Your options: A) retry forever, and the backlog grows, the disk fills, the system crashes. B) give up and lose the work. C) a dead-letter queue. Orloj chooses C.
spec:
  execution:
    maxRetries: 3
    backoffMultiplier: 2
    maxBackoffCap: 300s
    deadLetterTransition: true
When a task fails, Orloj applies capped exponential backoff between retries. After the third failure, it moves the task to the dead-letter queue. You get an alert, inspect the task in the UI, see why it failed, fix the cause, and replay. Human-in-the-loop for failures. Agents handle the happy paths; Orloj makes the failures debuggable.
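The retry-then-park flow reduces to a small loop. An illustrative sketch (backoff sleeps omitted for brevity); the `dead_letter` list stands in for whatever durable queue a real system would use.

```python
MAX_RETRIES = 3  # matching maxRetries: 3 above
dead_letter = []

def execute_with_retries(task_id, run):
    """Retry a task up to MAX_RETRIES times, then park it for a human."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return run()
        except Exception as exc:
            last_error = exc  # remember why the final attempt failed
    # Retries exhausted: dead-letter instead of retrying forever or
    # silently dropping the work.
    dead_letter.append(
        {"task": task_id, "attempts": MAX_RETRIES, "error": str(last_error)}
    )
    return None

attempts = []

def always_fails():
    attempts.append(1)
    raise RuntimeError("schema mismatch")

execute_with_retries("doc-7", always_fails)
assert len(attempts) == 3          # three attempts, no more
assert dead_letter[0]["task"] == "doc-7"
```

A replay is then just calling `execute_with_retries` again on the parked task after the root cause is fixed.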
Task Ordering and Dependencies
Multi-agent systems are DAGs. A produces output. B consumes A. C consumes both. If scheduling ignores the DAG, B runs before A finishes and works with stale data.
apiVersion: orloj.dev/v1
kind: AgentSystem
metadata:
  name: document-pipeline
spec:
  agents:
    - name: parser
      model: gpt-4-vision
      tools:
        - document.extract
    - name: validator
      model: gpt-4
      tools:
        - schema.validate
      dependencies:
        - parser
    - name: summarizer
      model: gpt-4
      tools:
        - text.summarize
      dependencies:
        - parser
    - name: publisher
      model: gpt-4
      tools:
        - database.store
      dependencies:
        - validator
        - summarizer
Parser runs first. Then validator and summarizer run in parallel. Publisher waits for both. Tasks run in dependency order automatically. Publisher before validator finishes? The scheduler prevents it. A middle task fails? Downstream tasks stay blocked until it's fixed. That prevents silent inconsistency.
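The scheduling behavior for this pipeline can be reproduced with Python's standard-library topological sorter. A sketch of how a scheduler might derive run order from the `dependencies:` fields above; the dictionary maps each agent to its predecessors.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of agents it depends on, mirroring the YAML.
deps = {
    "parser": set(),
    "validator": {"parser"},
    "summarizer": {"parser"},
    "publisher": {"validator", "summarizer"},
}

ts = TopologicalSorter(deps)
ts.prepare()

waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # everything in one wave can run in parallel
    waves.append(ready)
    ts.done(*ready)                 # mark the wave finished, unblocking dependents

# parser first, then validator and summarizer together, then publisher
assert waves == [["parser"], ["summarizer", "validator"], ["publisher"]]
```

`prepare()` also raises `CycleError` on a cyclic graph, which is the same class of check a scheduler needs before accepting a pipeline definition.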
Retry Logic and Exponential Backoff with Jitter
Immediate retries are naive: the failure probably still exists. Backoff delays prevent thundering herds where all workers retry at once. Exponential backoff doubles the delay each time, but without jitter all retries land together, creating synchronized spikes.
spec:
  execution:
    maxRetries: 5
    backoffMultiplier: 2.0
    initialBackoff: 1s
    maxBackoffCap: 60s
    jitterFraction: 0.1
The five retries wait 1s, 2s, 4s, 8s, and 16s, plus jitter at each step. Jitter spreads retries across a window so recovering services don't get crushed, and maxBackoffCap keeps any delay under 60 seconds. The defaults work for most cases; adjust for your SLAs.
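The schedule those fields produce can be computed directly. The field names mirror the YAML above, but the math here is an illustrative sketch, not Orloj's implementation.

```python
import random

def backoff_schedule(retries=5, initial=1.0, multiplier=2.0, cap=60.0, jitter=0.1):
    """Delay (seconds) before each retry: capped exponential plus +/- jitter."""
    delays = []
    for attempt in range(retries):
        base = min(initial * multiplier ** attempt, cap)  # 1, 2, 4, 8, 16, capped
        spread = base * jitter                            # jitterFraction: 0.1
        delays.append(base + random.uniform(-spread, spread))
    return delays

random.seed(7)  # deterministic only for this example
delays = backoff_schedule()
bases = [1.0, 2.0, 4.0, 8.0, 16.0]
assert len(delays) == 5
# Every delay lands within 10% of its base, so retries spread out
# instead of arriving in synchronized spikes.
assert all(abs(d - b) <= b * 0.1 for d, b in zip(delays, bases))
```

Multiplying the jitter window by the base (rather than using a fixed window) keeps the relative spread constant as delays grow.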
Monitoring Without the Noise
Traditional dashboards don't tell you what's happening. "Is task X stuck?" You can't just check runtime; maybe it legitimately takes 10 minutes. You need to know: is it actively progressing (lease renewing) or stuck (lease expired)?
spec:
  observability:
    metricsExport: prometheus
    leaseRenewalMetrics: true
    taskStateChangeMetrics: true
    deadLetterAlerts: true
Orloj exports metrics: tasks in progress, dead-letter count, lease renewals per minute, retry frequency, time-to-completion. Alert on dead-letter transitions; unexpected failures need attention. Retry-count alerts catch cascading issues before the backlog explodes. Monitor task age: old tasks might be stuck, or waiting on dependencies that never complete.
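The "progressing vs. stuck" distinction is mechanical once lease expiry is tracked. A minimal sketch, assuming a hypothetical map of task IDs to lease-expiry timestamps: long runtime alone proves nothing, but a lapsed lease on an unfinished task does.

```python
def find_stuck_tasks(lease_expiry, now):
    """Return task IDs whose lease has lapsed.

    lease_expiry: {task_id: lease_expires_at}. A long-running task is fine
    as long as its worker keeps renewing; a lapsed lease means the worker
    is gone and the task is genuinely stuck.
    """
    return sorted(t for t, expires in lease_expiry.items() if expires <= now)

leases = {
    "ingest-1": 1010.0,     # renewal is current: still progressing
    "ingest-2": 995.0,      # lapsed: worker died or lost contact
    "summarize-3": 990.0,   # lapsed
}
assert find_stuck_tasks(leases, now=1000.0) == ["ingest-2", "summarize-3"]
```

This is exactly the signal a `leaseRenewalMetrics`-style export makes alertable: a nonzero stuck count pages someone, regardless of how long the tasks have been running.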
State and Consistency Across Restarts
Your agent system needs updates. A new version ships; workers restart. What happens to in-progress tasks? They should resume where they left off: not restart (expensive, and it duplicates work), and not get lost. Orloj persists state to a backing store like PostgreSQL. When a worker restarts, it claims lease-expired tasks and resumes them. Idempotency makes resuming safe.
spec:
  persistence:
    backend: postgresql
    connectionString: postgres://...
    backupSchedule: daily
  execution:
    statePersistence: true
Tasks are logged to PostgreSQL. Redeploy? Mid-execution tasks get picked up and resumed. Already partially done? Idempotent tools skip re-execution. This is why idempotency is non-negotiable. Without it, restarts are scary.
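Resume-after-restart is the lease mechanism plus the idempotency log working together. An illustrative sketch: the persisted `action_log` (a set here, a database table in reality) tells the new worker which steps already ran, so only the remainder executes.

```python
def resume(task_id, action_log, steps):
    """Replay a task's ordered steps, skipping ones already persisted.

    steps: ordered list of (step_name, fn).
    action_log: persisted set of completed step names for this task.
    Returns the names of the steps that actually executed this run.
    """
    executed = []
    for name, fn in steps:
        if name in action_log:
            continue  # completed before the restart: idempotent skip
        fn()
        action_log.add(name)  # persist completion before moving on
        executed.append(name)
    return executed

# Persisted before the deploy killed the worker mid-task:
log = {"document.extract"}
executed = resume("doc-9", log, [
    ("document.extract", lambda: None),
    ("schema.validate", lambda: None),
    ("database.store", lambda: None),
])
assert executed == ["schema.validate", "database.store"]
assert log == {"document.extract", "schema.validate", "database.store"}
```

Run the same task a third time and `resume` executes nothing: the task has converged, which is what makes redeploys boring instead of scary.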
Putting It Together: A Production Runbook
You've deployed your multi-agent system using Orloj. Here's what you actually monitor and respond to.
Dead-letter count: alert if greater than zero. Tasks have failed and need intervention.
Task age (p99): exceeds your SLA? Something is queued, not processing. Look for scaling limits or cascading failures.
Lease renewal failures: workers are crashing or can't reach the scheduler. Severity-1 page.
Retry count spikes: a flaky tool, a misconfigured permission, or a data issue. Lower severity; investigate within a day.
Downstream blocks: a task in dead-letter blocks its dependents. Fix and replay to unblock.
Runbook: 1) Alert fires; go to the dashboard. 2) Inspect dead-letter tasks to see what failed. 3) Fix the root cause. 4) Replay from the UI or CLI. 5) The task resumes idempotently and completes. Far simpler than hand-debugging multi-agent orchestration across a hundred log files.
Orloj is the orchestration plane, not the agent framework. You still need LLM libraries, tool definitions, model invocation, prompt engineering. Orloj handles the infrastructure: scheduling, reliability, governance, observability. It assumes your agents are defined and ready. That's a deliberate boundary: not reinventing frameworks, but providing the missing operational layer.
Without Orloj, you're running pre-Kubernetes containers: ad-hoc scripts, no governance, no observability. Start small. Define one system in YAML. Deploy, monitor, hit failures, debug, fix. You gain operational clarity immediately. The docs have a quickstart and real patterns. GitHub: github.com/OrlojHQ/orloj. Docs: orloj.dev/docs. Discord: discord.gg/a6bJmPwGS. Running agents reliably is hard. Orloj makes it less hard. Not magical. Engineered.