How to Run Multi-Agent Systems Without Losing Sleep (or Data)

Jon Mandraki

The Problem

Agents work in notebooks. Ten test requests succeed. Deploy to a thousand requests and something breaks, and you can't tell what or why. Agents are different from API servers. Servers respond or they don't. Agents fail halfway, orphan data, cascade failures, retry forever, or hang waiting for resources. The operational surface explodes. You can't kubectl exec into a stuck agent. You can't restart and hope nothing gets duplicated. You need systems built for these failures.

Task Ownership and Worker Death

A worker picks up a task and dies before finishing. This happens constantly: OOM kills, segfaults, network partitions. With a plain lock, the dead worker holds the task forever, because nothing is left to release it. Orloj uses lease-based task ownership. Workers claim tasks with time-bounded leases, which behave like auto-releasing locks. Renew before expiry, or the lease releases itself and another worker picks the task up.

apiVersion: orloj.dev/v1
kind: AgentSystem
metadata:
  name: email-processor
spec:
  agents:
    - name: classifier
      model: gpt-4
      tools:
        - email.fetch
        - email.classify
  execution:
    leaseTimeout: 10s
    maxRetries: 3

The worker holds the lease for 10 seconds. No renewal? Another worker claims the task. No deadlock. But leases are only half the solution. A worker might die after partial execution: fetch the email, start processing, crash, and leave it half-processed. The next worker processes garbage.
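To make the lease mechanics concrete, here is a minimal in-memory sketch in Python. It is not Orloj's implementation or API: `LeaseTable`, `claim`, and `renew` are hypothetical names, and a real system would keep leases in a shared store rather than a dict.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseTable:
    """In-memory stand-in for a lease store (hypothetical, not Orloj's API)."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.leases: Dict[str, Lease] = {}

    def claim(self, task_id: str, worker: str) -> bool:
        """Claim the task unless another worker holds an unexpired lease."""
        now = self.clock()
        lease = self.leases.get(task_id)
        if lease is not None and lease.expires_at > now and lease.owner != worker:
            return False
        self.leases[task_id] = Lease(worker, now + self.timeout_s)
        return True

    def renew(self, task_id: str, worker: str) -> bool:
        """Extend the lease, but only for the current owner and only before expiry."""
        now = self.clock()
        lease = self.leases.get(task_id)
        if lease is None or lease.owner != worker or lease.expires_at <= now:
            return False
        lease.expires_at = now + self.timeout_s
        return True
```

A dead worker simply stops calling `renew`. Once the clock passes `expires_at`, the next `claim` from any other worker succeeds, and a late `renew` from the original owner is rejected.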

Idempotency: The Actual Solution

You can't prevent failures. Make them safe to retry. Idempotent execution means running twice gives the same result as running once. A worker crashes, the next worker retries, and the task converges without duplication or corruption. This means agent actions (tool calls, model invocations, database writes) need idempotency keys. Orloj tracks completed actions so you can reason about partial executions.

spec:
  agents:
    - name: payment-processor
      model: gpt-4-turbo
      tools:
        - stripe.charge
        - ledger.record
  execution:
    idempotencyTracking: true

Orloj logs every tool call and result. Worker crashes after stripe.charge but before ledger.record? The next worker sees that stripe.charge succeeded and skips it. ledger.record runs. Payment recorded once. This requires deterministic tools and idempotency keys. Most systems support them. If yours doesn't, add support before production.
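The skip-what-already-succeeded behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not Orloj's tracking code: `ActionLog`, `run_once`, and the `task_id:tool` key format are hypothetical, and the log lives in memory instead of a durable store.

```python
from typing import Any, Callable, Dict


class ActionLog:
    """Records completed actions by idempotency key (hypothetical helper)."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}

    def run_once(self, key: str, action: Callable[[], Any]) -> Any:
        # A recorded result means an earlier attempt finished this step: skip it.
        if key in self.results:
            return self.results[key]
        result = action()
        self.results[key] = result
        return result


def process_payment(log: ActionLog, task_id: str,
                    charge: Callable[[], str],
                    record: Callable[[str], Any]) -> None:
    """Charge, then record. Safe to call again after a crash between the two."""
    charge_id = log.run_once(f"{task_id}:stripe.charge", charge)
    log.run_once(f"{task_id}:ledger.record", lambda: record(charge_id))
```

One gap remains in this sketch: a crash after `action()` returns but before the result is stored would run the action again. That is why the tool itself should also accept the idempotency key, so the second call becomes a no-op on the provider's side.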

Cascading Failures and Isolation

Agent A calls slow tool X and waits. B and C call X and wait too. Every worker thread blocks on X. The system grinds to a halt. One slow dependency brings down everything. Isolation prevents this. Orloj runs tools in sandboxed containers or WASM with resource limits and timeouts. A slow tool times out, fails fast, and doesn't block workers.

spec:
  agents:
    - name: researcher
      tools:
        - web.search:
            timeout: 5s
            retries: 2
            isolation: container
        - email.send:
            timeout: 30s
            isolation: wasm

web.search: 5-second timeout, fail fast, retry up to twice. When the retries are exhausted, the task goes to dead-letter. email.send: 30 seconds in a WASM sandbox. Malicious or buggy tools can't consume unlimited resources. Timeouts are critical. Without them, you're betting that dependencies never hang. They always do.
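A rough Python sketch of the timeout-and-retry behavior, assuming `retries: 2` means one initial attempt plus two retries (the config above doesn't spell this out). `call_with_timeout` and `ToolFailed` are hypothetical names, not part of Orloj.

```python
import concurrent.futures as cf
from typing import Any, Callable


class ToolFailed(Exception):
    """Raised when every attempt timed out."""


def call_with_timeout(tool: Callable[[], Any], timeout_s: float, retries: int) -> Any:
    attempts = retries + 1  # assumption: first try plus the configured retries
    pool = cf.ThreadPoolExecutor(max_workers=attempts)
    try:
        for _ in range(attempts):
            future = pool.submit(tool)
            try:
                return future.result(timeout=timeout_s)
            except cf.TimeoutError:
                continue  # stop waiting; the abandoned call keeps running in its thread
        raise ToolFailed(f"tool timed out on all {attempts} attempts")
    finally:
        pool.shutdown(wait=False)  # don't block the caller on hung calls
```

Note what this sketch can't do: a Python thread can't be killed, so a hung call keeps its thread and whatever resources it holds. The caller is unblocked, but the work isn't stopped. Stopping it for real takes process-level isolation such as the container and WASM sandboxes described above.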

Dead-Letter Handling

A task fails three times. Options: A) retry forever, so the backlog grows, the disk fills, and the system crashes. B) give up and lose the work. C) move it to a dead-letter queue. Orloj chooses C.

spec:
  execution:
    maxRetries: 3
    backoffMultiplier: 2
    maxBackoffCap: 300s
    deadLetterTransition: true

Between attempts, Orloj applies capped exponential backoff. After the third failure, the task moves to dead-letter. You get an alert, inspect the task in the UI, see why it failed, fix the cause, and replay. Human-in-the-loop for failures. Agents handle happy paths. Orloj makes the failures debuggable.
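The dead-letter decision itself is small. Here is a minimal Python sketch, assuming a task is parked after its third failure as described above. `drain` is a hypothetical helper, and backoff between attempts is left out to keep the example short.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def drain(queue: Deque[str], handler: Callable[[str], None],
          max_retries: int) -> List[Tuple[str, str]]:
    """Run every task; park tasks that exhaust their attempts instead of requeueing."""
    dead_letter: List[Tuple[str, str]] = []
    failures: Dict[str, int] = {}
    while queue:
        task = queue.popleft()
        try:
            handler(task)
        except Exception as exc:
            failures[task] = failures.get(task, 0) + 1
            if failures[task] >= max_retries:
                # Keep the error with the task so a human can see why it failed.
                dead_letter.append((task, repr(exc)))
            else:
                queue.append(task)
    return dead_letter
```

The loop always terminates because a failing task is either requeued a bounded number of times or parked. Replay is then just putting a dead-letter task back on the queue once the root cause is fixed.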

Task Ordering and Dependencies

Multi-agent systems are DAGs. A produces output. B consumes A's output. C consumes both. If scheduling ignores the DAG, B can run before A finishes and work with stale or missing data.

apiVersion: orloj.dev/v1
kind: AgentSystem
metadata:
  name: document-pipeline
spec:
  agents:
    - name: parser
      model: gpt-4-vision
      tools:
        - document.extract
    - name: validator
      model: gpt-4
      tools:
        - schema.validate
      dependencies:
        - parser
    - name: summarizer
      model: gpt-4
      tools:
        - text.summarize
      dependencies:
        - parser
    - name: publisher
      model: gpt-4
      tools:
        - database.store
      dependencies:
        - validator
        - summarizer

Parser runs first. Then validator and summarizer run in parallel. Publisher waits for both. Tasks run in dependency order automatically. Publisher before validator finishes? The system prevents it. A middle task fails? Everything downstream stays blocked until it's fixed. This prevents silent inconsistency.
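To see how that ordering falls out of the dependency lists, here is a small sketch using Python's standard-library `graphlib`. It only illustrates the scheduling idea and says nothing about how Orloj implements it; `execution_waves` is a hypothetical helper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group agents into waves; every agent in a wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # agents whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unblock dependents
    return waves


# Same shape as the document-pipeline config: agent -> agents it depends on.
pipeline = {
    "parser": set(),
    "validator": {"parser"},
    "summarizer": {"parser"},
    "publisher": {"validator", "summarizer"},
}

print(execution_waves(pipeline))
# [['parser'], ['summarizer', 'validator'], ['publisher']]
```

If a wave's agent fails and `done` is never called for it, its dependents never become ready, which is the "downstream blocked" behavior in miniature.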

Retry Logic and Exponential Backoff with Jitter

Immediate retries are naive. The failure probably still exists. Backoff delays give the failing dependency time to recover. Exponential backoff doubles the delay after each attempt, but without jitter every worker that failed at the same moment also retries at the same moment, creating synchronized spikes: the thundering herd.

spec:
  execution:
    maxRetries: 5
    backoffMultiplier: 2.0
    initialBackoff: 1s
    maxBackoffCap: 60s
    jitterFraction: 0.1

Retry delays: 1s, 2s, 4s, 8s, 16s (plus jitter at each). Jitter spreads retries across a window so recovering services don't get crushed. With five retries the 60-second cap is never reached. It starts to matter at a seventh retry, where the uncapped delay would be 64s. Defaults work in most cases. Adjust for your SLAs.
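The schedule is easy to compute by hand, but a short Python sketch makes the cap and the jitter explicit. One assumption to flag: the config doesn't define how `jitterFraction` is applied, so this sketch treats it as a symmetric offset of up to ±10% of the capped delay. `backoff_delays` is a hypothetical helper, not Orloj's code.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int = 5, initial: float = 1.0,
                   multiplier: float = 2.0, cap: float = 60.0,
                   jitter_fraction: float = 0.1,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the delay in seconds before each retry."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        base = min(initial * multiplier ** attempt, cap)  # exponential, then capped
        # Assumed semantics: shift the delay by up to +/- jitter_fraction of base.
        jitter = base * jitter_fraction * rng.uniform(-1.0, 1.0)
        delays.append(base + jitter)
    return delays


print(backoff_delays(jitter_fraction=0.0))
# [1.0, 2.0, 4.0, 8.0, 16.0]
```

Calling it with `max_retries=8` and no jitter ends in `60.0, 60.0`, which shows where the cap takes over from the doubling.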

Monitoring Without the Noise

Traditional dashboards don't tell you what's happening. "Is task X stuck?" You can't just check runtime. Maybe it legitimately takes 10 minutes. What you need to know: is it actively progressing (lease renewing) or stuck (lease expired)?

spec:
  observability:
    metricsExport: prometheus
    leaseRenewalMetrics: true
    taskStateChangeMetrics: true
    deadLetterAlerts: true

Orloj exports metrics: tasks in progress, dead-letter, lease renewals per minute, retry frequency, time-to-completion. Alert on dead-letter transitions (unexpected issues need attention). Retry count alerts catch cascading issues before the backlog explodes. Monitor task age: old tasks might be stuck or waiting on dependencies that never complete.

State and Consistency Across Restarts

The agent system needs updates. A new version ships. Workers restart. What happens to in-progress tasks? They should resume where they left off. Restarting from scratch is expensive and duplicates work, and losing them is worse. Orloj persists task state to backing stores like PostgreSQL. Worker restarts? It claims lease-expired tasks and resumes them. Idempotency makes resuming safe.

spec:
  persistence:
    backend: postgresql
    connectionString: postgres://...
    backupSchedule: daily
  execution:
    statePersistence: true

Tasks logged to PostgreSQL. Redeploy? Mid-execution tasks picked up and resumed. Already partially done? Idempotent tools skip re-execution. This is why idempotency is non-negotiable. Without it, restarts are scary.

Putting It Together: A Production Runbook

You've deployed your multi-agent system using Orloj. Here's what you actually monitor and respond to.

Dead-letter count: Alert if > zero. Tasks failed, need intervention.

Task age (p99): Exceeds SLA? Something's queued, not processing. Scaling limits or cascading failures.

Lease renewal failures: Workers crashing or can't contact scheduler. Severity-1 page.

Retry count spikes: Flaky tool, misconfigured permission, or data issue. Check next day.

Downstream blocks: Task in dead-letter blocks dependencies. Fix and replay to unblock.

Runbook: 1) Alert fires, go to dashboard. 2) Inspect dead-letter tasks, see what failed. 3) Fix root cause. 4) Replay from UI/CLI. 5) Task resumes idempotently, completes. Way simpler than hand-debugging multi-agent orchestration across a hundred log files.

Orloj is the orchestration plane, not the agent framework. You still need LLM libraries, tool definitions, model invocation, and prompt engineering. Orloj handles infrastructure: scheduling, reliability, governance, observability. It assumes agents are defined and ready. The boundary is deliberate: not reinventing frameworks, but providing the missing operational layer.

Without Orloj, you're running pre-Kubernetes containers: ad-hoc scripts, no governance, no observability. Start small. Define one system in YAML. Deploy, monitor, hit failures, debug, fix. Gain operational clarity immediately. Docs have quickstart and real patterns. GitHub: github.com/OrlojHQ/orloj. Docs: orloj.dev/docs. Discord: discord.gg/a6bJmPwGS. Running agents reliably is hard. Orloj makes it less hard. Not magical. Engineered.