
What Is Orloj? A Technical Overview for Platform Engineers

Jon Mandraki

Orloj is an open-source orchestration plane for multi-agent AI systems. You define agents, tools, policies, and workflows in YAML. Orloj handles scheduling, execution, governance, and reliability.

It's not an agent framework, model provider, or managed service. It's the infrastructure layer between agents and production. Think of it as Kubernetes for agents: declarative manifests, a control plane, and workers that execute the work.

The Problem

Running agents in production looks like running containers before Kubernetes: ad-hoc scripts, no governance, no observability, no standard way to manage an agent fleet.

Platform teams build bespoke orchestration. You wire up scheduling, add policy layers, add retry logic, add tracing. You code it, version it separately from agent definitions, and maintain two sources of truth.

Orloj consolidates this. You declare your agent system once in YAML. The same manifest describes agents, models, tools, policies, and workflows. Deploy once. Orloj handles the rest.

Core Resource Model

Five resource types make up an Orloj system: Agent, AgentSystem, Tool, Policy, and WorkflowTemplate. You don't need all five to start, but they form the foundation of how Orloj thinks about infrastructure.

Agent defines an LLM-backed executable. It specifies which model to use, the system prompt, which tools it can call, and error-handling behavior. An Agent is stateless — no schedule or persistent state. Agents run only when invoked.

AgentSystem composes agents into a directed graph. You wire agents together declaratively, specifying which outputs feed into which inputs and under what conditions. An AgentSystem might chain a routing agent to specialized agents, or run them in parallel and merge results. The graph is explicit in the manifest.
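To make the directed-graph idea concrete, here is a minimal Python sketch. It is not Orloj's manifest format or its internals. The names `Edge`, `next_agents`, and the agent names are hypothetical, chosen only to show edges with optional conditions and how the next agents to run could be derived from one agent's output.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Edge:
    """A directed edge between two agents, with an optional condition on the output."""
    source: str
    target: str
    condition: Optional[Callable[[dict], bool]] = None  # None means "always follow"


def next_agents(edges: list[Edge], finished: str, output: dict) -> list[str]:
    """Return the agents to run after `finished` produced `output`."""
    return [
        e.target
        for e in edges
        if e.source == finished and (e.condition is None or e.condition(output))
    ]


# Hypothetical system: a router hands off to one specialist based on its output,
# and both specialists feed a shared summarizer.
EDGES = [
    Edge("router", "billing", lambda out: out.get("topic") == "billing"),
    Edge("router", "support", lambda out: out.get("topic") == "support"),
    Edge("billing", "summarizer"),
    Edge("support", "summarizer"),
]

if __name__ == "__main__":
    print(next_agents(EDGES, "router", {"topic": "billing"}))
```

Because the edges are plain data, the whole graph can be read, diffed, and validated before anything runs, which is the property the declarative manifest is after.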

Tool declares an executable endpoint: an HTTP API, webhook, database query, or local command. Tools live outside Orloj. Orloj knows how to invoke them, what inputs they expect, and what outputs they produce. When Agent A is declared to call Tool X, Orloj enforces that boundary at execution time.

Policy expresses authorization constraints. A policy says "this agent can call these tools" or "only requests tagged with customer-id=123 can run this workflow." Policies fail closed — unauthorized calls are rejected before reaching the tool. Policies are composable and evaluated at execution, not bolted on.

WorkflowTemplate defines a reusable invocation pattern for an AgentSystem. It specifies input parameters, output expectations, timeout behavior, and retry strategy. You can instantiate the same template multiple times with different parameters and get independent execution contexts.

This all lives in YAML, no code required.

Architecture: Server and Workers

Orloj uses a two-tier architecture: a control plane server and stateless workers.

The server manages manifests (parse, validate, store), manages the task queue, tracks state, enforces policies, and exposes an API to invoke workflows. It doesn't execute agents. It coordinates.

Workers pull tasks from the queue, execute agents, call tools, handle retries, and report results. Workers are stateless. You can run one on your laptop or scale to hundreds in Kubernetes. Workers hold time-bounded leases on tasks — if a worker crashes, the lease expires and the server re-queues.

When you invoke a workflow, the server enqueues root tasks. A worker picks one up, executes the agent, evaluates the graph, and enqueues downstream tasks. The worker reports back. The server's queue is the spine.
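The lease mechanism can be sketched in a few lines of Python. This is an in-memory illustration under assumptions, not Orloj's queue: the `LeaseQueue` class, its method names, and the 30-second default are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LeaseQueue:
    """In-memory task queue where each claimed task carries a time-bounded lease."""
    lease_seconds: float = 30.0
    pending: list = field(default_factory=list)  # task IDs waiting for a worker
    leases: dict = field(default_factory=dict)   # task ID -> lease expiry time

    def enqueue(self, task_id: str) -> None:
        self.pending.append(task_id)

    def claim(self, now: Optional[float] = None) -> Optional[str]:
        """Hand the oldest pending task to a worker and start its lease."""
        now = time.monotonic() if now is None else now
        if not self.pending:
            return None
        task_id = self.pending.pop(0)
        self.leases[task_id] = now + self.lease_seconds
        return task_id

    def complete(self, task_id: str) -> None:
        """The worker reported a result, so the lease is dropped."""
        self.leases.pop(task_id, None)

    def requeue_expired(self, now: Optional[float] = None) -> list:
        """Move tasks whose lease has expired back to pending (e.g. a crashed worker)."""
        now = time.monotonic() if now is None else now
        expired = [t for t, deadline in self.leases.items() if deadline <= now]
        for task_id in expired:
            del self.leases[task_id]
            self.pending.append(task_id)
        return expired
```

The point of the design: a crashed worker never has to announce its own failure. The server only needs to call something like `requeue_expired` periodically, and the task becomes claimable again.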

This design has concrete consequences:

  • Horizontal scalability — Workers are replaceable. Add workers to add capacity.
  • Fault tolerance — Worker death is operational, not a correctness bug. Leases prevent stuck tasks from poisoning the queue.
  • Auditability — Every task, execution, and policy decision goes through the server and is logged. Complete audit trail.
  • Testability — Run the server and workers on your laptop, Docker, or Kubernetes. Same interface.

Scheduling happens via the work queue and lease system, not cron or Kubernetes CronJobs. For daily workflows, wire a WorkflowTemplate to an external scheduler, or use a scheduler template (roadmap) within Orloj.

Governance Is Enforced

Policies are evaluated at task execution time, before the agent runs, before any tool call.

Say you have a policy: "only engineering teams can call the deployment tool." When a worker picks up a task for an agent that would call it, the server evaluates the policy first. If the caller isn't in engineering, the task transitions to a policy_denied state. The agent doesn't run. The tool doesn't get called. No back-and-forth, no silent failures.

Policies can reference:

  • Agent identity — which agent is acting
  • Requester identity — who requested the workflow
  • Tool identity — which tool is being called
  • Context — tags, labels, or metadata on the workflow
  • Time — policies can be time-dependent

Compose policies using boolean logic. One says "platform team only." Another says "not during maintenance windows." Attach both to the same workflow and they both must pass.
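Here is a minimal Python sketch of fail-closed evaluation with AND-composition. It is an illustration, not Orloj's policy engine. The `Request` fields, the two example policies, the maintenance window hours, and the choice to deny when no policy is attached are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    agent: str           # which agent is acting
    requester_team: str  # who requested the workflow
    tool: str            # which tool is being called
    hour_utc: int        # hour of day, 0-23


Policy = Callable[[Request], bool]


def platform_team_only(req: Request) -> bool:
    return req.requester_team == "platform"


def outside_maintenance_window(req: Request) -> bool:
    # Hypothetical maintenance window: 02:00-04:00 UTC.
    return not (2 <= req.hour_utc < 4)


def is_allowed(policies: list[Policy], req: Request) -> bool:
    """Fail closed: no policies, a False result, or an exception all deny."""
    if not policies:
        return False
    for policy in policies:
        try:
            if not policy(req):
                return False
        except Exception:
            return False  # an evaluation error is treated as a denial
    return True
```

The only path to `True` is every attached policy returning `True`. Anything unexpected, including a bug in a policy, lands on the deny side.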

This differs from most agent frameworks, which treat authorization as a library or post-execution check. In Orloj, governance is a runtime concern. You can't bypass it.

Reliability Patterns Built In

Orloj ships with patterns you'd have to build yourself elsewhere.

Retry with jitter — If a tool call fails from transient errors, the worker retries automatically. Configure per Tool: which errors to retry, how many times, what backoff. Jitter prevents thundering herds when multiple tasks retry at once.
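As a sketch of what retry with jitter looks like, the Python below uses exponential backoff with "full jitter" (a random delay between zero and the capped exponential bound). The source does not say which jitter strategy Orloj uses, so the strategy, the `TransientError` class, and the default values here are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Marks a failure that is worth retrying (hypothetical classification)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only TransientError, up to max_attempts total tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller handle it
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Randomizing the delay spreads retries out in time, so many tasks that failed together do not all hit the recovering tool at the same instant.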

Idempotency tracking — Tool calls are tracked by request ID. If a tool is called twice with the same ID, it knows it's a retry and behaves accordingly: return cached result, skip side effects. Insurance against duplicate work.
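One way to picture request-ID tracking is a thin wrapper that caches results by ID. This Python sketch is illustrative only: `IdempotentTool`, `send_email`, and the in-memory cache are hypothetical, and a real system would need durable storage for the cache.

```python
from typing import Any, Callable


class IdempotentTool:
    """Wraps a side-effecting function so repeated request IDs reuse the first result."""

    def __init__(self, fn: Callable[[dict], Any]) -> None:
        self._fn = fn
        self._results: dict[str, Any] = {}  # request ID -> cached result

    def call(self, request_id: str, payload: dict) -> Any:
        if request_id in self._results:
            return self._results[request_id]  # retry: skip the side effect
        result = self._fn(payload)
        self._results[request_id] = result
        return result


# Hypothetical tool with a visible side effect.
sent: list[str] = []


def send_email(payload: dict) -> str:
    sent.append(payload["to"])
    return f"sent to {payload['to']}"


tool = IdempotentTool(send_email)
```

Calling `tool.call("req-1", ...)` twice sends one email and returns the same result both times. A new request ID is treated as new work.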

Dead-letter handling — If a task fails after retries, it goes to a dead-letter queue instead of buried logs. You can examine, fix, and replay it.
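The examine, fix, and replay cycle can be sketched like this in Python. It is a toy model, not Orloj's dead-letter queue: `DeadLetterQueue`, `run_task`, and the rule that an entry is removed only after a successful replay are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeadLetterQueue:
    """Holds tasks that exhausted their retries, with the last error for inspection."""
    entries: dict = field(default_factory=dict)  # task ID -> (payload, error text)

    def add(self, task_id: str, payload: dict, error: object) -> None:
        self.entries[task_id] = (payload, repr(error))

    def replay(self, task_id: str, handler: Callable[[dict], object]) -> object:
        """Re-run a dead-lettered task; remove it only if the handler succeeds."""
        payload, _ = self.entries[task_id]
        result = handler(payload)  # raises again if the fix did not work
        del self.entries[task_id]
        return result


def run_task(task_id, payload, handler, dlq, max_attempts=3):
    """Try the handler up to max_attempts times, then dead-letter the task."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return handler(payload)
        except Exception as exc:
            last_error = exc
    dlq.add(task_id, payload, last_error)
    return None
```

Keeping the payload next to the error is what makes replay possible: once the underlying problem is fixed, the same input can be run again without reconstructing it from logs.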

Lease expiration — A worker holds a time-bounded lease on a task. If it crashes, the lease expires and the server re-queues. Prevents hung workers from blocking workflows.

Graceful cancellation — WorkflowTemplates have a timeout. If execution exceeds it, the server signals all running tasks to cancel. Workers stop cleanly.
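Cooperative cancellation on timeout can be sketched with a thread and an event flag. This Python example is a generic illustration of the pattern, not how Orloj signals its workers. The function names and step timings are made up.

```python
import threading
import time


def worker(cancel: threading.Event, done: list, steps: int = 1000) -> None:
    """Do work in small steps, checking for a cancel signal between steps."""
    for step in range(steps):
        if cancel.is_set():
            done.append(f"cancelled at step {step}")  # clean-up point
            return
        time.sleep(0.01)  # stand-in for one unit of work
    done.append("finished")


def run_with_timeout(timeout_s: float, steps: int = 1000) -> str:
    """Start the worker and signal cancellation if it outlives the timeout."""
    cancel = threading.Event()
    done: list = []
    t = threading.Thread(target=worker, args=(cancel, done, steps))
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        cancel.set()  # ask the worker to stop
        t.join()      # wait for it to exit cleanly
    return done[0]
```

The worker is asked to stop rather than killed, so it gets a chance to finish its current step and clean up before exiting.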

These are part of the core execution model, not bolted on.

Deployment Models

Orloj supports several deployment models.

Local development — Run the server and worker on your laptop. Version-control your YAML manifest. Make changes, restart, test. Like Docker Compose.

Single-server — Run the server and workers on one machine or VPS. Good for modest scale. The server runs on existing infrastructure. Workers are processes.

Kubernetes — Run the server as a Deployment with persistent storage (etcd or Postgres). Run workers as a Deployment that scales by queue depth. Native Kubernetes observability and resource management. Just containers and service definitions.
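Scaling by queue depth usually comes down to a small calculation. The Python sketch below shows one common form, assuming a target number of queued tasks per worker and fixed lower and upper bounds. The function name, parameters, and defaults are hypothetical, not Orloj or Kubernetes settings.

```python
import math


def desired_workers(
    queue_depth: int,
    tasks_per_worker: int,
    min_workers: int = 1,
    max_workers: int = 100,
) -> int:
    """Target replica count: enough workers to cover the queue, within bounds."""
    if tasks_per_worker < 1:
        raise ValueError("tasks_per_worker must be at least 1")
    needed = math.ceil(queue_depth / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))
```

With 23 queued tasks and a target of 5 per worker, this asks for 5 workers. An empty queue falls back to the minimum, and a very deep queue is capped at the maximum.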

VPS / self-hosted — Install Orloj binaries and run them as systemd units. The server needs persistent storage (local disk or volume). Workers don't.

All modes use the same manifest and API. Operational patterns are identical. Choose based on your infrastructure and operational readiness; Orloj itself behaves the same in each.

Design Decisions

Why YAML?

YAML is version-controllable and language-agnostic. Teams don't lock into a particular SDK or language. DevOps engineers who don't know Python can read and modify Orloj manifests. YAML has a learning curve, but your infrastructure team has already learned it.

Why server/worker?

Monoliths are simpler to build but don't scale horizontally. Server/worker splits the constraint: the server owns coordination and state, backed by persistent storage, while stateless workers scale out to handle load. The separation also provides a clean audit trail, because everything flows through the server.

Why fail-closed?

In production, failing open (allowing a denied action) is catastrophic. Failing closed (rejecting an action that should have been allowed) is inconvenient. We chose inconvenience. If a policy denies a tool call, it's denied. Grant access by changing the policy and redeploying. That's a deliberate change, not an accidental bypass.

Why declarative?

Declarative systems have a single source of truth and clean diff semantics. You read the manifest to answer "what should be running?" You run git diff to answer "what changed?" Imperative systems scatter truth across API calls and mutations. Declarative is harder to build but easier to operate.

What Orloj Doesn't Do

Orloj is not an agent framework. Don't use it to train models or write agent logic. Use LangChain, AutoGen, or raw SDK calls. Orloj orchestrates agents you've already built.

Orloj is not a model provider. It doesn't run models. It calls APIs (your own, OpenAI, Anthropic) on behalf of agents. You choose the model, provider, and API key. Orloj invokes it.

Orloj is not a managed service. It's open-source and self-hosted. You run the server and workers, manage the database and storage. We provide code, not ops.

Orloj is not an agent IDE or notebook environment. No web UI for visual agent building (not yet). Write YAML, commit to git, deploy.

Orloj does not provide model observability beyond task-level metrics. You get execution logs, task state transitions, policy decisions, and retry counts. No LLM token tracking or cost attribution (that would couple Orloj to specific providers). Use native provider observability for that.

When to Use Orloj

Good fit:

  • You're running multiple agents in production and need coordination and governance.
  • You want agent definitions version-controlled with your infrastructure.
  • Your team uses Kubernetes or has ops experience with stateless services.
  • You care about audit trails and compliance.
  • You need to scale agent execution without scaling code.

Not a fit:

  • You're prototyping agents in a notebook. Use an agent framework, then graduate to Orloj for production.
  • You need a fully managed service.
  • Your agents are simple one-shot calls with no coordination. A cron job might be enough.
  • You need visual workflow building. Orloj is YAML-first.

Getting Started

The quickstart is in the docs. Clone the repo, run orloj server, run orloj worker in another terminal, deploy the example manifest, invoke a workflow. Five minutes.

Manifests are self-documenting. Read through an example Agent, AgentSystem, and WorkflowTemplate. YAML structure makes the operational model explicit.

When evaluating Orloj, answer:

  1. Do you have agents in production you're managing ad-hoc?
  2. Does your team have Kubernetes or Docker Compose experience?
  3. Do you need governance and observability?

Yes to all three means Orloj is worth evaluating. Building agents for the first time? Start with an agent framework. You'll know when you need Orloj.

The Discord community is active. GitHub issues are the backlog. The roadmap is public. No sales team, no free-tier lock-in, no surprise pricing. Open-source infrastructure. Use it if it fits your problem.
