
Introducing Orloj: Agent Infrastructure as Code

Jon Mandraki

Building reliable, governed multi-agent systems is painful. Right now, most production agent deployments are a mess of brittle scripts and vendor lock-in. So I built Orloj: agent infrastructure as code.

I've built distributed systems for most of my career. Kubernetes failure analysis with kroot, blockchain node infrastructure at LinkPool, smart contract security tooling at Drosera. Every one of those domains eventually hit the same point: the thing that needed building wasn't a better application, it was better infrastructure underneath the applications.

AI agents hit that same point. The frameworks handle building agents fine. Everything around them is the problem: the permissions system someone hacked together in a week, the retry logic copy-pasted from Stack Overflow, the "governance layer" that's three if statements and a Slack webhook.

What Orloj actually is

Orloj is an open-source orchestration plane for multi-agent AI systems. You define agents, their tools, their permissions, and their workflows in declarative YAML manifests. The runtime handles scheduling, execution, governance enforcement, and reliability.

apiVersion: orloj.dev/v1
kind: Agent
metadata:
  name: compliance-checker
spec:
  model: gpt-4
  tools:
    - name: scan-documents
      permissions:
        - read: compliance_docs
    - name: flag-issues
      permissions:
        - write: issue_tracker
  governance:
    maxTokensPerMinute: 8000
    requireApproval:
      - write: issue_tracker
    auditLog: true

That manifest is the source of truth. Version it, review it, diff it in a PR, audit it when regulators come knocking. The runtime enforces everything declared there. An agent that tries to access a tool it doesn't have permission for gets denied at the execution layer, before the call even reaches the tool. Fail-closed, not fail-silent.
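To make "fail-closed" concrete, here is a minimal Python sketch of the idea, not Orloj's actual implementation. `Grant`, `AgentPolicy`, and `execute` are hypothetical names, and the policy is just an in-memory stand-in for the tool permissions in the manifest above. The point it illustrates: the check runs before the tool is invoked, and anything not explicitly granted is denied.

```python
from dataclasses import dataclass, field


class PermissionDenied(Exception):
    """Raised before a tool call is dispatched."""


@dataclass(frozen=True)
class Grant:
    action: str    # e.g. "read" or "write"
    resource: str  # e.g. "compliance_docs"


@dataclass
class AgentPolicy:
    # Hypothetical in-memory form of the manifest's per-tool permissions.
    grants: dict[str, set[Grant]] = field(default_factory=dict)

    def allows(self, tool: str, action: str, resource: str) -> bool:
        # Unknown tools have no grants, so the default answer is "no".
        return Grant(action, resource) in self.grants.get(tool, set())


def execute(policy, tool, action, resource, call):
    # Check first; the tool function is never invoked on a denial.
    if not policy.allows(tool, action, resource):
        raise PermissionDenied(f"{tool}: {action} on {resource} is not granted")
    return call()


policy = AgentPolicy({
    "scan-documents": {Grant("read", "compliance_docs")},
    "flag-issues": {Grant("write", "issue_tracker")},
})
```

With this policy, `execute(policy, "scan-documents", "read", "compliance_docs", fn)` runs `fn`, while the same call with `"write"` raises `PermissionDenied` without ever calling `fn`.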

Why this exists

The short version: running AI agents in production today looks like running containers before Kubernetes. Ad-hoc scripts, no governance, no observability, no standard way to manage an agent fleet.

The longer version starts with a question I kept hearing from engineering teams: "How do we actually run these things?" Not "how do we build an agent" but how do we:

  • Control what each agent can access
  • Know what happened at 3am when something fails
  • Handle partial failures without losing work
  • Scale from one agent to fifty
  • Prove to compliance that agents stayed in bounds

Nobody had a good answer. Every team I talked to was either rebuilding this infrastructure from scratch or just... not doing it. Running agents with no governance and hoping nothing went wrong. In 2026, with agents touching production databases and customer data, that's not acceptable.

How it works

Orloj runs on a server/worker architecture. The server manages state, schedules work, enforces governance policies. Workers pick up tasks, execute agent logic, report back. This separation matters for a few reasons.

Workers can scale independently. If your agent workload spikes, add workers. The server doesn't care how many workers are running. Failed tasks get retried according to the policy you defined in the manifest (capped exponential backoff, jitter, configurable max attempts). If a task fails too many times, it goes to a dead-letter queue instead of disappearing.
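For readers who haven't built this before, the retry policy described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Orloj's code: `backoff_delay` and `run_with_retries` are hypothetical names, the defaults are arbitrary, the jitter shown is the "full jitter" variant (one of several), and the dead-letter queue is a plain list.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Delay before the next try, for a 1-based attempt number."""
    # Exponential growth, capped so delays never exceed `cap` seconds.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    # Full jitter: pick uniformly in [0, ceiling] to spread out retries.
    return random.uniform(0, ceiling)


def run_with_retries(task, max_attempts, dead_letters, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: park the failure instead of dropping it.
                dead_letters.append((task, repr(exc)))
                return None
            sleep(backoff_delay(attempt))
```

The `sleep` parameter exists only so the loop can be exercised without real waiting, for example by passing `sleep=lambda seconds: None`.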

Workers hold time-bounded leases on tasks. If a worker dies mid-execution, the lease expires and another worker picks up the work. No zombie tasks. No paging someone at 2am to manually restart a process because a workflow got stuck.
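The lease mechanism is easiest to see in a toy model. The Python sketch below is not how Orloj stores tasks; `Lease`, `TaskQueue`, and the explicit `now` argument are assumptions made to keep the example deterministic. It shows the two properties that matter: an expired lease makes the task claimable again, and a worker whose lease has lapsed can no longer complete the task.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    worker: str
    expires_at: float


class TaskQueue:
    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.leases = {}  # task_id -> Lease, or None if unclaimed

    def add(self, task_id):
        self.leases[task_id] = None

    def claim(self, worker, now):
        # A task is claimable if nobody holds it or the holder's lease expired.
        for task_id, lease in self.leases.items():
            if lease is None or lease.expires_at <= now:
                self.leases[task_id] = Lease(worker, now + self.lease_seconds)
                return task_id
        return None

    def complete(self, task_id, worker, now):
        # Only the current, unexpired lease holder may finish the task.
        lease = self.leases.get(task_id)
        if lease is None or lease.worker != worker or lease.expires_at <= now:
            return False
        del self.leases[task_id]
        return True
```

A real system would also need lease renewal for long-running tasks and an atomic claim in shared storage; both are left out here.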

This is different from most agent frameworks. Typically, your agent runs inside your application process. Agent fails, your application deals with it. Want concurrent agents? Manage concurrency yourself. Want to scale? Scale the entire application. Orloj separates operational concerns from agent logic so you can handle them independently.

What governance looks like in practice

I keep using the word "governance" and I realize it can sound like corporate overhead. Here's what I actually mean.

When you deploy an agent through Orloj, the runtime knows exactly what that agent is allowed to do. The manifest declares the boundaries and the runtime enforces them. No application-level permission checks that might have a bug. An agent with read: analytics_db can query that database. It cannot write to it, and it cannot touch any other database, regardless of what the LLM decides to try.

Rate limits work the same way. If you set maxTokensPerMinute: 8000, that's enforced at the execution layer. The agent can't burn through your OpenAI budget because it got into a loop. Approval requirements mean certain actions pause and wait for human sign-off before executing. Audit logging captures the full lifecycle of every action: requested, approved or denied, executed.
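As a rough illustration of what enforcing a per-minute token cap involves, here is a small sliding-window budget in Python. It is a sketch under assumptions, not Orloj's limiter: `TokenBudget` and `try_spend` are hypothetical names, and the 60-second sliding window is one possible reading of "per minute" (a fixed window or a token bucket would also fit).

```python
from collections import deque


class TokenBudget:
    def __init__(self, max_tokens_per_minute):
        self.limit = max_tokens_per_minute
        self.window = deque()  # (timestamp, tokens) pairs, oldest first
        self.used = 0

    def try_spend(self, tokens, now):
        # Forget usage that is 60 seconds old or older.
        while self.window and self.window[0][0] <= now - 60:
            _, old_tokens = self.window.popleft()
            self.used -= old_tokens
        # Refuse the request rather than exceed the limit.
        if self.used + tokens > self.limit:
            return False
        self.window.append((now, tokens))
        self.used += tokens
        return True
```

A looping agent that keeps calling `try_spend` after the budget is exhausted just keeps getting `False` until older usage ages out of the window.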

None of this is optional. It's not a plugin you can skip. It ships with the runtime because governance without enforcement is just documentation, and documentation doesn't stop your agent from deleting a production table.

What this is not

Orloj is Apache 2.0. No hosted version, no premium tier, no open-core trick where the free version is missing the features you actually need. The entire codebase is at github.com/OrlojHQ/orloj.

It's v0.1.0, and I'm telling you that upfront. I'm shipping now because the problem is real, the architecture works, and feedback from people running agents in production matters more than three more months of polish. There are rough edges. Some roadmap features aren't built yet.

Getting started

Five minutes to a running system:

curl -sSL https://get.orloj.dev | sh
orloj init my-project
orloj apply -f agent.yaml
orloj status

The repo includes example manifests for common patterns. The docs walk through everything from single-agent setups to multi-agent workflows with governance policies.

What's next

The roadmap is public. Near-term priorities are approval workflows (human-in-the-loop for sensitive operations), compliance templates for regulated industries, and observability tooling (dashboards, tracing, anomaly detection on agent behavior).

Longer-term, the plan is multi-tenancy support so platform teams can offer Orloj as an internal service, plus disaster recovery tooling for when things go sideways.

What I want most right now is feedback. If you're running agents in production, what's broken about your current setup? If you're not running them yet, what's blocking you? What would make Orloj worth trying?

  • GitHub: github.com/OrlojHQ/orloj
  • Docs: orloj.dev/docs
  • Discord: discord.gg/a6bJmPwGS
  • Twitter: @OrlojHQ
