The Missing Layer: Why Agent Frameworks Need an Orchestration Plane

Jon Mandraki, April 7, 2026

A little over a decade ago, if you wanted to run containers in production, you'd write shell scripts. String together some cron jobs. Maybe add a custom scheduler. Ship it, cross your fingers, wake up at 2am to fix it.

Then Kubernetes arrived and changed the model: declare what you want, let the platform handle the rest.

Today, if you want to run AI agents in production, you're back to shell scripts.

You build an agent with LangChain or CrewAI. You deploy it to your cloud. You add logging. You add monitoring. You write a custom scheduler. You implement retry logic. You add permission checks. You build state tracking, restarts for failed runs, audit logs. You handle failures you didn't predict.

And none of it feels production-grade. It feels like containers in 2012.

This is the missing layer.

The Current Stack Has a Hole in It

The modern agent stack looks like this:

Layer 1: Frameworks — LangChain, CrewAI, AutoGen. These give you primitives for building agents: tool calling, context management, multi-step reasoning. They're optimized for the thing that looks cool in a demo: agent logic.

Layer 2: Compute — AWS, GCP, Azure, Kubernetes. These give you servers, containers, networking, persistent storage. They're generic — optimized for running anything, which means optimized for nothing in particular.

Layer 3: ?

That's the gap. Frameworks don't scale. They have no governance. They can't retry intelligently when a model call fails midway. They lack lease-based task ownership to prevent concurrent runs. They weren't designed for operational reality.

And cloud layers are too generic. You can run agent code on Kubernetes or Lambda. That doesn't solve agent-specific problems. You wouldn't run Postgres on raw containers without an operator. You wouldn't run a queue without a controller. Why run agents without a layer built for them?

What Frameworks Can't Do

LangChain, CrewAI, and AutoGen are good at what they're designed for. They're not designed for operations.

Frameworks optimize for expressiveness and speed. They make it easy to build agents. They don't include governance because it's separate from agent logic. Governance is what an agent is allowed to do, not how it reasons.

Frameworks don't have job scheduling. They don't understand deadlines or backpressure. They don't track task ownership across workers. They don't ensure idempotency. They're not built for operators managing them at scale.

You could add these features. You'd be stapling on a different problem entirely. Governance wouldn't be a declared property of the agent. It would be orchestration logic you write. Reliability wouldn't be automatic. You'd implement it in code. Observability would be a library, not native.

Frameworks solve agent building. They don't solve operations.

What Cloud Providers Can't Do

Cloud providers are too generic.

Fargate runs agent code. Kubernetes orchestrates it. Neither understands agents. Kubernetes doesn't know that a failed agent tool call should be retried with exponential backoff and jitter rather than handled like a failed HTTP request. It doesn't know agents need visibility into token usage, tool routing, and model calls. It has no primitive for "this agent can call A and B, not C."
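To make the retry behavior concrete, here is a minimal Python sketch of exponential backoff with full jitter. It is an illustration only: `ToolError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the base and cap values are assumptions, not anything taken from Kubernetes, a framework, or Orloj.

```python
import random
import time


class ToolError(Exception):
    """Hypothetical error type standing in for a failed tool call."""


def backoff_delays(retries, base=0.5, cap=30.0, rng=None):
    """Yield one sleep duration per retry: exponential backoff with full jitter."""
    rng = rng or random.Random()
    for attempt in range(retries):
        # The ceiling doubles on every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so workers don't retry in lockstep.
        yield rng.uniform(0.0, ceiling)


def call_with_retry(tool, retries=4, sleep=time.sleep, rng=None):
    """Run tool(); on ToolError, wait a jittered delay and retry up to `retries` times."""
    delays = backoff_delays(retries, rng=rng)
    while True:
        try:
            return tool()
        except ToolError:
            delay = next(delays, None)
            if delay is None:
                # Retries exhausted: re-raise. A real system might dead-letter the task here.
                raise
            sleep(delay)
```

The `sleep` and `rng` parameters are injectable so the behavior can be tested without real waiting or real randomness.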

You can build this on Kubernetes. Many companies have. But you're writing an orchestration plane while paying for a generic one that doesn't help.

That's redundant.

The Analogy Still Holds

Running AI agents in production without an orchestration plane is like running containers without Kubernetes.

You can do it. Borg existed. Many internal systems run on ad-hoc container scripts. It works until it doesn't. And when it does fail, you're debugging a homegrown scheduler instead of shipping product.

The Kubernetes analogy isn't perfect. Orloj is younger than Docker was when Kubernetes launched. The agent space is still defining what production means. But the structural problem is the same: frameworks solve logic, cloud solves compute, nothing solves operations.

CrewAI themselves published "A Missing Layer in Agentic Systems?" in January 2026. They identified the exact gap we're talking about. They didn't claim they'd fill it — their focus is agent logic, and that's the right choice. But they were explicit that the gap exists.

What the Missing Layer Is

This layer is infrastructure for agents. Not better frameworks. Not more cloud features. A separate concern: orchestration, governance, reliability, observability built in.

It handles:

Declarative configuration: Define agents, tools, permissions, workflows as version-controlled YAML. Apply with a single command, the way you apply manifests in Kubernetes. No code, no scripts.

Governance enforced: Authorization, policies, permissions at execution time. Not optional. Not a plugin. Built in. Unauthorized tool calls fail closed.

Task ownership: Work queues with leases. When a worker picks up a task, it holds a time-bounded lease. If the worker dies, the lease expires and another worker takes the task. No ghost runs. No concurrent execution.

Reliability built in: Exponential retry with jitter. Dead-letter handling for tasks that exhaust retries. Idempotency tracking so replayed tasks don't duplicate side effects. Message queue patterns applied to agents.

Agent-specific observability: Not just metrics and logs. Visibility into tool calls, which models, token usage, decision paths. What happened during the run, not just success or failure.
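Of the items above, lease-based task ownership is the easiest to show in a few lines. The following is a single-process Python sketch under stated assumptions: `LeasedQueue` and its methods are hypothetical names, not Orloj's API, and a real implementation would keep leases in a shared store with atomic claim operations.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    worker_id: str
    expires_at: float


class LeasedQueue:
    """In-memory sketch: each task has at most one unexpired lease at a time."""

    def __init__(self, lease_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._lease_seconds = lease_seconds
        self._clock = clock
        self._leases: Dict[str, Optional[Lease]] = {}

    def add(self, task_id: str) -> None:
        self._leases.setdefault(task_id, None)

    def claim(self, worker_id: str) -> Optional[str]:
        """Return a task whose lease is absent or expired, or None if all are held."""
        now = self._clock()
        for task_id, lease in self._leases.items():
            if lease is None or lease.expires_at <= now:
                self._leases[task_id] = Lease(worker_id, now + self._lease_seconds)
                return task_id
        return None

    def _holds(self, task_id: str, worker_id: str) -> bool:
        lease = self._leases.get(task_id)
        return (lease is not None and lease.worker_id == worker_id
                and lease.expires_at > self._clock())

    def renew(self, task_id: str, worker_id: str) -> bool:
        """Extend the lease; fails if this worker no longer holds it."""
        if not self._holds(task_id, worker_id):
            return False
        self._leases[task_id].expires_at = self._clock() + self._lease_seconds
        return True

    def complete(self, task_id: str, worker_id: str) -> bool:
        """Remove the task only if the caller still holds a live lease."""
        if not self._holds(task_id, worker_id):
            return False
        del self._leases[task_id]
        return True
```

A worker that stalls past its lease cannot complete the task afterwards, which is what rules out ghost runs: `complete` returns `False` once another worker has taken over.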

How This Differs from Frameworks

This is critical: agents on the orchestration plane aren't wrappers around LangChain or CrewAI. They're agents built in the orchestration layer.

You define an agent as a manifest. It declares the model, the tools it can call, behavior constraints, and error handling. The orchestration plane executes it. You don't build it in Python and then wrap it. You declare the agent, and the plane handles execution.
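The idea of a declared agent whose permissions are enforced at execution time can be modeled in a few lines of Python. This is a sketch of the concept only: the field names and the `execute_tool_call` function are invented for illustration and do not reflect Orloj's actual manifest schema, which would be expressed in YAML rather than Python.

```python
from dataclasses import dataclass


class ToolNotAllowed(Exception):
    """Raised when an agent attempts a tool call its manifest does not permit."""


@dataclass(frozen=True)
class AgentManifest:
    # Hypothetical fields, not a real schema.
    name: str
    model: str
    allowed_tools: frozenset = frozenset()
    max_retries: int = 3


def execute_tool_call(manifest, tool_name, registry, **kwargs):
    """Run a tool on the agent's behalf, failing closed on anything not declared."""
    if tool_name not in manifest.allowed_tools:
        raise ToolNotAllowed(f"{manifest.name} may not call {tool_name!r}")
    # A declared tool that is missing from the registry raises KeyError.
    return registry[tool_name](**kwargs)


# Usage: the agent may call "search" but not "delete_db".
registry = {
    "search": lambda query: f"results for {query}",
    "delete_db": lambda: "gone",
}
agent = AgentManifest(
    name="researcher",
    model="example-model",  # placeholder model name
    allowed_tools=frozenset({"search"}),
)
result = execute_tool_call(agent, "search", registry, query="leases")
```

The check lives in the executor, not in the agent's own code, so the agent cannot skip it: a tool that is not declared is rejected before it runs.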

That's how Orloj works. It avoids bolting production infrastructure onto frameworks not built for it. Agents are part of infrastructure from the start.

Why Now

The agent space has reached a pivot point. A year ago, most deployments were experiments. Now, enterprises want SLOs. They ask: can I run agents like I run databases? Can I get audit logs? Can I enforce policies?

Frameworks and clouds say "maybe, if you build it yourself."

An orchestration plane fills this gap. Not a framework. Not cloud. In between. It speaks operations, not just agent logic.

Orloj is built on this. Agents as infrastructure. Governance built in. Reliability patterns from the modern stack applied to agents. Declarative YAML so you version control agents like you version control deployments.

We're not reimagining how agents think. We're building the operations layer that should exist.
