
Agent Infrastructure Glossary: 30 Terms You Need to Know

Jon Mandraki

If you're evaluating agent frameworks or building agent systems in production, you'll hit a wall of terminology fast. Some of it is jargon. Some is borrowed from systems infrastructure. Some is specific to the agent orchestration problem space. This glossary covers 30 core terms you need to understand. Each definition is practical — no marketing, no hand-waving, just what the term means and why it matters.

Core Concepts

Agent A software entity that perceives its environment through observations, selects actions based on a policy (often a large language model), and executes those actions through tools. In Orloj, an agent is a declared unit with a model, a set of tools, and constraints on what it can do.

Multi-Agent System A directed graph of agents that can call each other, exchange data, and coordinate on tasks. The agents might run sequentially (one calls the next), in parallel (many agents work simultaneously), or in loops (an agent calls itself with new context). Multi-agent systems let you distribute work and reasoning across specialized agents.

Agent Orchestration The process of scheduling agent execution, routing work to the right agents, enforcing policies, and handling retries and failures at the system level. Orchestration happens at runtime. It's what actually makes a multi-agent system run reliably in production. Without orchestration, you have a diagram. With it, you have a system.

Agent System In Orloj, an AgentSystem is the declared YAML resource that defines how agents are wired together — which agents exist, how they connect, what data flows between them, and what constraints apply to the whole system. Think of it as the dependency graph and governance rules in one place.

Agent Runtime The software layer that interprets agent definitions, executes agents, manages task scheduling, enforces policies, handles retries, collects observability data, and provides APIs to interact with running agents. Orloj is an agent runtime.

Orchestration Patterns

Pipeline A sequence of agents where the output of one agent becomes the input to the next. Agent A runs, produces a result, Agent B consumes that result, produces its own, and so on. Pipelines are simple to reason about but can be fragile if intermediate agents fail.

Hierarchical A multi-agent structure where agents at higher levels decompose work and delegate to agents at lower levels, then synthesize the results. A manager agent might break down a complex request, send sub-tasks to worker agents, and combine their answers. Useful for divide-and-conquer problems.

Fan-Out / Fan-In Fan-out: one agent triggers multiple other agents to work in parallel. Fan-in: multiple agents' results are collected and combined into a single result. Together they let you parallelize work (faster) and then merge the outcomes. The challenge is that fan-in only works if you know how to combine N results into one coherent answer.
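As a rough illustration of fan-out / fan-in, here is a minimal Python sketch using asyncio. The `worker` coroutine is a hypothetical stand-in for a real agent call, and `merge` is deliberately naive (it sorts and joins the results); in practice the combine step is the hard, application-specific part.

```python
import asyncio


async def worker(name: str, query: str) -> str:
    # Hypothetical stand-in for calling a real agent.
    await asyncio.sleep(0)
    return f"{name}: result for {query!r}"


def merge(results: list[str]) -> str:
    # Fan-in: combine N results into one answer. Here we only sort and join.
    return "\n".join(sorted(results))


async def fan_out_fan_in(query: str, workers: list[str]) -> str:
    # Fan-out: start every worker concurrently and wait for all of them.
    results = await asyncio.gather(*(worker(w, query) for w in workers))
    return merge(list(results))


if __name__ == "__main__":
    print(asyncio.run(fan_out_fan_in("q3 revenue", ["search", "sql", "docs"])))
```

Note that `asyncio.gather` raises as soon as any worker fails unless you pass `return_exceptions=True`, so a real fan-in step also has to decide what to do with partial results.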

Swarm Loop Multiple agents operate independently in a loop, each checking for work, processing it, and going idle. Swarms don't have a central orchestrator directing traffic. Each agent pulls from a shared queue or topic. Scaling is often easier (add more agents), but observing and debugging is harder.

DAG (Directed Acyclic Graph) A topology where agents and tasks form nodes, dependencies form edges, and there are no cycles. Because a DAG has no cycles, you can analyze it statically: compute a valid execution order, find the longest (critical) path, rule out circular-dependency deadlocks, and identify which nodes can run in parallel. Many agent orchestration frameworks enforce DAG-like structures because cycles require careful handling to avoid infinite loops.
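To make the static-analysis point concrete, here is a small Python sketch that uses the standard library's `graphlib.TopologicalSorter` to group nodes into "waves" that could run in parallel, and to reject graphs that contain a cycle. The agent names in `deps` are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the agents in its value set (hypothetical names).
deps = {
    "fetch": set(),
    "parse": {"fetch"},
    "enrich": {"fetch"},
    "report": {"parse", "enrich"},
}


def execution_waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into waves; nodes in the same wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    print(execution_waves(deps))
    try:
        execution_waves({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected, not a DAG")
```

The number of waves is the length of the longest dependency chain, which bounds how much parallelism can shorten the run.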

Governance

Agent Policy A rule that governs what an agent is allowed to do. Policies can restrict which models an agent can use, how many tokens it can spend, what tools it can access, what data it can read, and what actions it can take. Policies are enforced at execution time, not evaluated retrospectively.

Agent Role A named set of capabilities and constraints assigned to an agent. A "reconciliation agent" role might have permission to access databases, query APIs, and make certain function calls, but not access PII. A "monitoring agent" role might only be able to read metrics. Roles group permissions logically.

Tool Permission An authorization rule that governs whether an agent can invoke a specific tool. Permissions can be binary (allowed/not allowed) or nuanced (allowed if certain conditions are met). Unauthorized tool invocations fail closed — they're rejected at the execution layer, not silently logged.

Fail-Closed When an agent attempts an unauthorized action, the system denies it immediately and returns an error. The agent doesn't proceed as if the action succeeded. Fail-closed is the opposite of fail-open, which would let the action happen but log it for later review (risky). Orloj enforces fail-closed governance by default.
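Here is a minimal Python sketch of a fail-closed permission check. The role names, tool names, and the `ROLE_TOOLS` mapping are hypothetical, and this is not how any particular runtime implements it; the point is only that the check happens before the tool runs and that anything not explicitly allowed is rejected.

```python
class PermissionDenied(Exception):
    """Raised instead of running a tool the role is not allowed to use."""


# Hypothetical mapping from role name to the tools that role may invoke.
ROLE_TOOLS = {
    "monitoring": {"read_metrics"},
    "reconciliation": {"query_db", "call_api"},
}


def invoke_tool(role: str, tool: str, registry: dict, *args):
    allowed = ROLE_TOOLS.get(role, set())  # unknown role: nothing is allowed
    if tool not in allowed:
        # Fail closed: reject before the tool runs, never "allow and log".
        raise PermissionDenied(f"role {role!r} may not call {tool!r}")
    return registry[tool](*args)
```

The default in `ROLE_TOOLS.get(role, set())` is what makes this fail closed: a missing or misspelled role gets an empty permission set rather than full access.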

Fail-Open When an unauthorized action happens, the system allows it but logs it. Fail-open is sometimes necessary in legacy systems where blocking everything would break too much. For new agent systems, fail-closed is almost always the right choice.

Audit Trail A permanent, chronological record of every meaningful action taken by an agent: every tool invocation, every decision point, every state change. Audit trails are immutable (you can't retroactively edit them) and timestamped. They're critical for compliance, incident investigation, and understanding what went wrong.

Token Budget A policy limit on how many tokens an agent can consume in a given period (per execution, per day, per month). Once an agent hits its token budget, further model calls are blocked. Token budgets prevent runaway spend and enforce cost discipline across a fleet of agents.
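A token budget can be sketched as a counter that is checked before each model call. The class below is an illustrative assumption, not a real API: it takes an estimated token count up front and refuses any call that would push usage past the limit.

```python
class BudgetExceeded(Exception):
    """Raised when a model call would push usage past the budget."""


class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Record usage and return the remaining tokens; refuse calls that would overshoot."""
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"{tokens} tokens requested, {self.limit - self.used} remaining"
            )
        self.used += tokens
        return self.limit - self.used
```

A per-day or per-month budget would additionally reset `used` at the period boundary; that part is omitted here.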

Rate Limiting A policy that controls how frequently an agent can invoke a specific tool or call a model. Rate limits prevent hammering external APIs, reduce the blast radius of bugs, and enforce fair resource sharing when multiple agents compete for the same resources.
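One common way to implement a rate limit is a token bucket; the sketch below is one such approach, not necessarily what any given runtime uses. The `clock` parameter is an assumption added so the limiter can be exercised without real waiting.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` calls per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would hold one bucket per (agent, tool) pair and skip or delay the invocation whenever `allow()` returns `False`.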

Reliability

Lease-Based Ownership When a worker claims a task to execute, it acquires a time-bounded lease on that task (e.g., 30 seconds). While the worker holds the lease, no other worker can claim it. If the worker finishes, it releases the lease and marks the task complete. If the lease expires (worker crashed or hung), another worker can reclaim it. This prevents two workers from executing the same task at the same time, provided a worker stops working on a task (or renews its lease) before the lease runs out.
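The claim, release, and expiry rules can be sketched in a few lines of Python. This is an in-memory, single-process illustration with hypothetical names; a real system would keep leases in a shared store and make the claim an atomic compare-and-set.

```python
import time


class LeaseStore:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.leases: dict[str, tuple[str, float]] = {}  # task_id -> (worker_id, expires_at)

    def claim(self, task_id: str, worker_id: str, ttl: float = 30.0) -> bool:
        now = self.clock()
        holder = self.leases.get(task_id)
        if holder is not None and holder[1] > now and holder[0] != worker_id:
            return False  # someone else holds a lease that has not expired
        # Free, expired, or already ours: take (or renew) the lease.
        self.leases[task_id] = (worker_id, now + ttl)
        return True

    def release(self, task_id: str, worker_id: str) -> None:
        holder = self.leases.get(task_id)
        if holder is not None and holder[0] == worker_id:
            del self.leases[task_id]
```

Calling `claim` again with the same worker ID renews the lease, which is how a long-running worker would keep ownership while it is still making progress.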

Dead-Letter Queue A holding area for messages (or tasks) that failed to process successfully after N retries. Instead of discarding them or retrying forever, dead-lettered messages go into a queue where an operator can investigate, fix the underlying problem, and replay them. Dead-letter handling is critical for unattended production systems.
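A minimal sketch of the retry-then-dead-letter flow, assuming an in-memory queue and a hypothetical `handler` callable. Real systems would persist both queues and usually add a delay between attempts.

```python
from collections import deque


def process_with_dlq(tasks, handler, max_attempts: int = 3):
    """Run handler on each task; after max_attempts failures, park the task in the DLQ."""
    queue = deque((task, 0) for task in tasks)
    done, dead = [], []
    while queue:
        task, attempts = queue.popleft()
        try:
            done.append(handler(task))
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the task and the last error so an operator can inspect and replay it.
                dead.append((task, repr(exc)))
            else:
                queue.append((task, attempts))
    return done, dead
```

Storing the last error next to the task is what makes the dead-letter queue useful for investigation rather than just a graveyard.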

Idempotency An operation is idempotent if running it once and running it multiple times produce the same result. "Add 1 to the counter" is not idempotent (run it twice, counter goes up by 2). "Set the user's status to 'active'" is idempotent (run it once or ten times, status is still 'active'). Idempotent agents can be safely retried without fear of data corruption.
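One common way to make a naturally non-idempotent operation safe to retry is to attach a unique request ID and ignore repeats. The sketch below assumes that technique; the `Ledger` class and the request IDs are hypothetical.

```python
class Ledger:
    """'Add to the balance' is not idempotent on its own; request IDs make retries safe."""

    def __init__(self):
        self.balance = 0
        self.seen: set[str] = set()

    def credit(self, request_id: str, amount: int) -> int:
        if request_id in self.seen:
            return self.balance  # duplicate delivery or retry: no second effect
        self.seen.add(request_id)
        self.balance += amount
        return self.balance
```

Retrying `credit("req-1", 50)` any number of times leaves the balance at 50, while a new request ID applies a new credit.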

Retry with Jitter When a task fails, the system retries it. But if all instances fail and all retry at the same moment, you get a thundering herd (a synchronized spike in load). Jitter adds randomness to retry timing: one instance retries after 100ms, another after 350ms, another after 200ms. Retries are spread out instead of synchronized.
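Here is a short Python sketch of retries with jitter. It assumes exponential backoff with "full jitter" (each delay is drawn uniformly between zero and a growing ceiling), which is one popular variant; the `rng` and `sleep` parameters are there only so the behavior can be checked without real randomness or waiting.

```python
import random
import time


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0, rng=random.random):
    """Yield one randomized delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling


def retry(fn, attempts: int = 4, sleep=time.sleep):
    """Call fn up to `attempts` times, sleeping a jittered delay between failures."""
    delays = list(backoff_delays(attempts - 1))
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[i])
```

Because each caller draws its own random delay, a group of workers that failed at the same instant will spread their retries over the backoff window instead of retrying in lockstep.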

Circuit Breaker A mechanism that stops sending requests to a system that is failing or overloaded. When error rates exceed a threshold, the circuit "opens" and new requests are rejected immediately without being sent. After a cool-down period, the circuit "half-opens" and allows a test request through. If that succeeds, the circuit closes and traffic resumes. Circuit breakers prevent cascading failures.
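The closed, open, and half-open states can be sketched as follows. As a simplification, this version trips on a count of consecutive failures rather than an error rate, and all names are hypothetical; the injectable `clock` is only there to make the cool-down easy to exercise.

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without contacting the backend."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown: float = 10.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpen("rejected without calling the backend")
            # Cool-down elapsed: half-open, so let this one call through as a probe.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpen` immediately, which is what keeps a struggling dependency from being buried under more traffic.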

Infrastructure

Model Endpoint The connection details (URL, API credentials, routing rules) for a specific model instance, which is where the actual inference happens. An agent specifies which model endpoints it's allowed to use. Orloj can route calls across multiple endpoints, fall back to alternates if one is down, or implement custom routing logic.

Model Routing The logic that decides which model endpoint an agent's request goes to. Routing can be simple (always use endpoint A) or complex (send routine queries to a cheap model, fall back to a premium model if the cheap one fails). Model routing lets you manage cost and performance trade-offs at runtime.
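A cheap-first router with fallback can be sketched as an ordered list of endpoints that are tried in turn. The endpoint names and callables below are placeholders, not a real client API.

```python
from typing import Callable

Endpoint = tuple[str, Callable[[str], str]]  # (name, function that runs the prompt)


def route(prompt: str, endpoints: list[Endpoint]) -> tuple[str, str]:
    """Try endpoints in order (cheapest first); return (endpoint name, response)."""
    errors = []
    for name, call in endpoints:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # remember why we fell through
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))
```

Returning the endpoint name alongside the response makes it possible to record which model actually served each request, which matters later for cost attribution.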

Tool Isolation Running tools in isolated environments so that a malicious or buggy tool can't crash the agent runtime or access unauthorized data. Tools can run in containers, WASM runtimes, or restricted subprocesses. Isolation increases operational safety at the cost of latency and complexity.

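WASM Isolation Running tools in WebAssembly sandboxes. A WASM module gets its own sandboxed memory and can reach the host system only through functions the runtime explicitly exposes, so it cannot make arbitrary system calls. Execution speed is often close to native code. It's typically lighter weight than containerization and far stronger than no isolation.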

Declarative Agent Management Defining agents, systems, roles, and policies in YAML or similar declarative format, then applying those definitions to a runtime. Declarative management means you version-control your agent definitions, review changes before applying them, and can reproduce issues by replaying the definitions. It's the opposite of imperative management, where you construct agents programmatically.

Infrastructure as Code The practice of defining all infrastructure (networks, databases, agents, policies) as versioned code rather than through manual CLI commands or UI clicks. Changes go through code review, can be diff'd, and can be rolled back. IaC is how you run production systems reliably at scale.

Observability

Agent Observability The ability to see what an agent did: which tools it called, what data it accessed, how long it took, whether it succeeded or failed, and why. Observability requires traces (detailed execution logs), metrics (counts and timings), and structured logging (tagged, queryable data). Without observability, you're flying blind.

Cost Attribution Tracking which agent, which model, and which execution incurred which costs, and summing them up per agent/team/month. Attribution lets you understand which parts of your system are expensive, set budgets, and optimize spend. Without attribution, you get one big bill from your model or cloud provider and no idea what's driving it.

Trace A detailed record of an execution from start to end, showing every step an agent took, every tool invocation, every decision. Traces are hierarchical (a parent span contains child spans) and timestamped. They're essential for debugging and understanding performance bottlenecks.
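To show what "hierarchical and timestamped" means in practice, here is a toy tracer in Python. It is a sketch with made-up names, not a replacement for a real tracing library: each `span` records start and end times and attaches itself to whichever span is currently open.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    start: float
    end: Optional[float] = None
    children: list["Span"] = field(default_factory=list)


class Tracer:
    def __init__(self):
        self.roots: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str):
        s = Span(name, time.monotonic())
        # Attach to the currently open span, or start a new root.
        (self._stack[-1].children if self._stack else self.roots).append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.monotonic()
            self._stack.pop()
```

Wrapping an agent run in `with tracer.span("run"):` and each tool call in a nested `with tracer.span("tool:search"):` yields a tree whose per-span durations point at the slow step.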

Structured Logging Logging that emits messages as key-value pairs or JSON, not as free-form text. Structured logs are machine-parseable and queryable. You can filter logs by agent, by status, by timestamp, by error type. Structured logging is what makes observability at scale possible.
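With Python's standard `logging` module, structured output can be sketched as a formatter that emits one JSON object per line. The `fields` attribute and the field names in the usage example are conventions chosen for this sketch, not part of the logging API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with level, message, and extra fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # "fields" is this sketch's own convention
        }
        return json.dumps(payload, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("agent")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `logger.info("tool call finished", extra={"fields": {"agent": "reconciler", "status": "ok"}})` then produces a line that can be filtered by `agent` or `status` without parsing free-form text.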


How These Terms Fit Together

These 30 terms describe the problem space of agent infrastructure. A practical agent system in production uses most of them:

You define your agents in an agent system (declarative YAML). The agent runtime (like Orloj) interprets those definitions and manages execution. Agents are organized into a DAG or pipeline based on how they call each other. Each agent has an agent role that restricts it to certain tool permissions. The runtime enforces fail-closed governance: unauthorized actions are rejected. When an agent calls a model, the runtime routes the request to a model endpoint using model routing logic. Long-running executions use lease-based ownership to prevent duplicate work. Failed tasks are retried with jitter, and if they fail repeatedly, they go to a dead-letter queue. Agents respect token budgets and rate limits. Tools run in isolated environments. Every execution generates traces and structured logs that feed into an audit trail. Cost attribution rolls up spend per agent. Observability tools let you query what happened and debug incidents.

The glossary entries above provide the foundation. Use them to understand product documentation, evaluate frameworks, and have precise conversations with colleagues.
