
Declarative Agent Management: Why YAML Beats Imperative Code for Production

Jon Mandraki

Running AI agents in production today looks like containers before Kubernetes. Ad-hoc scripts, permissions scattered through code, retry logic hardcoded, no clear record of what's running. Something breaks at 3am, you dig through Python. You want to change permissions, you edit code and hope.

This is the wrong model. But understanding why requires looking at what actually happens when you try to run agents at scale.

The Problem with Imperative Agent Code

Let's be concrete. Suppose you have an agent that can access a database, write to S3, and call an internal API. Here's what imperative Python looks like:

agent = Agent(
    name="data_processor",
    model="claude-3-5-sonnet",
    tools=[
        DatabaseTool(
            connection_string=get_secret("db_url"),
            allowed_databases=["analytics", "staging"]
        ),
        S3Tool(
            bucket="reports-prod",
            prefix="daily-exports",
            credentials=get_secret("aws_key")
        ),
        InternalAPICaller(
            base_url="https://api.internal.example.com",
            api_key=get_secret("internal_api_key"),
            rate_limit=100
        )
    ],
    max_retries=3,
    retry_backoff=2,
    timeout=300,
    system_prompt="You are a data processing agent..."
)

This works, but it's missing things:

  • Permissions are hidden. Tool definitions live in Python. A security engineer has to trace credential lookups and guess which databases the agent can access. The allowed_databases list can't be verified from outside the code.

  • Retry logic is baked in. Want to add jitter to the backoff? Edit code. Want different retry behavior for different tools? Wrap each one separately. That becomes a maintenance problem.

  • No drift detection. What runs in production? Maybe what's in main. Maybe not. If someone SSH'd in and edited the script, the running code diverges silently from the repository. Containers solved this with manifests. Agents have no equivalent.

  • Auditing is hard. Need to know permissions on March 15? Check git log and reconstruct. Policy changes aren't reviewable events. They're commits buried in PRs that changed other things.

This is fine for a one-off script. For production agent fleets running workloads you care about, it's a problem.

Why Infrastructure-as-Code Thinking Applies to Agents

Infrastructure-as-code solved these problems for containers. The core idea: don't manage infrastructure by running commands or editing live systems. Describe your desired state in a version-controlled file. The system enforces it.

Agents follow the same logic:

Version control. Every change is a git commit with author, timestamp, message. Something breaks, you know when and by whom.

Auditability. YAML is readable. Permissions aren't scattered. They're one block in a PR. Your compliance team sees what the agent accesses and approves before launch.

Drift detection. Orloj compares declared state (manifest) to actual state (running). If someone changes config outside deployment, you get an alert. Essential for production.

Reproducibility. Same manifest to staging and prod means same behavior. No "works on my machine." No hidden environment variables.

Rollback. Deployment breaks? Revert to previous manifest and reapply. One command with full history.

These aren't easy to bolt onto imperative Python. They need a system built around declarations.

Declarative vs. Imperative: A Side-by-Side

Here's the same agent in Orloj YAML:

apiVersion: orloj/v1
kind: Agent
metadata:
  name: data_processor
  namespace: default
spec:
  model: claude-3-5-sonnet
  system_prompt: |
    You are a data processing agent...
  tools:
    - name: database_query
      type: tool_ref
      ref: analytics-db
      permissions:
        - action: query
          resource: databases
          constraints:
            allowed_databases: ["analytics", "staging"]
    - name: s3_export
      type: tool_ref
      ref: s3-writer
      permissions:
        - action: write
          resource: s3://reports-prod/daily-exports
    - name: api_call
      type: tool_ref
      ref: internal-api
      permissions:
        - action: invoke
          resource: https://api.internal.example.com
          constraints:
            rate_limit: 100
  runtime:
    timeout: 300s
    retry:
      max_attempts: 3
      strategy: exponential_backoff
      backoff_base: 2

Security review becomes simpler:

The manifest is readable. Permissions are explicit. A git diff shows exactly what changed, and the PR records who approved it. You can require security sign-off on those changes.

Want to add jitter to the backoff? Edit the retry block. No code change, no rebuild, no retest of agent logic. Apply the manifest and Orloj enforces it. The change is visible in git as soon as it's committed.
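To make the retry point concrete, here is a minimal Python sketch of a runtime that treats the retry block as data. This is not Orloj's implementation. The helper name retry_delays and the exponential_backoff_jitter strategy value are assumptions for illustration, and so is reading max_attempts as the total number of attempts including the first. The max_attempts, strategy, and backoff_base fields mirror the manifest above.

```python
import random


def retry_delays(retry_cfg, rng=None):
    """Return the wait in seconds before each retry for a retry block.

    Hypothetical helper: field names mirror the manifest's retry block;
    the "exponential_backoff_jitter" strategy name is an assumption.
    """
    rng = rng or random.Random()
    base = retry_cfg.get("backoff_base", 2)
    attempts = retry_cfg["max_attempts"]
    strategy = retry_cfg["strategy"]

    delays = []
    # The first attempt runs immediately; waits happen before retries only.
    for retry_number in range(1, attempts):
        ceiling = base ** retry_number
        if strategy == "exponential_backoff":
            delays.append(float(ceiling))
        elif strategy == "exponential_backoff_jitter":
            # "Full jitter": wait a random time between 0 and the ceiling.
            delays.append(rng.uniform(0, ceiling))
        else:
            raise ValueError(f"unknown retry strategy: {strategy!r}")
    return delays


declared = {"max_attempts": 3, "strategy": "exponential_backoff", "backoff_base": 2}
print(retry_delays(declared))  # [2.0, 4.0]
```

Because the strategy is a value rather than a code path in each agent, switching it is a one-field edit to the declared config, and the function that interprets it stays the same.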

What This Enables in Practice

Declarative configuration enables workflows that are hard to get with imperative code:

GitOps for agents. Agent configs live in GitHub. A PR bumps retry timeout and adds a tool. Security team reviews the diff, approves. Once merged, Orloj applies it automatically. No manual commands.

Permission drift detection. Orloj checks running state against the manifest. Tool permission changed outside version control? You get an alert with the diff. Catches rogue or accidental changes.

Reproducible incident response. Agent misbehaves at 3am. Roll back to previous manifest, apply, verify it's fixed. Then investigate, fix, redeploy. Entire trail is in git.

Compliance auditing. Auditor asks "what permissions did agent X have on date Y?" Check out that commit, read the manifest. Answer is immediate and verifiable. Trace every permission change and approval.

Cross-environment consistency. Manifests for staging and prod are identical except for the model endpoint. Test in staging, promote to prod. Same definition, different runtime.
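The cross-environment point can be sketched as a shared base definition plus a small per-environment overlay. This post doesn't say how Orloj expresses environment differences, so treat everything here as an assumption: the apply_overlay helper, the model_endpoint field name, and the example URLs are all hypothetical, and the manifest is shown as a plain Python dict to keep the sketch dependency-free.

```python
import copy


def apply_overlay(base, overlay):
    """Return a new manifest: base with overlay values merged in recursively.

    Hypothetical helper; how Orloj handles per-environment differences
    is not specified in this post.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared definition, trimmed from the manifest shown earlier.
base = {
    "kind": "Agent",
    "metadata": {"name": "data_processor"},
    "spec": {
        "model": "claude-3-5-sonnet",
        "runtime": {"timeout": "300s"},
    },
}

# Assumed field name and URLs: the post only says the model endpoint differs.
staging = apply_overlay(base, {"spec": {"model_endpoint": "https://staging.example.com"}})
prod = apply_overlay(base, {"spec": {"model_endpoint": "https://prod.example.com"}})

# Everything except the overlaid field is identical across environments.
differing = {k for k in prod["spec"] if prod["spec"][k] != staging["spec"].get(k)}
print(differing)  # {'model_endpoint'}
```

Keeping the overlay tiny is the point: the smaller the per-environment delta, the more a passing staging run tells you about prod.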

The Trade-Offs

Declarative isn't always better. Some agent behaviors need imperative logic.

If an agent dynamically chooses tools based on context, the manifest declares available tools and permissions. The agent's reasoning chooses which to use. That's fine. Permissions are enforced regardless.

If you compose tools dynamically based on API responses, that's imperative and belongs in code. But the security boundary, meaning what an agent can do, is declarative. Permissions are checked at execution time regardless of internal logic.

If a tool needs complex initialization, parameterize it in the manifest (environment variables, secrets, config) and keep heavy lifting in code. Manifest declares the contract. Code handles implementation.

The boundary: operational policy belongs in the manifest. Retry strategy, timeout, permissions, observability, blast radius. Agent reasoning and tool implementation belong in code.

How Orloj Implements This

Orloj's manifest system:

Declare agent config, tools, permissions, policies in YAML. Lives in version control.

Apply to Orloj. System validates, shows the diff. You approve and it deploys.

Monitor for drift. Orloj checks that running state matches declared state and alerts on divergence.

Rollback by reapplying an older manifest. Full history in git.
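The monitor step boils down to a structural diff between two documents. Here is a minimal Python sketch of that comparison, not Orloj's actual mechanism. The diff_state helper, the flattened dict shape, and the extra "billing" database are hypothetical, and how the running state is collected is left out entirely.

```python
ABSENT = "<absent>"  # placeholder for a key present on only one side


def diff_state(declared, running, path=""):
    """Return (path, declared_value, running_value) for every difference."""
    drift = []
    for key in sorted(set(declared) | set(running)):
        here = f"{path}.{key}" if path else key
        want = declared.get(key, ABSENT)
        have = running.get(key, ABSENT)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse so the report points at the exact field that moved.
            drift.extend(diff_state(want, have, here))
        elif want != have:
            drift.append((here, want, have))
    return drift


declared = {
    "tools": {"database_query": {"allowed_databases": ["analytics", "staging"]}},
    "runtime": {"timeout": "300s"},
}
# Hypothetical running state: someone added a database outside version control.
running = {
    "tools": {"database_query": {"allowed_databases": ["analytics", "staging", "billing"]}},
    "runtime": {"timeout": "300s"},
}

for field, want, have in diff_state(declared, running):
    print(f"DRIFT {field}: declared={want} running={have}")
```

An empty result means the running state matches the manifest. Anything else is the diff that goes into the alert.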

Orloj enforces policy at execution. Unauthorized tool calls fail closed. If a manifest limits queries to specific databases, the runtime blocks others. No exceptions, no code overrides.
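Fail-closed enforcement can be sketched in a few lines of Python. This is an illustration under assumptions, not Orloj's runtime: the authorize function, the PermissionDenied exception, the database parameter name, and the "billing" database are all hypothetical. The tool entry mirrors the database_query block from the manifest shown earlier.

```python
class PermissionDenied(Exception):
    """Raised when a tool call is not covered by a declared permission."""


def authorize(tool_specs, tool_name, action, **params):
    """Allow a call only if a declared permission matches; otherwise raise."""
    for tool in tool_specs:
        if tool["name"] != tool_name:
            continue
        for perm in tool.get("permissions", []):
            if perm["action"] != action:
                continue
            allowed = perm.get("constraints", {}).get("allowed_databases")
            if allowed is not None and params.get("database") not in allowed:
                continue
            return  # an explicit grant matched
    # Fail closed: no matching grant means the call is rejected.
    raise PermissionDenied(f"{tool_name}.{action} not permitted for {params}")


tools = [
    {
        "name": "database_query",
        "permissions": [
            {
                "action": "query",
                "resource": "databases",
                "constraints": {"allowed_databases": ["analytics", "staging"]},
            }
        ],
    }
]

authorize(tools, "database_query", "query", database="analytics")  # allowed

try:
    authorize(tools, "database_query", "query", database="billing")
except PermissionDenied as exc:
    print("blocked:", exc)
```

The important design choice is the default: the function never returns successfully unless it finds an explicit grant, so an unknown tool, an unknown action, or a missing constraint value all end in a denial.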

Why This Matters

At small scale, this doesn't matter. One agent in one environment? Imperative code is fine.

At production scale, you need infrastructure discipline. Changes must be reviewable, auditable, reversible. You need to know what's running and prove it. You need to prevent permission escalation and detect bypasses.

These are basic operational requirements. We've solved them for containers, databases, network policy. Orloj applies the same to agents.

YAML isn't as fun as Python. But when explaining to your security team why an agent has production database access, a readable manifest with approval trails beats git blame every time.

That's Orloj.
