AI Agent Containment Starts at the Environment Layer

26 May 2026

Anthropic just published how they contain Claude. The number that should stop every platform team: under prompt injection, in a controlled test, Claude completed credential exfiltration 24 times out of 25. The most capable model in the world, wrapped in its maker’s own defences, leaked secrets 96% of the time once an attacker controlled the input.

The lesson isn’t that Claude is unsafe. It’s that no model — however well aligned — can be the last line of defence for an AI agent. Anthropic says so themselves. And if that’s true inside Anthropic, it’s true for every team running an MCP fleet on someone else’s model.

The model is the wrong place to enforce security

Anthropic’s containment architecture has three layers:

Layer	What it is	Guarantee
Environmental controls	Sandboxes, egress allowlists, deterministic boundaries	Hard. Enforced regardless of model behaviour
Model-layer defences	Training, system prompts, classifiers	Probabilistic. “Will never be 100% effective”
External content gating	Controls on tools, connectors, and data entering context	Deterministic interception at the boundary

Their design principle is explicit: “Design for containment at the environment layer first, then steer behaviour at the model layer.” Model-layer defences, in their words, “will never be 100% effective, which is why it can’t stand alone.”

Why do model defences fail so predictably? Because they “anchor on user intent.” Prompt injection rewrites the apparent intent: the agent believes it’s helping the user when it’s actually helping the attacker. Where that exfiltration was stopped at all, it was stopped by deterministic egress controls at the environment layer, not by the model noticing it was being played.

Anthropic’s threat model names three sources of risk: user misuse, model misbehaviour, and external attackers. For an MCP fleet, all three converge on a single chokepoint — the tool call. This is the same three-layer logic Bain applied to agentic AI: the enforceable layer is the one below the model.

MCP fleets multiply the attack surface

Every MCP server you connect is a bundle of tools your agent can invoke. And remote MCP servers — the common case — “can change behaviour at any point”, unlike a local binary you can audit once and trust. You don’t own the upstream. A tool that read a calendar yesterday can exfiltrate it today, and your agent has no way to know the contract changed.

Now multiply by a fleet. Dozens of engineers, each running Claude Code, Cursor, or Codex, each pointed at a dozen MCP servers. The attack surface is people × clients × servers × tools. Prompt-level guidance does not scale across that matrix. Neither does trusting each engineer to vet each server before they wire it in.
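To put a rough number on that matrix, here is a back-of-envelope sketch. The fleet sizes are hypothetical, chosen only to illustrate the multiplication; they are not figures from Anthropic or from any real deployment.

```python
def attack_surface(people, clients_per_person, servers_per_client, tools_per_server):
    """Count distinct (person, client, server, tool) paths a tool call can take."""
    return people * clients_per_person * servers_per_client * tools_per_server


# Hypothetical fleet: 30 engineers, one client each, 12 servers, 10 tools per server.
paths = attack_surface(30, 1, 12, 10)
print(paths)  # 3600
```

Each of those paths is a place where an injected instruction could turn into a real call, which is why per-engineer vetting does not keep up.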

Where enforcement actually has to live

Anthropic names the mechanism precisely, in their External Content Gating layer:

“Tool-call interception via proxies that enforce network and file policy and can inspect return values before they enter the model’s context.”

A proxy. In the request path. Deterministic. That is the architecture — and Anthropic built it internally for their own products. PolicyLayer is that layer for everyone else’s MCP fleets: fleet-wide, vendor-neutral, and enforced before the call ever reaches the upstream. It’s runtime governance at the transport layer, not another guardrail bolted onto the prompt.

AI client  ──▶  PolicyLayer proxy  ──▶  upstream MCP server
                       │
                       ├─ authenticate per-person token
                       ├─ evaluate tools/call against policy  → allow / deny
                       └─ write durable audit record

The agent thinks it’s talking to GitHub, or Linear, or your internal MCP server. It’s talking to PolicyLayer, which evaluates every call against a deterministic policy and forwards only what’s permitted.
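The three steps in the diagram can be sketched in a few lines. This is an illustration of the shape of the flow, not PolicyLayer's implementation: `TOKENS`, `GRANTS`, `AUDIT_LOG`, `handle_tools_call`, and `fake_upstream` are hypothetical names, and the in-memory structures stand in for a real token store and a durable log.

```python
import json
import time

# Hypothetical in-memory state; a real proxy would use a token store and a durable log.
TOKENS = {"tok-alice": "alice"}                   # scoped token -> person
GRANTS = {("alice", "github"): {"list_issues"}}   # (person, server) -> permitted tools
AUDIT_LOG = []                                    # stand-in for a durable audit sink


def handle_tools_call(token, server, tool, args, forward):
    """Authenticate, evaluate, audit, and only then forward a tools/call."""
    person = TOKENS.get(token)
    if person is None:
        decision = "deny:unauthenticated"
    else:
        # Fail closed: a missing grant yields an empty set, so nothing is allowed.
        allowed = GRANTS.get((person, server), set())
        decision = "allow" if tool in allowed else "deny:policy"

    # The audit record is written for every request, allowed or denied.
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(), "person": person, "server": server,
        "tool": tool, "decision": decision,
    }))

    if decision != "allow":
        return {"error": decision}
    return forward(tool, args)  # only permitted calls reach the upstream


def fake_upstream(tool, args):
    """Stand-in for the real MCP server."""
    return {"result": f"{tool} ok"}
```

Calling `handle_tools_call("tok-alice", "github", "delete_repo", {}, fake_upstream)` returns a policy denial without `fake_upstream` ever running, and still leaves an audit record. Nothing in the decision depends on what the model intended.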

What PolicyLayer enforces today

Every tools/call is evaluated before it reaches the upstream. Not a prompt. A rule. The decision is identical whether the agent is helpful, confused, or compromised.
Per-person scoped tokens. Each engineer routes through their own token. Policy and audit bind to a person, not a shared key.
Registered upstreams only. You declare which MCP servers exist; an unknown upstream isn’t reachable through the proxy.
tools/list is filtered. The agent only sees the tools its policy permits — you shrink the attack surface before the model ever considers a call.
Fail-closed. A grant with no policy attached is deny-all at the engine. The default posture is “no”, not “yes”.
Deterministic quotas. Per-tool and cross-tool rate limits are enforced on a reserve-and-rollback path, so a looping agent can’t call a tool without bound.
Durable audit. Every request is recorded independently of the model’s own account of what it did.
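One plausible reading of "reserve-and-rollback" is sketched below: a call reserves a unit against every limit that applies to it, and the reservation is undone if any limit is exhausted or the upstream call fails. The `Quotas` class, the key names, and the absence of time windows are all assumptions made for illustration; the source does not describe PolicyLayer's internals.

```python
class QuotaExceeded(Exception):
    """Raised with the name of the limit that was exhausted."""


class Quotas:
    """Fixed-budget counters keyed by name (hypothetical; no time windows here)."""

    def __init__(self, limits):
        self.limits = dict(limits)
        self.used = {key: 0 for key in limits}

    def reserve(self, keys):
        """Reserve one unit on every key, or on none of them."""
        reserved = []
        for key in keys:
            if self.used[key] >= self.limits[key]:
                # Undo partial reservations so a denied call costs nothing.
                self.rollback(reserved)
                raise QuotaExceeded(key)
            self.used[key] += 1
            reserved.append(key)
        return reserved

    def rollback(self, keys):
        for key in keys:
            self.used[key] -= 1


def call_with_quota(quotas, tool, forward):
    """Charge the per-tool and cross-tool limits, then forward the call."""
    keys = [f"tool:{tool}", "all-tools"]
    reserved = quotas.reserve(keys)
    try:
        return forward()
    except Exception:
        quotas.rollback(reserved)  # a failed upstream call is not charged
        raise
```

Reserving before forwarding is what bounds a looping agent: the counter moves before the upstream sees the call, so the limit holds even when many calls are in flight (a production version would also need locking or an atomic store).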

Routing a client through PolicyLayer is a config change, not an SDK rewrite. The client’s .cursor/mcp.json points at PolicyLayer, not the upstream:

{
  "mcpServers": {
    "github": {
      "url": "https://proxy.policylayer.com/mcp/<server-uuid>/",
      "headers": {
        "Authorization": "Bearer <your-scoped-token>"
      }
    }
  }
}

PolicyLayer evaluates every call from that client against the policy bound to its token:

{
  "version": "1",
  "default": "deny",
  "tools": {
    "list_issues": {},
    "create_issue": {
      "require": [
        {
          "conditions": [
            { "path": "args.repo", "op": "regex", "value": "^policylayer/" }
          ]
        }
      ]
    }
  }
}

Deny by default. This token can list issues, and open them only in repos under policylayer/. Deleting a repository, reading a private org’s code, opening an issue somewhere else — anything outside the rules never reaches GitHub, regardless of what the model was talked into. That’s the difference between deterministic policy and a guardrail.
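To make the semantics concrete, here is a small evaluator for the policy above. It is a sketch under stated assumptions, not PolicyLayer's engine: the function names are hypothetical, the call is modelled as a dict with `name` and `args`, an empty rule is taken to mean "allow unconditionally", and `require` is read as "any one group must have all of its conditions satisfied". The full policy language may define these differently.

```python
import re

POLICY = {
    "version": "1",
    "default": "deny",
    "tools": {
        "list_issues": {},
        "create_issue": {
            "require": [
                {"conditions": [
                    {"path": "args.repo", "op": "regex", "value": "^policylayer/"}
                ]}
            ]
        },
    },
}


def lookup(call, path):
    """Resolve a dotted path such as 'args.repo'; return None if it is absent."""
    node = call
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def condition_holds(call, cond):
    value = lookup(call, cond["path"])
    if cond["op"] == "regex":
        return isinstance(value, str) and re.search(cond["value"], value) is not None
    return False  # unknown operators fail closed


def evaluate(policy, call):
    """Return True if the call is allowed under the policy."""
    rule = policy["tools"].get(call["name"])
    if rule is None:
        return policy.get("default", "deny") == "allow"
    groups = rule.get("require", [])
    if not groups:
        return True  # assumed: a tool listed with no requirements is allowed
    # Assumed: any one group passes if all of its conditions hold.
    return any(all(condition_holds(call, c) for c in g["conditions"]) for g in groups)
```

With this reading, `evaluate(POLICY, {"name": "create_issue", "args": {"repo": "policylayer/docs"}})` is allowed, while the same call against `"evil/policylayer/"` or a call to an unlisted tool is denied. A missing `args.repo` is also denied, because the condition cannot be satisfied.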

Why a dedicated gateway beats rolling your own

The most revealing admission in Anthropic’s post is about their own code: “The software you build yourself is often the weakest.” Their hand-rolled proxies and allowlist implementations failed under adversarial testing, while “battle-tested hypervisors, syscall filters, and container runtimes” held. They cite specifics — a symlink that had to be resolved before path validation or it escaped the sandbox; an exfiltration path that slipped through an approved-domain allowlist.
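The symlink case is a good example of how small these bugs are. The sketch below shows the general pattern, resolving the path before validating it; it is not Anthropic's code, and `is_inside` is a hypothetical helper. It also does not close the race between the check and the later open, which a real sandbox would have to handle.

```python
import os


def is_inside(root, candidate):
    """Return True only if candidate, after resolving symlinks, stays under root."""
    real_root = os.path.realpath(root)
    # Resolve symlinks and '..' first; validating the raw string is the bug.
    real_path = os.path.realpath(os.path.join(real_root, candidate))
    return os.path.commonpath([real_root, real_path]) == real_root
```

A check that only inspected the string `link/secret.txt` would see a path inside the sandbox. Resolving first reveals where `link` actually points.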

If Anthropic’s engineers ship containment bugs in custom proxies, the team standing up a quick MCP allowlist on a Friday afternoon will too. A proxy in the request path is security-critical infrastructure: it parses untrusted input, holds upstream credentials, and makes allow/deny decisions under concurrency. Get it wrong and it fails open. That is exactly the kind of component you want hardened once, by a team that does only this — not reimplemented, subtly broken, at every company that adopts MCP.

Where this is going

Anthropic’s framing points straight at the next set of controls, and the category moves with it:

Response inspection — examining return values before they enter the model’s context, the vector behind tool-result injection attacks.
Exfiltration as a first-class concern — reasoning about data leaving through an approved tool, for example an allowed fetch tool being used to send data outbound.
Drift detection — catching when a remote tool’s behaviour diverges from what you registered.

Each is a natural extension of a deterministic gate that already sits in the path — and the position is what makes them possible. You cannot inspect, constrain, or audit a tool call you never see. PolicyLayer sees every one.
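As one illustration of what that position allows, a gate in the path could fingerprint each tool definition at registration and compare it with later `tools/list` results. This is a sketch of one possible approach, not a description of a PolicyLayer feature: `fingerprint` and `detect_drift` are hypothetical names, and the tool dicts are assumed to carry the `name`, `description`, and `inputSchema` fields that MCP tool listings use.

```python
import hashlib
import json


def fingerprint(tool):
    """Stable hash of a tool definition (name, description, input schema)."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(registered, current_tools):
    """Compare a fresh tools/list result against fingerprints taken at registration."""
    current = {tool["name"]: fingerprint(tool) for tool in current_tools}
    return {
        "added": sorted(set(current) - set(registered)),
        "removed": sorted(set(registered) - set(current)),
        "changed": sorted(
            name for name in current
            if name in registered and current[name] != registered[name]
        ),
    }
```

This only catches changes in what a server advertises. A server that keeps its definitions identical while changing what a tool does would need response inspection or behavioural checks to detect.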

Why this matters

The question was never “is our model safe?” Anthropic just demonstrated that the best-defended model on the market leaked 24 times out of 25 in a controlled test once an attacker controlled the input. The question is whether anything deterministic sits between your agents and the tools they can reach.

If the answer is “we trust the prompt,” you don’t have an answer. The environment layer is the only place containment is enforceable, auditable, and bounded. That’s the layer PolicyLayer operates in — and, on Anthropic’s own evidence, the layer that has to hold.

The NSA just made the case for a policy layer in front of MCP — the NSA’s MCP security guidance, mapped to enforcement
What is MCP policy enforcement? — the category, defined
Anthropic’s MCP playbook is for builders. Defenders need the next layer. — companion piece
Why prompt guardrails fail at agent safety — the probabilistic-defence problem
MCP security beyond guardrails — runtime enforcement

Docs:

Quick Start — register a server, write a policy, route a client
Writing policies — the full policy language
Core concepts — servers, grants, policies, the proxy