
Cap LLM Token Spend on MCP Agents: Cost-Scaled Limits Beyond Call Counts

The bill landed at 02:14. An agent had spent the night looping on a flaky tool, retrying the same request with ever-larger context, and each retry burned roughly thirty thousand tokens against an Anthropic-fronting MCP server. By the time anyone noticed, the day’s inference spend had cleared four figures. The agent never exceeded its per-minute call-count rate limit. It did not have to. Token cost is unevenly distributed across calls (one summarisation request can outweigh a thousand metadata lookups), and counting requests tells you nothing about what those requests cost.

The Problem with Call-Count Limits

A standard rate limit treats every request as equal weight. Fifty calls per minute means fifty calls per minute, whether each call asks for a fifty-token completion or a thirty-two-thousand-token one. The cheap call and the expensive call increment the same counter by one.

That mismatch is acceptable for tools where every call costs roughly the same — a database lookup, an internal CRUD endpoint, a webhook trigger. It collapses for LLM-fronting MCP servers, where a single messages.create can request max_tokens: 64000 and cost several dollars on its own. An agent that throttles itself to ten calls per minute can still burn through a hundred thousand tokens in that minute. The throttle protects against request floods. It does not protect against cost.
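To make the mismatch concrete, here is a minimal Python sketch that tallies two hypothetical minutes of traffic with the same call count. The token figures are illustrative, not measurements, and the helper name is made up for this example.

```python
# Two hypothetical one-minute traffic samples. Each entry is the max_tokens
# value requested by one call. The numbers are illustrative only.
metadata_lookups = [50] * 10        # ten cheap calls
long_generations = [32_000] * 10    # ten expensive calls


def summarize(calls):
    """Return (call_count, total_requested_tokens) for one window."""
    return len(calls), sum(calls)


cheap = summarize(metadata_lookups)
heavy = summarize(long_generations)

# A call-count limiter sees the same number in both cases,
# while the requested token volume differs by a large factor.
ratio = heavy[1] // cheap[1]
print("cheap:", cheap, "heavy:", heavy, "token ratio:", ratio)
```

Both samples sit at ten calls per minute, so a limit of fifty calls per minute passes either one without complaint.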

A second failure mode: agents that retry. When a tool returns an error, many agent loops re-issue the call with the previous context appended. Each retry is heavier than the last. Call counts climb linearly. Cumulative token spend climbs faster, roughly quadratically when each retry adds a similar amount of context. By the time the call-count alarm fires, the damage is done.
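The retry effect can be sketched with simple arithmetic. The model below is an assumption made for illustration: every retry re-sends the earlier context plus a fixed amount of new context. Real agent loops vary, and the starting size and growth per retry are invented numbers.

```python
def retry_token_totals(base_tokens, added_per_retry, retries):
    """Cumulative tokens sent across the first call and its retries.

    Assumed, simplified model: each retry carries the original request
    plus a fixed amount of extra context per earlier attempt.
    """
    totals = []
    running = 0
    for attempt in range(retries + 1):  # attempt 0 is the first call
        request_size = base_tokens + attempt * added_per_retry
        running += request_size
        totals.append(running)
    return totals


totals = retry_token_totals(base_tokens=30_000, added_per_retry=5_000, retries=5)
flat = 30_000 * len(totals)  # what six equally sized calls would have sent
print(totals, "vs flat", flat)
```

Six calls register as six on a call counter, while the cumulative token total pulls further ahead of the flat estimate with every retry.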

Cost-aware limiting needs to track the resource that actually drives the bill, not the number of HTTP requests carrying it. In practice, PolicyLayer can enforce against numeric fields present in the MCP tool-call arguments, such as a requested max_tokens or an internal gateway’s estimated_tokens field. It does not calculate provider-billed input and output tokens on its own before the call runs; the upstream tool has to expose the number you want to meter.

Cost-Scaled Limits in PolicyLayer

PolicyLayer’s Limits primitive supports increment_from, which reads a numeric value out of the request payload and uses that as the increment instead of 1. Point it at args.max_tokens and every request increments the counter by the output ceiling the agent requested, whether or not the model ends up generating that many tokens.

{
  "version": "1",
  "default": "allow",
  "tools": {
    "messages": {
      "limits": [
        {
          "counter": "requested_tokens",
          "window": "hour",
          "scope": "policy",
          "max": 500000,
          "increment_from": "args.max_tokens",
          "on_deny": "Hourly token budget exhausted. Try again next hour."
        }
      ]
    }
  }
}
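As a rough mental model of what a cost-scaled counter does, here is a toy Python sketch. It is not PolicyLayer's implementation. The class name is hypothetical, and two behaviours are assumptions chosen for the example: a call that would push the counter past max is denied, and a missing or non-integer field is denied.

```python
class TokenBudget:
    """Toy counter that increments by a value read from the call arguments.

    Illustrative only. Window resets are left out to keep the sketch short.
    """

    def __init__(self, max_units, field="max_tokens"):
        self.max_units = max_units
        self.field = field
        self.used = 0

    def check(self, args):
        """Return True and record the spend if the call fits, else False."""
        increment = args.get(self.field)
        # Assumption: missing, boolean, non-integer or negative values are denied.
        if isinstance(increment, bool) or not isinstance(increment, int) or increment < 0:
            return False
        # Assumption: the call that would cross the cap is itself denied.
        if self.used + increment > self.max_units:
            return False
        self.used += increment
        return True


budget = TokenBudget(max_units=500_000)
requests = (200_000, 200_000, 200_000, 100_000)
decisions = [budget.check({"max_tokens": n}) for n in requests]
print(decisions, budget.used)
```

The third request is refused because it would overshoot the budget, while the smaller fourth request still fits. A counter that incremented by 1 would have allowed all four.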

Field semantics, taken from the canonical reference:

Field | Values | Purpose
counter | string | Counter name, such as requested_tokens
window | minute, hour, day | Calendar-aligned UTC window over which max accumulates
scope | grant, server, policy, global | Which counter the request increments
max | integer | Cap before requests are denied
increment | integer | Static increment per call (default 1)
increment_from | args.<field> | Read increment dynamically from request args
on_deny | string | Message returned to the agent when the cap fires
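"Calendar-aligned" means the window is tied to the clock, not to the first request. The Python sketch below shows one way to compute the start of such a window. The function name is hypothetical and this is only an illustration of the idea, not the product's code.

```python
from datetime import datetime, timezone


def window_start(ts, window):
    """Truncate a timezone-aware timestamp to the start of its UTC window."""
    ts = ts.astimezone(timezone.utc)
    if window == "minute":
        return ts.replace(second=0, microsecond=0)
    if window == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if window == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown window: {window}")


# Two requests 35 seconds apart that straddle the top of the hour.
a = datetime(2024, 5, 1, 13, 59, 30, tzinfo=timezone.utc)
b = datetime(2024, 5, 1, 14, 0, 5, tzinfo=timezone.utc)

print(window_start(a, "hour"), window_start(b, "hour"))
```

Under this model the two requests count against different hourly budgets but the same daily one, which is worth remembering when a burst lands right at a boundary.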

Scope choice matters. grant caps each grant independently — useful when one engineer’s runaway loop should not starve the rest of the team. server aggregates across every grant pointing at the same MCP server. policy aggregates across every grant attached to that policy on the server. global is deployment-wide, so use it sparingly.
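One way to picture scope is as the key under which the counter is stored: requests that map to the same key share a budget. The Python sketch below follows the descriptions above. The key layout and the identifiers are hypothetical and chosen only for illustration.

```python
def counter_key(scope, counter, *, grant_id, server_id, policy_id):
    """Return a hypothetical storage key for a counter under a given scope."""
    if scope == "grant":
        return ("grant", grant_id, counter)
    if scope == "server":
        return ("server", server_id, counter)
    if scope == "policy":
        return ("policy", server_id, policy_id, counter)
    if scope == "global":
        return ("global", counter)
    raise ValueError(f"unknown scope: {scope}")


# Two made-up grants on the same server, attached to the same policy.
alice = dict(grant_id="g-alice", server_id="srv-1", policy_id="pol-1")
bob = dict(grant_id="g-bob", server_id="srv-1", policy_id="pol-1")

grant_keys = (counter_key("grant", "requested_tokens", **alice),
              counter_key("grant", "requested_tokens", **bob))
policy_keys = (counter_key("policy", "requested_tokens", **alice),
               counter_key("policy", "requested_tokens", **bob))
print(grant_keys, policy_keys)
```

With grant scope the two engineers draw from separate budgets. With policy scope they draw from one.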

The Limits primitive sits alongside Require, Deny if, and Hide. Limits are not conditional today: they apply to the tool whenever that tool call passes Require and Deny if. If you need model-specific controls, split expensive models into a separate tool or grant, or deny those model names outright with a deny_if rule:

{
  "version": "1",
  "default": "allow",
  "tools": {
    "messages": {
      "deny_if": [
        {
          "conditions": [
            { "path": "args.model", "op": "contains", "value": "opus" }
          ],
          "on_deny": "Opus is not available on this grant."
        }
      ]
    }
  }
}
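To reason about which calls such a rule catches, a small evaluator can help. The Python sketch below is an approximation written for this article, not the product's matching logic. It assumes that every condition in a rule must match for the rule to fire, it handles only the contains operator shown above, and the model names are invented.

```python
def resolve(path, request):
    """Walk a dotted path such as 'args.model' through nested dicts."""
    value = request
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def rule_matches(rule, request):
    """Assumption: a rule fires only when all of its conditions match."""
    for cond in rule["conditions"]:
        if cond["op"] != "contains":
            raise NotImplementedError(cond["op"])
        actual = resolve(cond["path"], request)
        if not (isinstance(actual, str) and cond["value"] in actual):
            return False
    return True


rule = {
    "conditions": [{"path": "args.model", "op": "contains", "value": "opus"}],
    "on_deny": "Opus is not available on this grant.",
}

# Hypothetical model names, used only to exercise the rule.
blocked = rule_matches(rule, {"args": {"model": "example-opus-large"}})
allowed = not rule_matches(rule, {"args": {"model": "example-sonnet-small"}})
print(blocked, allowed)
```

Because contains is a substring match, it catches any model name that includes the given value, so pick a value that does not also appear in the names you want to allow.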

Honest note: limits reset at the window boundary. They are throttles, not kill-switches. A daily cap resets at midnight UTC. If you need a hard ceiling for the month, layer a day limit under a manual review process, or use Hide to drop the tool entirely until reinstated.
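The gap between a throttle and a ceiling is easy to quantify. The Python sketch below computes the worst case a daily cap permits over a calendar month, assuming the cap is fully consumed every day. The cap value is an example, not a recommendation.

```python
import calendar


def worst_case_monthly_units(daily_max, year, month):
    """Upper bound on units consumed in a month under a daily cap."""
    days_in_month = calendar.monthrange(year, month)[1]
    return daily_max * days_in_month


# Example: a cap of 2,000,000 requested tokens per day in February 2024.
exposure = worst_case_monthly_units(2_000_000, 2024, 2)
print(exposure)
```

If that number is larger than the monthly budget you can tolerate, the daily cap alone is not enough and the manual review step carries real weight.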

Every denied limit writes a rule pointer to the proxy log feed — for the configuration above, /tools/messages/limits/requested_tokens-hour — together with the on_deny message that went back to the agent. The dashboard’s log feed surfaces both, so the engineer who got the deny knows exactly which limit fired.

Getting Started

Three steps, assuming the dashboard is already running.

  1. Register the upstream LLM MCP server. In the dashboard, add the MCP server’s URL — your Anthropic-fronting MCP, OpenRouter MCP, or internal inference gateway. PolicyLayer issues a server UUID and exposes the proxy at /mcp/<server-uuid>/. Issue one labelled Grant per developer or automation; each client points at the proxy URL with its grant bearer token.

  2. Configure the policy. In the Policy Editor, create a policy for the server and add a Limits rule on the tool that consumes tokens — for an Anthropic-fronting MCP that’s usually messages or messages.create. Set counter, window: "hour", choose a scope, and use increment_from: "args.max_tokens" if that integer field exists in the tool schema. Attach the policy to the Grant that should enforce it.

  3. Test by exceeding the cap. Drop the max to something small — say five thousand tokens — and ask the agent to run a few long generations. Watch the proxy log feed. The first calls succeed; once cumulative max_tokens crosses the cap, subsequent calls are denied with your on_deny message and the rule pointer logged. Raise max to your production budget and the same rule applies under real traffic.
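For step 3, it can help to plan the test traffic before pointing a real agent at the proxy. The Python sketch below builds MCP tools/call request bodies in JSON-RPC 2.0 form and checks that the plan will cross a five-thousand-token cap. It sends nothing over the network. Treating args.max_tokens as the max_tokens key inside the call's arguments object is an assumption, and any other arguments depend on your tool's schema.

```python
import json


def tools_call_body(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 'tools/call' request body as a string."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


CAP = 5_000  # the temporary test value for max

# Hypothetical plan: three calls to the 'messages' tool, 2,000 tokens each.
# Real calls would carry whatever other arguments the tool requires.
bodies = [tools_call_body(i, "messages", {"max_tokens": 2_000}) for i in (1, 2, 3)]

requested = sum(json.loads(b)["params"]["arguments"]["max_tokens"] for b in bodies)
print(requested, requested > CAP)
```

With this plan the running total passes the cap on the third call, so a denial should show up in the log feed by then at the latest.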

If the LLM MCP server exposes tools you do not want any agent calling at all — image generation, embeddings on a separate billing line — use Hide to strip them from the tools/list handshake. Whole tools only; * hides everything.
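The effect of Hide on the tool listing can be pictured as a filter over tool names. The Python sketch below is an illustration only: the function and the tool names are hypothetical, and each tool is modelled as a dict with a name key, as in an MCP tools/list result.

```python
def apply_hide(tools, hidden):
    """Return the tools an agent would still see after hiding.

    Whole-tool matching only. A '*' entry hides everything.
    """
    if "*" in hidden:
        return []
    hidden_names = set(hidden)
    return [tool for tool in tools if tool["name"] not in hidden_names]


# Made-up tool listing from an upstream LLM MCP server.
upstream = [{"name": "messages"}, {"name": "generate_image"}, {"name": "embed"}]

visible = [t["name"] for t in apply_hide(upstream, ["generate_image", "embed"])]
nothing = apply_hide(upstream, ["*"])
print(visible, nothing)
```

An agent that never sees a tool in the listing has no reason to try calling it, which is a stronger guarantee than a limit it could hit repeatedly.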

Why This Matters

Requested generation size becomes bounded by configuration, not by the agent’s restraint. One policy, attached to every relevant grant on the server, controls the shared budget instead of relying on per-developer provider-console settings that drift out of sync. Denials surface in the proxy log feed alongside every other policy decision, so the same dashboard that shows blocked write operations also shows exhausted token budgets — one signal, one timeline, one place to investigate when a teammate reports their agent stopped working.
