What is Pay-Per-Inference?


Pay-per-inference is a pricing model where each AI model inference request (an LLM completion, image generation, embedding, or classification) is paid for individually at the time of the request, rather than through prepaid credits or subscription tiers.

WHY IT MATTERS

Current AI API pricing — OpenAI, Anthropic, Google — uses account-based billing: sign up, add a credit card, prepay credits, track usage against quotas. This creates friction for AI agents that need to dynamically select models based on task requirements.

Pay-per-inference using x402 eliminates this friction entirely. An agent that needs GPT-4-class reasoning for one task and a cheap embedding model for another can pay each provider per-request without maintaining accounts with either. The agent evaluates quality, latency, and price in real-time and routes to the best option.
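A minimal sketch of that routing decision, with made-up provider listings and a deliberately simple policy (cheapest provider that clears a quality bar, ties broken on latency) — real agents would score these dimensions however suits the task:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    price_usd: float    # price per request
    latency_ms: float   # observed latency
    quality: float      # 0-1 benchmark score

def pick_provider(providers, min_quality=0.0):
    """Route to the cheapest provider that clears the quality bar;
    break price ties on latency."""
    eligible = [p for p in providers if p.quality >= min_quality]
    if not eligible:
        raise ValueError("no provider meets the quality requirement")
    return min(eligible, key=lambda p: (p.price_usd, p.latency_ms))

# Hypothetical listings an agent might discover at request time.
providers = [
    Provider("frontier-llm", 0.010, 900, 0.95),
    Provider("mid-tier-llm", 0.002, 400, 0.80),
    Provider("embedder", 0.0001, 50, 0.30),
]

pick_provider(providers, min_quality=0.9).name   # "frontier-llm"
pick_provider(providers, min_quality=0.0).name   # "embedder"
```

Because each x402 request is independently paid, this decision can be re-run per request with no account setup on either provider.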

The x402 exact scheme handles fixed-price inference (e.g. $0.01 per request). The planned upto scheme is even more powerful — the agent authorises up to a maximum amount, and the server settles based on actual tokens generated. This mirrors how LLM pricing actually works (variable output length) but executes at the protocol level.
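To make the contrast concrete, here are illustrative payment requirements for the two schemes. The field names and base-unit amounts are assumptions for illustration, not the exact x402 wire format:

```python
# exact: fixed price, settled as quoted.
exact_requirement = {
    "scheme": "exact",
    "maxAmountRequired": "10000",   # $0.01 in USDC base units (6 decimals)
    "asset": "USDC",
}

# upto (planned): authorise a ceiling, settle on actual usage.
upto_requirement = {
    "scheme": "upto",
    "maxAmountRequired": "100000",  # authorise up to $0.10
    "asset": "USDC",
}
```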

Cloudflare's x402 playground demonstrates this pattern with paid MCP tools — an agent calls a paid tool, receives a 402 challenge, pays, and gets the result. The transaction can require human confirmation or execute autonomously, depending on configuration.
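The request loop above can be sketched as follows. The `pay()` callback, the `X-PAYMENT` header name, and the toy server are simplifications, not the real wire protocol — the point is the shape: call, receive a 402 challenge, optionally confirm, pay, retry with proof:

```python
def fetch_with_x402(http_get, pay, url, confirm=None):
    resp = http_get(url, headers={})
    if resp["status"] != 402:
        return resp                          # free or already paid
    challenge = resp["body"]                 # server's payment requirements
    if confirm is not None and not confirm(challenge):
        raise PermissionError("payment declined")   # human-in-the-loop path
    proof = pay(challenge)                   # sign and submit the payment
    return http_get(url, headers={"X-PAYMENT": proof})

# --- toy server and wallet to exercise the loop ---
def fake_server(url, headers):
    if "X-PAYMENT" in headers:
        return {"status": 200, "body": "tool result"}
    return {"status": 402, "body": {"scheme": "exact", "amount": "10000"}}

def fake_pay(challenge):
    return "signed-payment-for-" + challenge["amount"]

result = fetch_with_x402(fake_server, fake_pay, "https://example.com/tool")
# result == {"status": 200, "body": "tool result"}
```

Passing a `confirm` callback models the human-confirmation configuration; omitting it models fully autonomous payment.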

Pay-per-inference creates a true competitive market for AI services. Providers compete on price and quality for every single request, not just at subscription renewal time.

HOW POLICYLAYER USES THIS

PolicyLayer enforces per-inference spending caps and daily AI spending budgets. When agents autonomously select and pay for inference providers, per-request limits prevent overpayment while aggregate limits cap total AI spending across all providers.
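A minimal sketch of those two layers of control — a per-request cap plus a daily aggregate budget. The class and limits are illustrative, not PolicyLayer's actual API:

```python
class InferenceBudget:
    def __init__(self, per_request_usd, daily_usd):
        self.per_request_usd = per_request_usd
        self.daily_usd = daily_usd
        self.spent_today = 0.0

    def authorise(self, amount_usd):
        """Approve a payment only if it clears both limits."""
        if amount_usd > self.per_request_usd:
            return False            # single request too expensive
        if self.spent_today + amount_usd > self.daily_usd:
            return False            # would exceed the daily budget
        self.spent_today += amount_usd
        return True

budget = InferenceBudget(per_request_usd=0.05, daily_usd=1.00)
budget.authorise(0.01)   # True
budget.authorise(0.10)   # False: exceeds the per-request cap
```

The per-request check stops a single overpriced inference; the aggregate check caps total spend across all providers, however the agent routes.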

FREQUENTLY ASKED QUESTIONS

How does pay-per-inference differ from OpenAI's current pricing?
OpenAI charges per token via account-based billing with API keys. Pay-per-inference via x402 charges per request at the HTTP level with no account required. The agent pays with stablecoins per request — no sign-up, no API key, no prepaid credits.
Can agents switch inference providers mid-task?
Yes, and that's a key advantage. An agent might use a cheap model for summarisation and an expensive model for complex reasoning within the same task. With x402, each request is independent — no vendor lock-in, no wasted prepaid credits.
What about variable-length outputs?
The exact scheme charges a fixed price per request regardless of output length. The planned upto scheme would allow usage-based settlement — authorise up to $0.10, pay only for the tokens actually generated. This better matches LLM economics.
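The settlement arithmetic for the upto case, with a made-up per-token rate:

```python
def settle_upto(authorised_usd, tokens_out, usd_per_token):
    """Settle on actual usage, never exceeding the authorised cap."""
    cost = tokens_out * usd_per_token
    if cost > authorised_usd:
        raise ValueError("usage exceeds authorised cap")
    return round(cost, 6)

# Authorise up to $0.10; the model emits 800 tokens at $0.00002/token.
settle_upto(0.10, 800, 0.00002)   # charges $0.016, not the full $0.10
```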

FURTHER READING

Enforce policies on every tool call

Intercept is the open-source MCP proxy that enforces YAML policies on AI agent tool calls. No code changes needed.

npx -y @policylayer/intercept
github.com/policylayer/intercept →