What is Pay-Per-Inference?
Pay-per-inference is a pricing model where each AI model inference request (an LLM completion, image generation, embedding, or classification) is paid for individually at the time of the request, rather than through prepaid credits or subscription tiers.
WHY IT MATTERS
Current AI API pricing — OpenAI, Anthropic, Google — uses account-based billing: sign up, add a credit card, prepay credits, track usage against quotas. This creates friction for AI agents that need to dynamically select models based on task requirements.
Pay-per-inference using x402 eliminates this friction entirely. An agent that needs GPT-4-class reasoning for one task and a cheap embedding model for another can pay each provider per-request without maintaining an account with either. The agent evaluates quality, latency, and price in real time and routes each request to the best option.
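The routing decision can be sketched as a simple weighted score over candidate providers. Everything here — provider names, prices, and the weighting — is hypothetical, purely to illustrate the selection step an agent would perform before paying:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    quality: float    # 0..1, higher is better (hypothetical benchmark score)
    latency_ms: float
    price_usd: float  # price per request

def score(p: Provider, w_quality=0.6, w_latency=0.2, w_price=0.2) -> float:
    # Map latency and price (lower is better) into 0..1 "goodness" terms,
    # then combine with quality using illustrative weights.
    latency_term = 1.0 / (1.0 + p.latency_ms / 1000.0)
    price_term = 1.0 / (1.0 + p.price_usd * 100.0)
    return w_quality * p.quality + w_latency * latency_term + w_price * price_term

providers = [
    Provider("frontier-llm", quality=0.95, latency_ms=1200, price_usd=0.01),
    Provider("cheap-embed", quality=0.60, latency_ms=80, price_usd=0.0001),
]

best = max(providers, key=score)
```

For a reasoning-heavy task the quality weight dominates and the frontier model wins despite its price; shifting the weights toward price would flip the choice for bulk embedding work.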
The x402 exact scheme handles fixed-price inference (e.g. $0.01 per request). The planned upto scheme is even more powerful — the agent authorises up to a maximum amount, and the server settles based on actual tokens generated. This mirrors how LLM pricing actually works (variable output length) but executes at the protocol level.
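The settlement logic behind an upto authorisation reduces to "charge for actual usage, capped at the authorised maximum." A minimal sketch, with the function name and all prices invented for illustration:

```python
def settle_upto(max_authorized_usd: float, tokens_generated: int,
                price_per_1k_tokens_usd: float) -> float:
    """Settle an 'upto'-style authorisation: the provider charges for the
    tokens actually generated, never exceeding the authorised maximum."""
    actual = tokens_generated / 1000.0 * price_per_1k_tokens_usd
    return round(min(actual, max_authorized_usd), 6)

# Agent authorises up to $0.05; the model generates 1,200 tokens at $0.02/1K.
charge = settle_upto(0.05, 1200, 0.02)   # actual usage, under the cap
capped = settle_upto(0.05, 10000, 0.02)  # usage exceeds the cap, so cap applies
```

This is exactly the shape of variable-length LLM billing: the agent bounds its worst-case exposure up front, and the provider settles the true amount after generation.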
Cloudflare's x402 playground demonstrates this pattern with paid MCP tools — an agent calls a paid tool, receives a 402 challenge, pays, and gets the result. The transaction can require human confirmation or execute autonomously, depending on configuration.
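The challenge/pay/retry loop can be sketched end to end. The server and payment signing are mocked here, and the header and field names follow the general x402 pattern but should be treated as illustrative, not as the exact wire format:

```python
def mock_server(request: dict) -> dict:
    # A paid MCP-style tool: without payment, respond 402 with requirements.
    if "X-PAYMENT" not in request.get("headers", {}):
        return {"status": 402,
                "body": {"accepts": [{"scheme": "exact",
                                      "maxAmountRequired": "10000",  # atomic units
                                      "payTo": "0xProviderAddress"}]}}
    return {"status": 200, "body": {"result": "inference output"}}

def make_payment(requirement: dict) -> str:
    # A real client would sign a payment authorisation for the requested
    # amount; this stand-in just returns an opaque payload.
    return "signed-payment-payload"

def call_paid_tool() -> dict:
    # First attempt: no payment attached.
    resp = mock_server({"headers": {}})
    if resp["status"] == 402:
        # Read the challenge, construct payment, retry with it attached.
        requirement = resp["body"]["accepts"][0]
        payment = make_payment(requirement)
        resp = mock_server({"headers": {"X-PAYMENT": payment}})
    return resp
```

The confirmation step the playground configures (human-in-the-loop vs. autonomous) would sit inside `make_payment`: either prompt before signing, or sign immediately under a policy.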
Pay-per-inference creates a true competitive market for AI services. Providers compete on price and quality for every single request, not just at subscription renewal time.
HOW POLICYLAYER USES THIS
PolicyLayer enforces per-inference spending caps and daily AI spending budgets. When agents autonomously select and pay for inference providers, per-request limits prevent overpayment while aggregate limits cap total AI spending across all providers.
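The two limits compose naturally: a per-request cap rejects any single overpriced inference, while a running daily total caps aggregate spend across every provider. A minimal sketch of that policy check, with the class name and dollar limits invented for illustration:

```python
class SpendPolicy:
    """Illustrative per-inference and daily-aggregate spend checks."""

    def __init__(self, per_request_cap_usd: float, daily_budget_usd: float):
        self.per_request_cap = per_request_cap_usd
        self.daily_budget = daily_budget_usd
        self.spent_today = 0.0  # reset at the start of each day

    def authorize(self, amount_usd: float) -> bool:
        if amount_usd > self.per_request_cap:
            return False  # single request exceeds the per-inference cap
        if self.spent_today + amount_usd > self.daily_budget:
            return False  # would exceed the aggregate daily budget
        self.spent_today += amount_usd
        return True

policy = SpendPolicy(per_request_cap_usd=0.05, daily_budget_usd=1.00)
```

Because the check runs before the payment is signed, a misbehaving or mispriced provider can at worst cost one capped request, and total exposure across all providers stays bounded by the daily budget.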