Inference is the process of running a trained AI model on new inputs to generate outputs. It is the production phase, where models serve real requests, as opposed to training, where models learn.
WHY IT MATTERS
Training is learning; inference is doing. When you send a prompt to an LLM and get a response, that's inference. The model applies its learned patterns to your specific input.
Inference is where AI economics play out. Training is a large, mostly up-front cost, while inference cost recurs with every request, often millions of times. Optimizations such as quantization, caching, and batching therefore directly reduce cost and latency.
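One of these optimizations, caching, can be sketched in a few lines. This is a minimal illustration, assuming an exact-match response cache (production systems often cache at a lower level, such as attention key/value state). The function `run_model` is a hypothetical stand-in for a real model call, not an actual API.

```python
from functools import lru_cache

# Tracks how many times the (hypothetical) model is actually invoked.
stats = {"model_calls": 0}

def run_model(prompt: str) -> str:
    """Hypothetical stand-in for an expensive model call."""
    stats["model_calls"] += 1
    return prompt.upper()  # placeholder "output", not real model behavior

@lru_cache(maxsize=1024)
def cached_inference(prompt: str) -> str:
    """Return the stored response when the exact prompt was seen before."""
    return run_model(prompt)

# Four requests, but only two distinct prompts reach the model.
for prompt in ["hello", "hello", "world", "hello"]:
    cached_inference(prompt)
print(stats["model_calls"])  # 2
```

An exact-match cache only helps when identical prompts repeat, so the savings depend heavily on the workload.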
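For AI agents, inference latency matters, because an agent often makes several model calls in sequence before it acts. A financial agent that takes 30 seconds to decide on a trade might miss the opportunity. Techniques such as speculative decoding and model distillation help reduce response time.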
FREQUENTLY ASKED QUESTIONS
How much does inference cost?
GPT-4 class models cost roughly $10-30 per million tokens through hosted APIs, though prices vary by provider and change often. Smaller models can be self-hosted for much less. For agents making many calls, inference costs become a significant expense.
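To see how per-token pricing adds up for an agent, here is a rough sketch of the arithmetic. The prices, token counts, and call volume are illustrative assumptions chosen from within the range above, not quoted rates from any provider.

```python
def inference_cost_usd(input_tokens: int, output_tokens: int,
                       input_price_per_m: float,
                       output_price_per_m: float) -> float:
    """Cost of one call, given prices in USD per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Assumed for illustration: $10 per million input tokens, $30 per million
# output tokens, and 2,000 input plus 500 output tokens per call.
per_call = inference_cost_usd(2_000, 500, 10.0, 30.0)

# Hypothetical agent workload: 10,000 calls per day.
daily = per_call * 10_000

print(f"per call: ${per_call:.4f}, per day: ${daily:.2f}")
# per call: $0.0350, per day: $350.00
```

A few cents per call looks negligible in isolation, which is why it helps to multiply by the expected call volume before choosing a model.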
What's the difference between inference and training?
Training adjusts model weights using large datasets. Inference uses fixed weights to process new inputs. Put loosely, training writes to the weights and inference only reads them.
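The distinction can be shown with a toy one-weight model. This is a minimal sketch, assuming a linear model y = w * x trained by plain gradient descent on squared error. Real models have billions of weights, but the split between updating and reading them is the same.

```python
def predict(w: float, x: float) -> float:
    """Inference: apply a fixed weight to a new input. w is only read."""
    return w * x

def train_step(w: float, x: float, y: float, lr: float = 0.1) -> float:
    """Training: one gradient-descent step on the error (w*x - y)**2."""
    grad = 2 * (w * x - y) * x
    return w - lr * grad  # returns an updated weight

# Training: repeatedly adjust w to fit toy data generated by y = 3x.
data = [(1.0, 3.0), (2.0, 6.0)]
w = 0.0
for _ in range(50):
    for x, y in data:
        w = train_step(w, x, y)

# Inference: w is now fixed, and calling predict never changes it.
prediction = predict(w, 4.0)
print(round(w, 3), round(prediction, 3))
```

After training, `w` settles very close to 3, so the prediction for the unseen input 4.0 comes out close to 12.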
Can inference be done on-device?
Yes, for smaller models. Quantized 7B-13B parameter models run on modern laptops, and the smaller end of that range can run on recent high-end phones. Frontier models still require cloud GPUs.
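A rough way to see why quantization makes on-device inference feasible is to estimate the memory needed for the weights alone: parameter count times bits per weight. The sketch below is a back-of-the-envelope calculation under that assumption. It ignores activations, the key/value cache, and runtime overhead, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for params in (7e9, 13e9):
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{params / 1e9:.0f}B @ {bits}-bit: {size:.1f} GB")
# 7B @ 16-bit: 14.0 GB
# 7B @ 8-bit: 7.0 GB
# 7B @ 4-bit: 3.5 GB
# 13B @ 16-bit: 26.0 GB
# 13B @ 8-bit: 13.0 GB
# 13B @ 4-bit: 6.5 GB
```

At 4 bits per weight, a 7B model needs about a quarter of the memory it needs at 16 bits, which brings it within reach of consumer hardware.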