What is Inference?

Inference is the process of running a trained AI model on new inputs to generate outputs. It is the production phase, where a model serves real requests, as opposed to training, where the model learns.

WHY IT MATTERS

Training is learning; inference is doing. When you send a prompt to an LLM and get a response, that's inference. The model applies its learned patterns to your specific input.
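
To make the distinction concrete, here is a minimal sketch using a toy one-weight model rather than an LLM. The data, learning rate, and function names are assumptions chosen for illustration. `train` changes the weight at every step, while `infer` only reads it.

```python
# Toy illustration: "training" fits a weight, "inference" reuses it unchanged.

def train(examples, steps=200, lr=0.01):
    """Fit y ~ w * x by gradient descent; the weight w changes at every step."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
        w -= lr * grad
    return w

def infer(w, x):
    """Apply the fixed, already-learned weight to a new input."""
    return w * x

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical data following y = 2x
w = train(examples)                # expensive step, done once
print(round(w, 3))                 # learned weight, close to 2
print(round(infer(w, 10.0), 2))    # cheap step, repeated for every new input
```

A real model has billions of weights instead of one, but the split is the same: the training loop runs once, and the inference function runs for every request.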

Inference is where AI economics play out. Training happens once at enormous cost; inference happens millions of times. Optimizing inference through quantization, caching, and batching directly impacts cost and latency.
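
The arithmetic behind that claim can be sketched as follows. Every number here is an illustrative assumption, not a measurement: the price sits at the low end of the range quoted in the FAQ, the request volume and token counts are made up, and a cache hit is assumed to cost nothing.

```python
def inference_cost_usd(requests, tokens_per_request, usd_per_million_tokens,
                       cache_hit_rate=0.0):
    """Estimate total spend; cached requests are assumed to cost nothing."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_requests = requests * (1.0 - cache_hit_rate)
    total_tokens = billable_requests * tokens_per_request
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical workload: one million requests of 1,000 tokens each.
baseline = inference_cost_usd(1_000_000, 1_000, 10.0)
with_cache = inference_cost_usd(1_000_000, 1_000, 10.0, cache_hit_rate=0.3)
print(f"baseline: ${baseline:,.0f}, with 30% cache hits: ${with_cache:,.0f}")
```

Because the cost scales linearly with request count, a small per-request saving compounds across the whole workload.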

For AI agents, inference latency matters. A financial agent that takes 30 seconds to decide on a trade might miss the opportunity. Techniques such as speculative decoding and model distillation help reduce response time.

FREQUENTLY ASKED QUESTIONS

How much does inference cost?
GPT-4 class models cost roughly $10-30 per million tokens, depending on the provider and on whether the tokens are input or output. Smaller models can be self-hosted for much less. For agents making many calls, inference is a significant expense.
What's the difference between inference and training?
Training adjusts model weights using large datasets. Inference uses fixed weights to process new inputs. In other words, training writes the weights and inference only reads them.
Can inference be done on-device?
Yes, for smaller models. Quantized 7B-13B parameter models can run on modern laptops and, at the smaller end, on recent phones. Frontier models require cloud GPUs.
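
As a rough check on the on-device answer, the memory needed for model weights can be estimated as parameters times bits per weight. This is a back-of-the-envelope sketch: it counts weights only, and it ignores activations, the KV cache, and any per-block overhead that a particular quantization format adds.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9

for params in (7e9, 13e9):
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{params / 1e9:.0f}B @ {bits}-bit: {size:.1f} GB")
```

By this estimate a 7B model drops from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is why quantized models in this size range fit in the memory of consumer hardware.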
