What is Inference?
Inference is the process of running a trained AI model on new inputs to generate outputs — the production phase where models serve real requests, as opposed to training, where models learn.
WHY IT MATTERS
Training is learning; inference is doing. When you send a prompt to an LLM and get a response, that's inference. The model applies its learned patterns to your specific input.
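At its core, inference is just a forward pass: fixed, already-learned parameters applied to a fresh input. The following is a minimal sketch using a hypothetical tiny linear classifier; the weights stand in for what training would have produced.

```python
def infer(weights, bias, features):
    """Forward pass of a tiny linear model: no learning happens here."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "positive" if score > 0 else "negative"

# Parameters are frozen at inference time; only the input changes per request.
LEARNED_WEIGHTS = [0.8, -0.5, 0.3]  # hypothetical trained values
LEARNED_BIAS = -0.1

result = infer(LEARNED_WEIGHTS, LEARNED_BIAS, [1.0, 0.2, 0.5])
print(result)  # → positive
```

Every request repeats this same computation with different inputs, which is why per-request cost dominates at scale.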
Inference is where AI economics play out. Training happens once at enormous cost; inference happens millions of times over a model's lifetime. Optimizing inference through techniques such as quantization, caching, and batching directly reduces serving cost and latency.
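Of those optimizations, quantization is the easiest to sketch: store weights as low-precision integers plus a single scale factor, trading a little accuracy for a large memory saving. The weights below are hypothetical stand-ins; real models hold billions of them.

```python
def quantize(weights, levels=127):
    """Map float weights to signed 8-bit ints plus one float scale."""
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale

def dequantize(qweights, scale):
    """Recover approximate floats from the int8 representation."""
    return [q * scale for q in qweights]

# Hypothetical trained weights (4 bytes each as float32).
weights = [0.83, -0.42, 0.07, 1.27, -1.05]

q, scale = quantize(weights)
approx = dequantize(q, scale)

# Each weight now fits in 1 byte instead of 4, at a small precision cost.
max_err = max(abs(w - a) for w, a in zip(weights, approx))
print(q, max_err)
```

The rounding error per weight is bounded by half the scale, which is why 8-bit quantization usually costs little accuracy relative to the 4x memory reduction.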
For AI agents, inference latency matters directly. A financial agent that takes 30 seconds to decide on a trade might miss the opportunity. Techniques such as speculative decoding and model distillation help reduce that latency.
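The idea behind speculative decoding can be sketched in a few lines: a cheap draft model proposes several tokens at once, and the expensive target model verifies them in a single pass, accepting the longest agreeing prefix. Both "models" below are hypothetical toy functions over integer tokens, not real LLMs.

```python
def target_next(context):
    """Expensive target model: the ground-truth next token (last + 1)."""
    return context[-1] + 1

def draft_next(context):
    """Cheap draft model: usually agrees, but errs on multiples of 5."""
    nxt = context[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_decode(context, n_tokens, k=4):
    """Generate n_tokens, checking k draft proposals per target pass."""
    out = list(context)
    target_passes = 0
    while len(out) - len(context) < n_tokens:
        # Draft model proposes k tokens cheaply, one after another.
        ctx = list(out)
        proposals = []
        for _ in range(k):
            t = draft_next(ctx)
            proposals.append(t)
            ctx.append(t)
        # One expensive target pass verifies the proposals in order.
        target_passes += 1
        for t in proposals:
            if t == target_next(out):
                out.append(t)                  # accept agreeing token
            else:
                out.append(target_next(out))   # reject; keep target's token
                break
    return out[len(context):][:n_tokens], target_passes

tokens, passes = speculative_decode([1], n_tokens=10)
print(tokens, passes)
```

Because most draft proposals are accepted, the target model runs far fewer passes than the one-per-token it would need alone; output quality is unchanged since every emitted token is the target model's own choice.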