// GLOSSARY -- AGENTIC AI

What is Inference?

Updated Feb 19, 2026

Inference is the process of running a trained AI model on new inputs to generate outputs. It is the production phase, where models serve real requests, as opposed to training, where models learn.

WHY IT MATTERS

Training is learning; inference is doing. When you send a prompt to an LLM and get a response, that's inference. The model applies its learned patterns to your specific input.

Inference is where AI economics play out. Training happens once at enormous cost; inference happens millions of times. Optimizing inference through quantization, caching, and batching directly impacts cost and latency.
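Of the optimizations named above, caching is the simplest to illustrate. The following is a minimal sketch, not a production design: `run_model` is a hypothetical stand-in for an expensive model call, and the cache only helps when the exact same prompt is repeated.

```python
from functools import lru_cache

call_count = 0  # counts how often the (pretend) model actually runs


def run_model(prompt: str) -> str:
    """Hypothetical stand-in for an expensive model call."""
    global call_count
    call_count += 1
    return prompt.upper()


@lru_cache(maxsize=1024)
def cached_inference(prompt: str) -> str:
    # Identical prompts are answered from memory instead of re-running the model.
    return run_model(prompt)


# Three requests, but only two distinct prompts reach the model.
answers = [cached_inference(p) for p in ["hello", "hello", "world"]]
```

Exact-match caching like this assumes deterministic outputs are acceptable; sampled or personalized responses would need a different strategy.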

For AI agents, inference latency matters. A financial agent that takes 30 seconds to decide on a trade might miss the opportunity. Techniques such as speculative decoding and model distillation help reduce latency.

FREQUENTLY ASKED QUESTIONS

How much does inference cost?

GPT-4 class models cost roughly $10-30 per million tokens, depending on the provider and on whether the tokens are input or output. Smaller models can be self-hosted for much less. For agents that make many calls per task, inference is a significant expense.
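To see how per-token pricing adds up for an agent, here is a rough back-of-the-envelope sketch. The workload figures (10,000 calls at 2,000 tokens each) are hypothetical and chosen only for illustration; the two prices are the ends of the $10-30 range quoted above.

```python
def inference_cost_usd(calls: int, tokens_per_call: int,
                       usd_per_million_tokens: float) -> float:
    """Estimate spend for a workload at a flat per-million-token price."""
    total_tokens = calls * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical workload: 10,000 calls of 2,000 tokens each = 20 million tokens.
low = inference_cost_usd(10_000, 2_000, 10.0)   # 200.0 USD at $10 per million
high = inference_cost_usd(10_000, 2_000, 30.0)  # 600.0 USD at $30 per million
```

A flat price is a simplification; real pricing usually distinguishes input from output tokens, so treat the result as an order-of-magnitude estimate.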

What's the difference between inference and training?

Training adjusts model weights using large datasets. Inference uses fixed weights to process new inputs. Put another way, training writes the weights; inference only reads them.
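The distinction can be shown with a toy one-weight model. This is an illustrative sketch only: real models have billions of weights and far more elaborate training loops, and the data, learning rate, and function names here are made up for the example.

```python
def predict(weight: float, x: float) -> float:
    # Inference: the weight is read, never modified.
    return weight * x


def train(data, weight: float = 0.0, lr: float = 0.05, epochs: int = 200) -> float:
    # Training: the weight is repeatedly updated to reduce squared error.
    for _ in range(epochs):
        for x, y in data:
            error = predict(weight, x) - y
            weight -= lr * 2 * error * x  # gradient of (weight * x - y) ** 2
    return weight


# Toy dataset following y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = train(data)            # training: writes the weight (converges near 2.0)
y_new = predict(w, 10.0)   # inference: applies the fixed weight to a new input
```

Once `train` returns, any number of `predict` calls can reuse the same weight without changing it.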

Can inference be done on-device?

Yes, for smaller models. Quantized models in the 7B-13B parameter range run on modern laptops, and the smaller ones on high-end phones. Frontier models require cloud GPUs.
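A rough way to see why quantization makes on-device inference feasible is to estimate the memory needed just to hold the weights. The sketch below assumes 16-bit weights as the unquantized baseline and 4-bit weights as the quantized case; both bit widths are illustrative choices, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Weight footprint for the model sizes mentioned above, at 16-bit and 4-bit.
footprints = {
    (params, bits): weight_memory_gb(params, bits)
    for params in (7, 13)
    for bits in (16, 4)
}
# A 7B model drops from about 14 GB at 16-bit to about 3.5 GB at 4-bit.
```

Actual requirements are higher than these figures, so a device needs headroom beyond the weight footprint.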
