What is Inference?


Inference is the process of running a trained AI model on new inputs to generate outputs — the production phase where models serve real requests, as opposed to training where models learn.

WHY IT MATTERS

Training is learning; inference is doing. When you send a prompt to an LLM and get a response, that's inference. The model applies its learned patterns to your specific input.
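The split between the two phases can be sketched with a toy model. The example below is only an illustration under stated assumptions: a one-variable linear model stands in for an LLM, and the names `train` and `infer` are hypothetical helpers, not part of any library.

```python
def train(xs, ys, steps=5000, lr=0.01):
    """Training: repeatedly adjust the weights (w, b) to fit known examples."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def infer(weights, x):
    """Inference: apply fixed weights to a new input; nothing is updated."""
    w, b = weights
    return w * x + b


# Toy data generated from y = 2x + 1 (an assumption for this sketch).
weights = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# The expensive loop ran once; each call to infer() is cheap and read-only.
prediction = infer(weights, 10.0)
print(round(prediction, 2))
```

The training loop runs once and changes the weights; every later call to `infer` reuses them unchanged, which is the same relationship a deployed LLM has to its training run.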

Inference is where AI economics play out. Training happens once at enormous cost; inference happens millions of times. Optimizing inference through quantization, caching, and batching directly impacts cost and latency.
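As a rough illustration of one of these optimizations, the sketch below shows symmetric 8-bit quantization of a handful of weights. It is a minimal example, assuming a single shared scale per weight list; production systems typically use per-channel or per-group scales, and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_weights]


# Example weights are made up for this sketch.
weights = [0.42, -1.27, 0.05, 0.9]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Each weight is now stored in 8 bits instead of 16 or 32,
# at the price of a small rounding error (at most scale / 2).
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, max_error)
```

Smaller weights mean less memory traffic per generated token, which is one reason quantization tends to lower both cost and latency.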

For AI agents, inference latency matters. A financial agent that takes 30 seconds to decide on a trade might miss the opportunity. Techniques such as speculative decoding and model distillation help reduce response time.

FREQUENTLY ASKED QUESTIONS

How much does inference cost?
GPT-4-class models cost roughly $10-30 per million tokens. Smaller models can be self-hosted for much less. For agents making many calls, inference is a significant expense.

What's the difference between inference and training?
Training adjusts model weights using large datasets. Inference uses fixed weights to process new inputs. Training is write; inference is read.

Can inference be done on-device?
Yes, for smaller models. Quantized 7B-13B parameter models run on modern laptops and phones. Frontier models require cloud GPUs.
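
The cost and on-device answers above come down to simple arithmetic, sketched here. The price range is the $10-30 per million tokens quoted above; the call volume, tokens per call, and bit widths are hypothetical figures chosen only for illustration, and the memory estimate covers weights only (no activations or KV cache).

```python
def inference_cost_usd(tokens, price_per_million_usd):
    """Cost of processing `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Hypothetical agent workload: 10,000 calls per day, 2,000 tokens each.
daily_tokens = 10_000 * 2_000

low = inference_cost_usd(daily_tokens, 10)   # low end of the quoted range
high = inference_cost_usd(daily_tokens, 30)  # high end of the quoted range
print(f"Daily cost: ${low:.0f} to ${high:.0f}")

# A 7B-parameter model at 16-bit versus 4-bit weights (assumed widths).
fp16_gb = weight_memory_gb(7, 16)
int4_gb = weight_memory_gb(7, 4)
print(f"7B weights: {fp16_gb:.1f} GB at 16-bit, {int4_gb:.1f} GB at 4-bit")
```

Under these assumptions the 4-bit weights fit in a fraction of the memory the 16-bit version needs, which is why quantized models in this size range are practical on consumer hardware.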
