What is Tokenization (AI)?
Tokenization in AI refers to breaking text into smaller units (tokens) that a language model can process — typically subword pieces that balance vocabulary size with representation efficiency.
WHY IT MATTERS
Before an LLM can process text, the text must be tokenized — converted from characters into numeric token IDs. Most models use subword tokenization (e.g., byte-pair encoding, as implemented in libraries like SentencePiece), which splits text into frequently occurring subword units so common words stay whole while rare words break into pieces.
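The subword idea can be illustrated with a minimal sketch. Note the vocabulary below is hand-picked for demonstration — real tokenizers learn their vocabularies from corpus statistics, and production tokenizers use learned merge rules rather than this simple greedy longest-match:

```python
# Toy subword tokenizer: greedy longest-match against a fixed vocabulary.
# VOCAB is illustrative only; real BPE vocabularies are learned from data.
VOCAB = {"un": 0, "break": 1, "able": 2, "token": 3, "iz": 4, "ation": 5}

def tokenize(text: str) -> list[int]:
    """Match the longest vocabulary entry at each position, left to right."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

print(tokenize("unbreakable"))    # "un" + "break" + "able" -> [0, 1, 2]
print(tokenize("tokenization"))   # "token" + "iz" + "ation" -> [3, 4, 5]
```

This is why a word the model has never seen can still be represented: it decomposes into known pieces, each mapped to an ID the model was trained on.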
Tokenization affects cost (you pay per token), context window usage, and model capabilities. Different tokenizers handle numbers, code, and non-English text differently.
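The cost and context-window impact is simple arithmetic. The price and context size below are placeholder assumptions for illustration, not real rates for any specific model:

```python
# Back-of-envelope token economics. Both constants are hypothetical
# placeholders -- check your provider's pricing and model specs.
PRICE_PER_1K_INPUT_TOKENS = 0.0005  # USD, assumed rate
CONTEXT_WINDOW = 8_192              # tokens, assumed model limit

def estimate(prompt_tokens: int, requests_per_day: int) -> dict:
    """Estimate daily spend and how much of the context a prompt consumes."""
    daily_cost = prompt_tokens * requests_per_day / 1000 * PRICE_PER_1K_INPUT_TOKENS
    return {
        "daily_usd": round(daily_cost, 4),
        "context_used_pct": round(100 * prompt_tokens / CONTEXT_WINDOW, 1),
    }

print(estimate(prompt_tokens=2_000, requests_per_day=10_000))
# -> {'daily_usd': 10.0, 'context_used_pct': 24.4}
```

Because tokenizers split numbers, code, and non-English text into more pieces than plain English, the same character count can cost noticeably more tokens — which is why measuring actual token counts matters for budgeting.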
Not to be confused with crypto tokenization (creating digital tokens on a blockchain): tokenization in AI is a low-level technical detail, but one developers should understand for cost optimization and making the most of a model's context window.