Language models never see letters or words — they see tokens. Before your prompt reaches the model, a tokenizer chops it into chunks drawn from a fixed vocabulary and replaces each chunk with a number (a token ID). The demo on this page runs cl100k_base, the actual byte-pair-encoding (BPE) tokenizer used by GPT-4, directly in your browser: everything you type is tokenized live, and you can even edit the raw token IDs and watch them decode back into text.
Common words like “the” get a single token, while rarer words are split into sub-word pieces. That is why “strawberry” breaks into several tokens, and it is one reason language models struggle to count the letter “r” in it: the model receives one ID per token and never directly sees the letters inside each piece.
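To make the sub-word splitting concrete, here is a minimal sketch of how a BPE tokenizer might turn a word into pieces and then into IDs. The merge table, the vocabulary, and the function names (`bpe_split`, `encode`, `decode`) are all invented for illustration. They are not the real cl100k_base merges, so the pieces and IDs will differ from what the demo shows.

```python
# Toy BPE sketch. MERGES is a hypothetical, hand-written table,
# NOT the real cl100k_base merge list. Earlier entries have higher priority.
MERGES = [("t", "h"), ("th", "e"), ("r", "r"), ("s", "t"),
          ("a", "w"), ("st", "r"), ("b", "e")]
RANK = {pair: rank for rank, pair in enumerate(MERGES)}

# Toy vocabulary: single characters first, then each merged chunk in merge order.
ALPHABET = sorted(set("thestrawberry"))
VOCAB = {tok: i for i, tok in enumerate(ALPHABET + [a + b for a, b in MERGES])}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}


def bpe_split(word):
    """Start from single characters and repeatedly apply the best-ranked merge."""
    pieces = list(word)
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda p: RANK.get(p, float("inf")))
        if best not in RANK:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


def encode(word):
    """Map each piece to its toy ID (raises KeyError outside the toy alphabet)."""
    return [VOCAB[piece] for piece in bpe_split(word)]


def decode(ids):
    """Concatenate the pieces behind each ID to recover the text."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


print(bpe_split("the"))              # ['the']
print(bpe_split("strawberry"))       # ['str', 'aw', 'be', 'rr', 'y']
print(encode("strawberry"))          # [14, 13, 15, 11, 8]
print(decode(encode("strawberry")))  # strawberry
```

In this toy run the model-facing input for “strawberry” is just five integers. None of them stands for a lone “r”, which hints at why letter counting is hard.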
Tokens are the basic unit of accounting for language models: API pricing is per token, context windows are measured in tokens, and generation speed is reported in tokens per second. Tokenization also has an equity dimension: the same sentence in Arabic or Hindi can cost several times more tokens than in English, because tokenizer vocabularies are trained mostly on English text and so contain fewer merged chunks for other scripts. Try typing مرحبا in the demo and compare the token count to “hello.”
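Part of the gap is visible before any merging happens. Tokenizers like cl100k_base operate on UTF-8 bytes, and Arabic letters take two bytes each where ASCII letters take one. The sketch below only counts characters and bytes; `utf8_profile` is a hypothetical helper name, and the byte count is a rough starting point rather than the actual token count, which depends on the merges the tokenizer has learned.

```python
def utf8_profile(text):
    """Return the number of Unicode characters and UTF-8 bytes in text."""
    return {"characters": len(text), "utf8_bytes": len(text.encode("utf-8"))}


for sample in ("hello", "مرحبا"):
    print(sample, utf8_profile(sample))

# hello {'characters': 5, 'utf8_bytes': 5}
# مرحبا {'characters': 5, 'utf8_bytes': 10}
```

Both words have five characters, but the Arabic one starts from twice as many bytes, so it needs more learned merges to reach the same token count.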
Now you know what AI actually sees.
howaiworks.io is free and open source (GitHub), built by Matt Feroz.