Tokens & tokenization
Why use tokens instead of whole words?
Whole-word vocabularies break on rare words, new words, and other languages. Subword tokens let a model spell out anything from familiar pieces, so it can handle words it has never seen before.
Why not just give the model a dictionary and treat each word as one unit? Because human language has far too many words, and new ones appear all the time: names, slang, product codes, typos, and other languages. A fixed word list would always be missing something, and any missing word would have to be replaced with a generic “unknown” placeholder, losing its meaning.
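To make the problem concrete, here is a minimal sketch of a whole-word tokenizer. The four-word vocabulary, the `<unk>` placeholder name, and the `word_tokenize` helper are all illustrative assumptions, not any real model's setup.

```python
# Toy word-level vocabulary; the words and the "<unk>" name are illustrative assumptions.
WORD_VOCAB = {"the", "model", "reads", "text"}
UNK = "<unk>"

def word_tokenize(sentence: str) -> list[str]:
    """Split on whitespace and replace every out-of-vocabulary word with UNK."""
    return [w if w in WORD_VOCAB else UNK for w in sentence.lower().split()]

print(word_tokenize("The model reads tokenizers"))
# ['the', 'model', 'reads', '<unk>']
```

The word "tokenizers" is gone from the output: whatever it meant, the model only sees a placeholder.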
Subword tokens avoid this trap. Because the vocabulary includes short pieces, a model can spell out a word it has never seen from familiar parts. Anthropic describes the trade-off directly: “Larger tokens enable data efficiency… while smaller tokens allow a model to handle uncommon or never-before-seen words.”[2] Common words stay whole and efficient; unusual ones are assembled from their pieces.
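The idea of assembling an unseen word from familiar pieces can be sketched with a toy vocabulary and a greedy longest-match rule. This is a simplification chosen for readability: real BPE tokenizers apply a learned sequence of merges rather than longest-match lookup, and the vocabulary and `subword_tokenize` helper below are hypothetical.

```python
# Toy subword vocabulary (an assumption for illustration): a few whole words,
# a few multi-letter pieces, and every lowercase letter as a last-resort fallback.
SUBWORD_VOCAB = {"token", "the", "un", "believ", "able", "iz", "er", "s"}
SUBWORD_VOCAB |= set("abcdefghijklmnopqrstuvwxyz")

def subword_tokenize(word: str) -> list[str]:
    """Greedy longest-match segmentation of a lowercase a-z word."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORD_VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Only reachable for characters outside a-z in this toy setup.
            raise ValueError(f"no piece covers {word[i]!r}")
    return pieces

print(subword_tokenize("the"))           # ['the']
print(subword_tokenize("tokenizers"))    # ['token', 'iz', 'er', 's']
print(subword_tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

The common word "the" stays a single token, while "tokenizers", which is not in the vocabulary as a whole, is still represented in full from four known pieces.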
Many models push this further with byte-level BPE, which operates on the bytes of the encoded text rather than on letters. Because a byte can take only 256 values, the base vocabulary stays small while “ensuring every possible character can be represented without converting to unknown tokens.”[1] The payoff is coverage: with a manageable vocabulary, the model can represent essentially any string, in any language or script.
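The coverage claim can be checked with Python's built-in UTF-8 encoding. The sketch below shows only the base layer of a byte-level scheme, where each of the 256 possible byte values is a token; a real byte-level BPE tokenizer would then merge frequent byte sequences into larger tokens. The helper names `byte_tokens` and `from_byte_tokens` are assumptions for illustration.

```python
def byte_tokens(text: str) -> list[int]:
    """Map text to base token ids in 0..255 via its UTF-8 bytes."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids: list[int]) -> str:
    """Invert byte_tokens: rebuild the original string from byte ids."""
    return bytes(ids).decode("utf-8")

for s in ["cat", "naïve", "日本語", "🙂"]:
    ids = byte_tokens(s)
    assert all(0 <= i < 256 for i in ids)  # never outside the base vocabulary
    assert from_byte_tokens(ids) == s      # nothing is lost in the round trip
    print(s, len(ids))
# cat 3
# naïve 6
# 日本語 9
# 🙂 4
```

Every string round-trips without an unknown token. The cost is length: text outside basic Latin takes several bytes per character, which is why merged, larger tokens still matter for efficiency.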
References
- [1] Byte-Pair Encoding tokenization, Hugging Face
- [2] Glossary, Anthropic