Tokens & tokenization
What is a token?
A token is the small unit of text a language model reads and writes. It can be a whole word, part of a word, a single character, or even a byte, and your text is converted into tokens before the model sees it.
A language model does not read text the way you do. Before the model sees your sentence, the text is split into small pieces called tokens, which are the basic units it works with. As Anthropic’s documentation puts it, tokens “are the smallest individual units of a language model, and can correspond to words, subwords, characters, or even bytes.”[1]
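To make those granularities concrete, here is a small Python sketch that cuts one string into words, characters, and bytes. It uses only built-in string operations, not a real tokenizer. The sample string and the subword split of “tokenization” are hand-picked for illustration and do not come from any actual model vocabulary.

```python
# Illustration only: plain string operations, not a real tokenizer.
text = "na\u00efve cat"  # "naïve cat"; the ï is one character but two UTF-8 bytes

words = text.split()                      # word-level pieces
chars = list(text)                        # character-level pieces (the space is included)
byte_values = list(text.encode("utf-8"))  # byte-level pieces

# Hypothetical subword split, chosen by hand; a real tokenizer may split differently.
subwords = ["token", "ization"]

print(len(words), "words:", words)
print(len(chars), "characters")
print(len(byte_values), "bytes")
print("subwords rejoin to:", "".join(subwords))
```

The same text comes out as 2 words, 9 characters, or 10 bytes, so any count depends on which unit is used.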
So a token is sometimes a whole word, but often it is a fragment. The word “cat” might be one token, while “tokenization” might be split into several. Spaces and punctuation count too. As a rough guide for English, one token averages about 3.5 characters, though the exact number varies with the language being used.[1]
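The 3.5-characters-per-token guide can be turned into a quick estimate. The sketch below is a minimal heuristic, assuming English text and the average quoted above. The function name `estimate_tokens` and the constant `CHARS_PER_TOKEN` are made up for this example and are not part of any library.

```python
import math

# Rough English average quoted above; an approximation, not a tokenizer.
CHARS_PER_TOKEN = 3.5

def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a ballpark token count based on character length alone."""
    if not text:
        return 0
    # Round up so that any non-empty text counts as at least one token.
    return math.ceil(len(text) / chars_per_token)

sample = "A language model does not read text the way you do."
print(len(sample), "characters, roughly", estimate_tokens(sample), "tokens")
```

Treat the result as a ballpark figure. Only the model's own tokenizer gives an exact count, and the ratio shifts for other languages.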
You normally never see tokens: they stay hidden while you type a prompt and read a reply. They become “relevant when examining the exact inputs and outputs of a language model,”[1] which is to say when you care about how long your text is, how much it costs, or how much will fit. The next pages cover how text gets turned into tokens and why the model uses these pieces instead of plain words.
References
- [1] Glossary, Anthropic