How text becomes tokens

Most models use byte pair encoding (BPE). It starts from single characters and repeatedly merges the most frequent neighboring pairs into longer pieces, building a fixed vocabulary. The result is reversible and works on any text.

The most common method for splitting text is byte pair encoding, or BPE. The idea is older than modern AI; it began as a way to compress data. Training a BPE tokenizer starts from a base vocabulary of the individual characters in a body of text. The algorithm then “iteratively expands this vocabulary by identifying and merging the most frequent consecutive token pairs until reaching a target vocabulary size.”^[1]
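The training loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular library: the function name `train_bpe`, the toy word list, and the tie-breaking rule (the first pair encountered wins) are all assumptions made for the example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of single characters; count word frequencies.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for word in corpus for ch in word}  # base vocabulary
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # assumption: first pair seen wins ties
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite the corpus with the chosen pair fused into one token.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, vocab

# Hypothetical toy corpus; real tokenizers train on far larger text.
merges, vocab = train_bpe(["hug", "hug", "pug", "pun", "bun", "hugs"], num_merges=3)
print(merges)  # [('u', 'g'), ('h', 'ug'), ('u', 'n')]
```

Here the target vocabulary size is expressed as a number of merges: each merge adds exactly one new token to the base vocabulary.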

Each merge is a rule. Early on the rules join two characters; as training continues, the merged pieces grow into longer subwords.^[1] Common strings like “the” or “ing” end up as single tokens because they appear so often. Rare strings stay broken into smaller pieces. Once the vocabulary is fixed, new text is tokenized by applying those same learned merge rules in order.^[1]
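Tokenizing new text by replaying the merge rules in order might look like the following sketch. The helper name `apply_merges` and the three rules are hypothetical, chosen only to show how frequent strings collapse into one token while rarer ones stay in pieces.

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying merge rules in the order they were learned."""
    tokens = list(word)  # start from single characters
    for left, right in merges:
        out, i = [], 0
        while i < len(tokens):
            # Fuse the pair wherever it appears, scanning left to right.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical rules, as if learned from a small corpus (earliest rule first).
rules = [("u", "g"), ("h", "ug"), ("u", "n")]

print(apply_merges("hugs", rules))  # ['hug', 's']
print(apply_merges("bun", rules))   # ['b', 'un']
print(apply_merges("pugs", rules))  # ['p', 'ug', 's']
```

The order matters: the rule joining "h" and "ug" can only fire because the earlier rule has already produced "ug".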

Three properties make this practical. BPE “is reversible and lossless, so you can convert tokens back into the original text,” it “works on arbitrary text, even text that is not in the tokeniser’s training data,” and it “compresses the text: the token sequence is shorter than the bytes corresponding to the original text.”^[2] That compression is why a text’s token count is usually lower than its character count.
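These properties can be checked on a toy example. The sketch below assumes a byte-level variant, in which the base vocabulary is the 256 possible byte values, so any UTF-8 text can be encoded. The merge table, the ids 256 and 257, and the function names are hypothetical and do not come from any real tokenizer.

```python
def encode(text, merges):
    """Turn text into token ids: raw UTF-8 bytes first, then merges in order."""
    ids = list(text.encode("utf-8"))  # ids 0..255 are the base vocabulary
    for pair, new_id in merges.items():  # dicts preserve insertion order
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, merges):
    """Expand each id back to its bytes, then decode the bytes as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        table[new_id] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8")

# Hypothetical merge table: "t" + "h" -> 256, then 256 + "e" -> 257.
merges = {(ord("t"), ord("h")): 256, (256, ord("e")): 257}

text = "the other théâtre 🙂"  # includes characters the merges never saw
ids = encode(text, merges)

assert decode(ids, merges) == text           # reversible and lossless
assert len(ids) < len(text.encode("utf-8"))  # fewer tokens than bytes
```

Characters that no merge covers, such as the accented letters and the emoji, simply remain as their individual bytes, which is why nothing falls outside the vocabulary.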

References

[1] Byte-Pair Encoding tokenization, Hugging Face
[2] tiktoken README, OpenAI
