Toward a Theory of Tokenization in LLMs

June 10, 2025

links-and-recs

[2404.08335] Toward a Theory of Tokenization in LLMs

Tokenization has always struck me as one of the odder aspects of natural language deep learning. Despite the extensive end-to-end learning processes we typically use, tokenization initially involves creating a dictionary of optimal sub-word segments from your dataset. One of the appealing concepts in the Byte Latent Transformers paper is the potential to learn tokenization dynamically, recognizing that tokenizers solve deeper problems than merely providing a fixed vocabulary.

This paper addresses tokenization from a theoretical perspective by modeling sequences as kth-order Markov processes, where the likelihood of each symbol depends on the k symbols that precede it, a simplified stand-in for natural language. The parameter k is the order of the process, loosely analogous to how much context a model needs to look at. Key findings include:

Training without tokenization leads models to effectively behave as unigram predictors, significantly limiting performance.
Using a well-designed tokenizer (e.g., Byte Pair Encoding – BPE) enables models to achieve nearly optimal performance in capturing sequence dependencies.
Increasing the tokenizer’s dictionary size improves the model’s performance, moving it closer to the ideal probability distribution.
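
To get a feel for the gap between a unigram predictor and the best achievable loss, here is a minimal sketch. It assumes a toy two-state chain (k = 1) that flips state with probability `p_switch`; that source, and the helper names below, are my own choices for illustration, not necessarily the paper's exact setup. A unigram model can only learn the symbol frequencies (about 1 bit per character here), while a model that conditions on the previous symbol reaches the entropy rate of the chain.

```python
import math
import random
from collections import Counter


def sample_switching_chain(n, p_switch, seed=0):
    """Sample n symbols from a 2-state chain that flips with prob. p_switch."""
    rng = random.Random(seed)
    state = rng.randint(0, 1)
    out = []
    for _ in range(n):
        out.append(state)
        if rng.random() < p_switch:
            state = 1 - state
    return out


def binary_entropy(p):
    """Entropy in bits of a coin with bias p; the chain's entropy rate."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def unigram_loss(seq):
    """Bits/char of the best model that ignores context entirely."""
    n = len(seq)
    counts = Counter(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def bigram_loss(seq):
    """Bits/char of the best model that sees one previous symbol."""
    pair_counts = Counter(zip(seq, seq[1:]))
    prev_counts = Counter(seq[:-1])
    n = len(seq) - 1
    return -sum(
        c / n * math.log2(c / prev_counts[prev])
        for (prev, _), c in pair_counts.items()
    )


seq = sample_switching_chain(200_000, p_switch=0.1)
print("unigram loss :", round(unigram_loss(seq), 3))
print("bigram loss  :", round(bigram_loss(seq), 3))
print("entropy rate :", round(binary_entropy(0.1), 3))
```

The two losses are empirical (fit and measured on the same sample), which is fine for a sequence this long but would be too optimistic on short data.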

Tokenizers which do a good job at learning patterns in the data and assigning these frequent patterns as tokens in the dictionary are compatible with an i.i.d. model over tokens.

This insight suggests that despite the complexity of natural language, a good tokenizer converts sequences into something approximating an independent and identically distributed (i.i.d.) format, which brings the modeling task for transformers closer to one they can solve.
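
The i.i.d. intuition can be checked on a toy example. The sketch below assumes a two-state chain that flips with probability 0.1 and a deliberately simplified BPE (greedy merges, learned and applied on the same sequence); both are illustrative assumptions of mine rather than the paper's construction. It fits a unigram model over the resulting tokens and reports the loss in bits per original character, which should fall from about 1 bit toward the chain's entropy rate as merges are added.

```python
import math
import random
from collections import Counter


def sample_switching_chain(n, p_switch, seed=0):
    """Sample n characters ('0'/'1') from a chain that flips w.p. p_switch."""
    rng = random.Random(seed)
    state = rng.randint(0, 1)
    out = []
    for _ in range(n):
        out.append(str(state))
        if rng.random() < p_switch:
            state = 1 - state
    return out


def merge_pair(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged = pair[0] + pair[1]
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def bpe_tokenize(chars, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair."""
    tokens = list(chars)
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, _ = pair_counts.most_common(1)[0]
        tokens = merge_pair(tokens, best_pair)
    return tokens


def unigram_bits_per_char(tokens, num_chars):
    """Loss of an i.i.d. (unigram) model over tokens, per original character."""
    n = len(tokens)
    counts = Counter(tokens)
    bits_per_token = -sum(c / n * math.log2(c / n) for c in counts.values())
    return bits_per_token * n / num_chars


chars = sample_switching_chain(50_000, p_switch=0.1)
for num_merges in (0, 2, 8, 16):
    tokens = bpe_tokenize(chars, num_merges)
    loss = unigram_bits_per_char(tokens, len(chars))
    print(f"{num_merges:2d} merges: {loss:.3f} bits/char")
```

With zero merges this is just the character-level unigram predictor; each merge lets the context-free model absorb a little more of the run structure into the vocabulary itself.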

While the paper does not explicitly explore the Byte Latent approach, I wonder if its entropy-driven dynamic token allocation might similarly achieve this i.i.d. simplification. In BLT the entropy model, trained separately, could dynamically transform inputs into a distribution that is more palatable for transformers.