Byte-Latent Transformers

· February 18, 2025

(Updated: Apr 22, 2026)

modelling

Who needs a tokenizer anyway!

[2412.09871] Byte Latent Transformer: Patches Scale Better Than Tokens

This paper, from back in December last year, presents an interesting approach to handling raw byte sequences in LLMs without relying on tokenization.

Vocab sizes for tokenizers have gone up over the last couple of years with attendant gains in usefulness, but this remains a particularly hand-tuned number in the training process. BLT proposes a method that processes raw UTF-8 byte sequences directly, leveraging a dynamic patching mechanism to group bytes into variable-length patches based on entropy.

Higher-entropy regions receive more attention and shorter patches, while lower-entropy regions can be processed more efficiently.

There are conceptually three levels of processing:

Local Encoder: A small transformer stack encodes raw byte sequences into higher level representations, which are then structured into patches.
Latent Global Transformer: A standard large transformer model operating on patch-level representations
Local Decoder: The encoded patches are decoded back into byte sequences, using a cross-attention mechanism to reconstruct text.

In the paper they show they can achieve parity in pretraining with a traditional tokenized approach in llama for similar parameter count, while being more robust and offering some inference time performance gains. The patching approach allows for allocating compute where needed most.

Retrofitting existing models

One of the ideas I found most interesting is starting with a traditionally pretrained model. The paper discusses using the main transformer layers from Llama and training the byte latent approach successfully.

I gave the approach a go with a simplified local encoder, entropy and patching approach, and took the transformer layers from Qwen 2.5 3B, a strong model that could still be trained locally (no corporate resources were harmed, etc).

The basic approach was replacing the tokenizer, adding a small transformer and patch pooling based on a local entropy measure to generate patches, then cross-attending in some of the Qwen layers. Its training a new encoder while leveraging Qwen for the backbone of the global transformer and adding new cross-attention params to make it also the decoder, with the embedding layers at each end chopped off – so a significant domain shift. For inference I leverage the same patch generation process to try and generate effective tokens.

You can find my Torchtune recipe on GitHub, running through the Alpaca dataset. Thus far I’m still training so while loss is improving, I have no idea whether it will turn into something useful. The fact that there is something trainable is fun though, and I have hopes that this kind of technique will lead to some breakthroughs in tokenizer-free models in the future!