Byte Pair Encoding: iteratively merge the most frequent byte pairs into new tokens
m = vocab_size - 256 merge iterationsBPE's greedy merge strategy ensures that each iteration reduces the sequence length, making later iterations faster in practice. The total work across all iterations is often closer to O(n log n) than the worst-case O(m · n).
The vocabulary grows by exactly one entry per merge iteration (from 256 base bytes to vocab_size). Each vocabulary entry stores the concatenation of two existing entries' byte sequences, so the total byte storage across all entries is bounded by O(V · max_token_length).
BPE achieves compression by trading space (a larger vocabulary) for shorter sequences. The merge table is the "dictionary" that enables this trade-off — it compactly encodes how to decompose any token back to raw bytes.