BPE Tokenizer

Byte Pair Encoding: iteratively merge the most frequent byte pairs into new tokens

Python Code
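# --- TRAIN ---
merges = {}                               # (id, id) pair -> new token ID
ids = list(text.encode("utf-8"))          # start from raw bytes (IDs 0..255)
for i in range(n_merges):
    stats = get_pair_stats(ids)           # count adjacent pairs
    if not stats: break                   # fewer than 2 tokens left
    best = max(stats, key=stats.get)      # most frequent pair
    idx = 256 + i                         # next free token ID
    ids = merge_pair(ids, best, idx)
    merges[best] = idx

# --- ENCODE ---
ids = list(text.encode("utf-8"))
while len(ids) >= 2:
    stats = get_pair_stats(ids)
    # pick the pair that was learned earliest (lowest merge ID)
    pair = min(stats, key=lambda p: merges.get(p, float("inf")))
    if pair not in merges: break          # nothing left to merge
    ids = merge_pair(ids, pair, merges[pair])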
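
The training loop calls get_pair_stats and merge_pair without defining them. The following is a minimal sketch of what they might look like, assuming get_pair_stats counts adjacent ID pairs and merge_pair replaces non-overlapping occurrences from left to right. The train wrapper and the sample string are illustrative additions, not part of the original page.

```python
from collections import Counter

def get_pair_stats(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, idx):
    """Replace each non-overlapping occurrence of `pair` with `idx`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(idx)   # the two old tokens collapse into one new token
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, n_merges):
    """Hypothetical wrapper around the TRAIN loop; returns (ids, merges)."""
    merges = {}
    ids = list(text.encode("utf-8"))
    for i in range(n_merges):
        stats = get_pair_stats(ids)
        if not stats:         # sequence too short to contain a pair
            break
        best = max(stats, key=stats.get)
        idx = 256 + i
        ids = merge_pair(ids, best, idx)
        merges[best] = idx
    return ids, merges

ids, merges = train("aaabdaaabac", 3)
print(ids)      # [258, 100, 258, 97, 99]
print(merges)   # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Ties between equally frequent pairs are broken here by first occurrence, which is simply what max does on an insertion-ordered dictionary; other implementations may break ties differently.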

Complexity Analysis

Time Complexity: O(m · n)

Training Phase

m = vocab_size - 256 merge iterations
Each iteration scans the entire ID sequence of length n to count pairs: O(n)
Finding the max-frequency pair: O(n) over unique pairs
Merging all occurrences of the best pair: O(n) single pass

T_train(m, n) = m × O(n) = O(m · n)

Encoding Phase

At most m merge operations applied sequentially
Each merge scans the current ID sequence: O(n)
The sequence shrinks with each merge, but each scan still costs O(n) in the worst case

T_encode(m, n) = m × O(n) = O(m · n)
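
As a rough check on the bound above, this sketch wraps the encode loop in a function and counts how many merge passes it performs. The encode name, the pass counter, and the small merge table are assumptions made for illustration; the helpers are minimal stand-ins for get_pair_stats and merge_pair.

```python
def get_pair_stats(ids):
    """Count adjacent pairs of token IDs."""
    stats = {}
    for pair in zip(ids, ids[1:]):
        stats[pair] = stats.get(pair, 0) + 1
    return stats

def merge_pair(ids, pair, idx):
    """Replace each non-overlapping occurrence of `pair` with `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode(text, merges):
    """Apply learned merges in the order they were learned; also return the pass count."""
    ids = list(text.encode("utf-8"))
    passes = 0
    while len(ids) >= 2:
        stats = get_pair_stats(ids)
        # Unknown pairs rank as infinity, so a known pair always wins if one exists.
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = merge_pair(ids, pair, merges[pair])
        passes += 1
    return ids, passes

# Hypothetical merge table: "aa" -> 256, "aa"+"a" -> 257, "aaa"+"b" -> 258
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(encode("aaab", merges))   # ([258], 3)
print(encode("xyz", merges))    # ([120, 121, 122], 0)
```

Each pass applies one learned merge across the whole sequence, so the pass count never exceeds m = len(merges).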

Key Insight

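Each merge replaces at least one pair with a single token, so every iteration shortens the sequence and later iterations scan less data in practice. The asymptotic cost of this rescan-everything loop is still O(m · n); approaching O(n log n) requires updating pair counts incrementally (for example, with a priority queue) rather than recounting the whole sequence on every merge.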
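
To see how much the shrinking sequence helps, the sketch below records the sequence length at the start of every training iteration and compares the total number of tokens scanned with the m · n bound. The function name train_with_work and the sample text are assumptions for illustration; the actual savings depend entirely on the input.

```python
from collections import Counter

def merge_pair(ids, pair, idx):
    """Replace each non-overlapping occurrence of `pair` with `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_with_work(text, n_merges):
    """Run naive BPE training and return (n, lengths scanned per iteration)."""
    ids = list(text.encode("utf-8"))
    n = len(ids)
    lengths = []
    for i in range(n_merges):
        lengths.append(len(ids))            # tokens scanned in this iteration
        stats = Counter(zip(ids, ids[1:]))
        if not stats:
            break
        best = max(stats, key=stats.get)
        ids = merge_pair(ids, best, 256 + i)
    return n, lengths

m = 20
n, lengths = train_with_work("low lower lowest newer newest wider " * 10, m)
work = sum(lengths)
print(f"scanned {work} tokens; worst-case bound m*n = {m * n}")
print("length per iteration:", lengths)
```

The scanned total stays below m · n because every iteration starts from a shorter sequence than the one before it.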

Space Complexity: O(V)

Breakdown

Vocabulary table: V = vocab_size entries mapping ID → byte sequence: O(V)
Merge table: m = V - 256 entries mapping (pair) → new ID: O(m)
Token ID sequence: O(n), shrinks as merges are applied
Pair statistics dictionary: at most n - 1 unique pairs, so O(n)

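S(V, n) = O(V + n), which reduces to O(V) only when V ≥ n. The trained model itself (vocabulary table plus merge table) needs O(V) entries regardless of n.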

Mathematical Intuition

The vocabulary grows by exactly one entry per merge iteration (from 256 base bytes to vocab_size). Each base entry is a single byte, and each merged entry stores the concatenation of two existing entries' byte sequences, so the total byte storage across all entries is bounded by O(V · max_token_length).
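
The relationship between the merge table and the vocabulary can be sketched as follows. The names build_vocab and decode, and the three-entry merge table, are hypothetical and chosen only for illustration.

```python
def build_vocab(merges):
    """Rebuild the ID -> bytes table from the merge table."""
    vocab = {i: bytes([i]) for i in range(256)}      # 256 single-byte base entries
    # Process merges in ID order so both parts already exist when they are joined.
    for (a, b), idx in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[idx] = vocab[a] + vocab[b]
    return vocab

def decode(ids, vocab):
    """Map token IDs back to text by concatenating their byte sequences."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Hypothetical merge table: "aa" -> 256, "aa"+"a" -> 257, "aaa"+"b" -> 258
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
vocab = build_vocab(merges)

V = len(vocab)                                       # 256 + number of merges
total_bytes = sum(len(v) for v in vocab.values())
max_len = max(len(v) for v in vocab.values())

print(V, total_bytes, max_len)                       # 259 265 4
print(decode([258, 100, 258, 97, 99], vocab))        # aaabdaaabac
```

Here the stored bytes (265) sit well under the V · max_token_length bound (259 · 4), since most entries are single bytes.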

Key Insight

BPE achieves compression by trading space (a larger vocabulary) for shorter sequences. The merge table is the "dictionary" that enables this trade-off — it compactly encodes how to decompose any token back to raw bytes.