BLT's Memory Bandwidth Bottleneck in Byte-Level Generation
Byte-level models like BLT avoid tokenization pitfalls (noise sensitivity, poor multilingual support, weak character- and code-level handling) by processing raw bytes grouped into entropy-based patches (avg. 4 bytes, max 8). Computation flows through a local encoder, a global Transformer over latent patch representations, and a local decoder. Inference is slow because the autoregressive decoder emits one byte per step, while a token model's step covers several bytes; every extra step reloads weights and KV caches from memory, the key serving bottleneck. For equivalent text, BLT needs roughly 4x more decoder passes than a token model, multiplying bandwidth costs.
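The entropy-based patching rule can be sketched as a greedy segmenter. This is a minimal illustration, not BLT's implementation: the `threshold` value and the per-byte entropies (which BLT obtains from a small byte-level LM) are hypothetical inputs.

```python
def segment_patches(byte_entropies, threshold=1.5, max_len=8):
    """Greedy entropy-based patching: start a new patch whenever the
    next-byte entropy (precomputed by a small byte LM) crosses the
    threshold, or the current patch reaches max_len bytes.
    Threshold and entropy values here are illustrative."""
    patches, current = [], []
    for h in byte_entropies:
        if current and (h > threshold or len(current) == max_len):
            patches.append(current)
            current = []
        current.append(h)
    if current:
        patches.append(current)
    return patches
```

Low-entropy runs collapse into long patches (capped at 8 bytes), while high-entropy positions, where the next byte is hard to predict, open a new patch and thus get their own global-model step.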
Block Diffusion Enables Multi-Byte Decoding per Pass (BLT-D)
BLT-D replaces byte-by-byte autoregression with discrete diffusion over fixed blocks (B = 4/8/16 bytes). Training: corrupt each block by masking its bytes independently with probability t ~ U(0,1); the loss combines next-byte prediction on the clean sequence with masked prediction on the corrupted one. Inference: start from an all-MASK block and iteratively unmask several bytes per pass, using either confidence sampling (reveal positions with probability > α) or entropy-bounded sampling (reveal while cumulative entropy stays < γ). The encoder and global model run once per block rather than per patch, and KV caching is supported.
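A minimal sketch of the two diffusion pieces under stated assumptions: the `MASK` id, the α default, the fallback rule (reveal the single most confident position when none clears α), and the toy `predict` interface are all illustrative stand-ins for the actual decoder.

```python
import random

MASK = 256  # mask symbol outside the 0-255 byte alphabet (assumed id)

def corrupt_block(block, rng=random):
    """Training-time corruption: draw t ~ U(0,1) once per block, then
    mask each byte independently with probability t."""
    t = rng.random()
    return [MASK if rng.random() < t else b for b in block], t

def confidence_unmask(block, predict, alpha=0.9):
    """Confidence-based unmasking: each pass reveals every masked
    position whose predicted probability exceeds alpha, falling back
    to the single most confident position so decoding always makes
    progress. `predict` maps the partially masked block to one
    (byte, prob) pair per position -- a stand-in for the decoder."""
    steps = 0
    while MASK in block:
        preds = predict(block)
        masked = [i for i, b in enumerate(block) if b == MASK]
        confident = [i for i in masked if preds[i][1] > alpha]
        if not confident:
            confident = [max(masked, key=lambda i: preds[i][1])]
        for i in confident:
            block[i] = preds[i][0]
        steps += 1
    return block, steps
```

When the decoder is confident everywhere, a whole block resolves in one pass, which is exactly where the bandwidth savings come from.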
At 3B parameters on BLT-1T (1T tokens), BLT-D-4 matches BLT on FLORES-101 translation (French/English, German/English; 4-shot BLEU) and comes close on HumanEval/MBPP coding (0/3-shot pass@1). BLT-D-16 cuts bandwidth by 87-92% but loses coding pass@1. Likelihood benchmarks (ARC-Easy/Challenge, PIQA, HellaSwag, MMLU) stay near baseline thanks to the causally masked decoder. Translation benefits most; coding is the most sensitive to block size. Entropy-bounded sampling with top-p boosts diversity (higher type-token ratio) as NFEs rise.
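The diversity proxy mentioned above, type-token ratio, is simple to compute; the whitespace tokenization in this sketch is an assumption, not necessarily what the evaluation used.

```python
def type_token_ratio(text, n=1):
    """Distinct n-grams divided by total n-grams over whitespace
    tokens -- a simple lexical-diversity proxy for generated samples.
    Higher values mean less repetitive output."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0
```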
No-Training Speculation Recycles the Existing Decoder (BLT-S, BLT-DV)
BLT-S uses the lightweight local decoder as a self-drafter: it generates k = 8 or 16 bytes, ignoring patch boundaries and conditioning on the last latent; a full encode/global/decode pass then verifies, accepting drafted bytes up to the first mismatch. With greedy decoding the output is guaranteed identical to BLT's (no quality loss), and encoder/global calls drop even though decoder passes increase. At 3B with k = 16, bandwidth falls by 77%.
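The draft-then-verify loop can be sketched as follows; `draft_fn` and `verify_fn` are hypothetical stand-ins for BLT's local decoder and the full encode/global/decode pass, and the calling convention is an assumption.

```python
def speculative_generate(draft_fn, verify_fn, prompt, k=16, max_bytes=64):
    """Self-speculative byte generation: a cheap drafter proposes k
    bytes, the expensive verifier recomputes the same positions in one
    pass, and we keep the longest agreeing prefix plus the verifier's
    first corrected byte. With greedy decoding the result is identical
    to running the verifier alone."""
    seq = list(prompt)
    while len(seq) < max_bytes:
        draft = draft_fn(seq, k)
        if not draft:
            break
        verified = verify_fn(seq, len(draft))
        n = 0
        while n < len(draft) and draft[n] == verified[n]:
            n += 1
        # accept the matched prefix plus the verifier's correction
        # (or the entire draft, if every position matched)
        seq += verified[:min(n + 1, len(verified))]
    return seq[:max_bytes]
```

Even when the drafter errs every few bytes, each loop iteration still commits several bytes per expensive verification pass, which is where the encoder/global savings come from.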
BLT-DV (built on BLT-D weights): a one-step diffusion pass drafts a block, and autoregressive verification accepts up to the first mismatch. Single-step diffusion degrades quality on its own, but verification repairs it. At 3B, bandwidth drops by up to 81%.
All models are trained on BLT-1T (public data plus a Datacomp-LM subset): 240k steps at 1B parameters, 480k at 3B. Efficiency proxies: decoder/encoder NFEs and GB of memory bandwidth (16-bit weights; parameter counts times forward passes). Wall-clock gains still require an optimized serving implementation.
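The bandwidth proxy amounts to simple arithmetic; this sketch assumes every forward pass streams all weights once at 16-bit precision and ignores KV-cache and activation traffic, so the paper's exact accounting may differ.

```python
def bandwidth_gb(params, forward_passes, bytes_per_param=2):
    """Coarse memory-bandwidth proxy: each forward pass streams all
    weights once at 16-bit precision (2 bytes/param). Ignores KV-cache
    and activation reads."""
    return params * bytes_per_param * forward_passes / 1e9

# Illustrative 3B model generating 100 bytes: one decoder pass per byte
# vs. one per 4-byte block (split of params across modules ignored).
per_byte = bandwidth_gb(3e9, 100)   # 600.0 GB
per_block = bandwidth_gb(3e9, 25)   # 150.0 GB
```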
Practical Tradeoffs for Production Deployment
BLT-D is fastest (especially B = 16) but trades off coding quality; BLT-S is the safe, zero-loss choice. All variants preserve autoregressive likelihoods and reasoning performance. The bandwidth proxies should translate into real gains in memory-bound serving. Future work: an optimized inference implementation. Byte-level modeling is now viable at production-scale speeds without tokenizer fragility.