TwELL Format Enables Zero-Overhead Sparsity on GPUs
Feedforward layers consume roughly ⅔ of LLM parameters and 80%+ of FLOPs, while activation sparsity leaves 99%+ of hidden neurons at zero after ReLU, so most of that compute is skippable. Standard ELLPACK sparsity fails on batched GEMM (training and high-throughput inference): the dense-to-sparse conversion overhead matches or exceeds the savings on Tensor-Core-optimized GPUs.
TwELL fixes this with tile-wise packing: partition the gate-activation columns into horizontal tiles matching the matmul kernel tile size T_n (e.g., the CTA tile dimension). Non-zeros and their local indices are packed ELL-style per tile inside the gate projection's epilogue, so there is no extra kernel, global read, or synchronization. A compression factor C is chosen so that T_n/C exceeds the maximum non-zeros per tile; values and indices are stored as a single 32-bit matrix for locality. A reference sketch of the packing follows.
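To make the packing concrete, here is a minimal CPU reference sketch in NumPy. It assumes a tile width T_n, a compression factor C, and one plausible reading of the 32-bit layout (fp16 value in the high half-word, local column index in the low half-word); function and parameter names are illustrative, not the released kernels.

```python
import numpy as np

def twell_pack(h_gate: np.ndarray, tile_n: int = 128, compression: int = 8) -> np.ndarray:
    """CPU reference for TwELL tile-wise packing (illustrative, not the GPU epilogue).

    h_gate      : (rows, d_ff) gate activations after ReLU, mostly zero.
    tile_n      : matmul kernel tile width T_n along the hidden (d_ff) dimension.
    compression : C; each tile row keeps tile_n // C slots, which must cover the
                  maximum number of non-zeros in any tile row.
    Returns a (rows, n_tiles, tile_n // C) uint32 array; each entry packs the
    fp16 value (high 16 bits) with the local column index (low 16 bits).
    """
    rows, d_ff = h_gate.shape
    assert d_ff % tile_n == 0, "d_ff must be a multiple of the tile width"
    n_tiles = d_ff // tile_n
    slots = tile_n // compression

    packed = np.zeros((rows, n_tiles, slots), dtype=np.uint32)
    for t in range(n_tiles):
        tile = h_gate[:, t * tile_n:(t + 1) * tile_n]
        for r in range(rows):
            idx = np.flatnonzero(tile[r])                 # local non-zero columns
            assert idx.size <= slots, "compression factor C too aggressive for this tile"
            bits = tile[r, idx].astype(np.float16).view(np.uint16).astype(np.uint32)
            packed[r, t, :idx.size] = (bits << 16) | idx.astype(np.uint32)
    return packed
```

Unused slots stay all-zero (value 0.0, index 0), so a consumer that skips zero-valued entries handles padding for free.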
Inference fuses the up and down projections into one kernel per input row: CTAs iterate over tile non-zeros, loading only the corresponding W_u columns and W_d rows for dot products. The hidden state h_u stays in registers, slashing DRAM traffic (see the sketch below). Training uses a hybrid format: rows with few non-zeros (below a threshold) are routed to the compact ELL buffer and overflow rows fall back to a dense backup, which handles non-uniform sparsity (max non-zeros per row far exceeds the average).
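A matching per-row reference for the fused up/down projection, consuming the packed layout from the sketch above. This is a hedged illustration of the data flow (which W_u columns and W_d rows get touched and how h_u stays local), not the fused GPU kernel itself.

```python
import numpy as np

def sparse_glu_row(x, packed_row, W_u, W_d, tile_n=128):
    """One input row of the sparse gated MLP, driven by packed gate non-zeros.

    x          : (d_model,) input activations for this row
    packed_row : (n_tiles, slots) uint32 entries from twell_pack (value << 16 | index)
    W_u        : (d_model, d_ff) up projection
    W_d        : (d_ff, d_model) down projection
    Only the W_u columns / W_d rows of non-zero gate entries are read, and the
    scalar h_u never leaves the loop, mirroring the register-resident hidden state.
    """
    y = np.zeros(W_d.shape[1], dtype=np.float32)
    for t, tile in enumerate(packed_row):
        for word in tile:
            gate_val = np.array([word >> 16], dtype=np.uint16).view(np.float16)[0]
            if gate_val == 0:                       # padding slot, skip
                continue
            j = t * tile_n + int(word & 0xFFFF)     # global hidden index
            h_u = float(gate_val) * float(x @ W_u[:, j])
            y += h_u * W_d[j, :]
    return y
```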
The format supports gated MLPs (Llama/Qwen) as well as non-gated Transformers (11.2% inference speedup at L1 = 2e-5).
Induce Sparsity with ReLU + L1, No Hyperparameter Tweaks
Replace SiLU with ReLU in the gate so negative pre-activations become exact zeros. Add an L1 loss on the hidden activations (post up-projection, pre down-projection): L1 = 2×10⁻⁵ × mean(|h|) over tokens, dimensions, and layers, added to the CE loss.
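A minimal PyTorch sketch of the recipe, assuming a standard gated MLP; the penalty here is per layer (mean over tokens and hidden dims), with the caller averaging over layers and adding the result to the CE loss. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def gated_mlp_with_l1(x, W_gate, W_up, W_down, l1_coef=2e-5):
    """Gated MLP forward with a ReLU gate and the L1 sparsity penalty.

    x : (tokens, d_model); W_gate, W_up : (d_model, d_ff); W_down : (d_ff, d_model).
    """
    gate = F.relu(x @ W_gate)              # exact zeros where SiLU would leave small negatives
    h = gate * (x @ W_up)                  # hidden activations: post up-projection, pre down-projection
    y = h @ W_down
    l1_penalty = l1_coef * h.abs().mean()  # mean |h| over tokens and hidden dims
    return y, l1_penalty

# Schematic training step: average per-layer penalties, add to the CE loss.
# loss = F.cross_entropy(logits, targets) + torch.stack(l1_penalties).mean()
```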
Sparsity stabilizes within ~1000 steps (~1B tokens). At L1 = 2e-5, non-zero activations drop from 911 to 29 per layer (99.5% sparse) in a 1.5B model (d_ff = 5632); 30%+ of neurons die permanently, but accuracy holds (46.4% → 46.2% on tasks). Across 8 tested L1 values up to 3e-5, CE rises <2% relative with no task-accuracy drop (ARC, HellaSwag, etc.).
Dead neurons can be mitigated by reinitializing their gate weights: +19.1% speedup vs. +17.9% for the baseline, at the same sparsity and accuracy. Training runs on fineweb-edu (10-40B tokens, Chinchilla-optimal) with ctx = 2048 and 1M-token batches, with no changes to LR, optimizer, or weight decay.
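A hedged sketch of the dead-neuron mitigation, assuming "dead" means a hidden neuron whose gate never fired over a monitoring window; the exact criterion and reinit schedule are not spelled out above, so the names and the init scale here are illustrative only.

```python
import torch

@torch.no_grad()
def reinit_dead_gate_neurons(W_gate: torch.Tensor, fire_counts: torch.Tensor,
                             init_std: float = 0.02) -> int:
    """Reinitialize gate-projection columns of neurons that never fired.

    W_gate      : (d_model, d_ff) gate projection, so column j drives hidden neuron j
    fire_counts : (d_ff,) times each neuron was non-zero over the monitoring window
    init_std    : std of the fresh normal init (an assumption, not from the text)
    Returns the number of columns reinitialized.
    """
    dead = fire_counts == 0
    W_gate[:, dead] = torch.randn_like(W_gate[:, dead]) * init_std
    return int(dead.sum())
```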
Speedups Grow with Scale; Patterns Favor Early Layers
On 8x H100 PCIe (seq=2048):
| Model | Inference Speedup | Training Throughput Δ | Peak Memory Δ | Energy/Token Δ | Accuracy (dense → TwELL) |
|---|---|---|---|---|---|
| 0.5B | +17.0% | -1.5% | -19.2% | -11.8% | 40.4→40.4% |
| 1B | +18.1% | +7.1% | -25.5% | -14.6% | 44.6→44.7% |
| 1.5B | +18.8% | +11.6% | -28.1% | -15.0% | 46.4→46.2% |
| 2B | +20.5% | +21.9% | +22.3%* | -17.0% | 49.1→48.8% |
*The 2B run converts its memory savings into a larger micro-batch, so peak memory rises (46.7 → 57.1 GB). Non-zeros per layer fall from 39 to 24 going from 0.5B to 2B, amplifying the skipped work; a -0.996 Pearson correlation between non-zero count and speedup confirms that sparser layers yield bigger gains.
Patterns: layers 1-2 are the least active in the 28-layer 1.5B model; activity peaks in early-middle layers (associated with reasoning/knowledge). The first sequence position fires exponentially more neurons than later positions. Gains are larger on RTX PRO 6000 (188 SMs): sparse thrives where dense GEMM lags.
Kernels are open-sourced (H100 TMA/persistent CTAs; verified on RTX), along with code for Llama and other models. Future work: fine-tuning existing dense models.