TwELL Format Enables Zero-Overhead Sparsity on GPUs
Feedforward layers consume roughly ⅔ of LLM parameters and 80%+ of FLOPs, while activation sparsity leaves 99%+ of hidden neurons at zero after ReLU, so most of that compute is skippable. Standard ELLPACK sparsity fails on batched GEMM (training and high-throughput inference): the dense-to-sparse conversion overhead matches or exceeds the savings on Tensor-Core-optimized GPUs.
TwELL fixes this with tile-wise packing: partition the gate-activation columns into horizontal tiles matching the matmul kernel tile size T_n (e.g., the CTA tile dimension). Non-zeros and their local indices are packed ELL-style per tile inside the gate projection's epilogue, so there is no extra kernel, global read, or synchronization. A compression factor C is chosen so that T_n/C exceeds the maximum non-zeros per tile; values and indices are stored as a single 32-bit matrix for locality. A reference sketch of the packing follows.
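To make the packing concrete, here is a minimal CPU reference sketch in NumPy. It assumes a tile width T_n, a compression factor C, and one plausible reading of the 32-bit layout (fp16 value in the high half-word, local column index in the low half-word); function and parameter names are illustrative, not the released kernels.

```python
import numpy as np

def twell_pack(h_gate: np.ndarray, tile_n: int = 128, compression: int = 8) -> np.ndarray:
    """CPU reference for TwELL tile-wise packing (illustrative, not the GPU epilogue).

    h_gate      : (rows, d_ff) gate activations after ReLU, mostly zero.
    tile_n      : matmul kernel tile width T_n along the hidden (d_ff) dimension.
    compression : C; each tile row keeps tile_n // C slots, which must cover the
                  maximum number of non-zeros in any tile row.
    Returns a (rows, n_tiles, tile_n // C) uint32 array; each entry packs the
    fp16 value (high 16 bits) with the local column index (low 16 bits).
    """
    rows, d_ff = h_gate.shape
    assert d_ff % tile_n == 0, "d_ff must be a multiple of the tile width"
    n_tiles = d_ff // tile_n
    slots = tile_n // compression

    packed = np.zeros((rows, n_tiles, slots), dtype=np.uint32)
    for t in range(n_tiles):
        tile = h_gate[:, t * tile_n:(t + 1) * tile_n]
        for r in range(rows):
            idx = np.flatnonzero(tile[r])                 # local non-zero columns
            assert idx.size <= slots, "compression factor C too aggressive for this tile"
            bits = tile[r, idx].astype(np.float16).view(np.uint16).astype(np.uint32)
            packed[r, t, :idx.size] = (bits << 16) | idx.astype(np.uint32)
    return packed
```

Unused slots stay all-zero (value 0.0, index 0), so a consumer that skips zero-valued entries handles padding for free.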
Inference fuses the up and down projections into one kernel per input row: CTAs iterate over tile non-zeros, loading only the corresponding W_u columns and W_d rows for dot products. The hidden state h_u stays in registers, slashing DRAM traffic (see the sketch below). Training uses a hybrid format: rows with few non-zeros (below a threshold) are routed to the compact ELL buffer and overflow rows fall back to a dense backup, which handles non-uniform sparsity (max non-zeros per row far exceeds the average).
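A matching per-row reference for the fused up/down projection, consuming the packed layout from the sketch above. This is a hedged illustration of the data flow (which W_u columns and W_d rows get touched and how h_u stays local), not the fused GPU kernel itself.

```python
import numpy as np

def sparse_glu_row(x, packed_row, W_u, W_d, tile_n=128):
    """One input row of the sparse gated MLP, driven by packed gate non-zeros.

    x          : (d_model,) input activations for this row
    packed_row : (n_tiles, slots) uint32 entries from twell_pack (value << 16 | index)
    W_u        : (d_model, d_ff) up projection
    W_d        : (d_ff, d_model) down projection
    Only the W_u columns / W_d rows of non-zero gate entries are read, and the
    scalar h_u never leaves the loop, mirroring the register-resident hidden state.
    """
    y = np.zeros(W_d.shape[1], dtype=np.float32)
    for t, tile in enumerate(packed_row):
        for word in tile:
            gate_val = np.array([word >> 16], dtype=np.uint16).view(np.float16)[0]
            if gate_val == 0:                       # padding slot, skip
                continue
            j = t * tile_n + int(word & 0xFFFF)     # global hidden index
            h_u = float(gate_val) * float(x @ W_u[:, j])
            y += h_u * W_d[j, :]
    return y
```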
The format supports gated MLPs (Llama/Qwen) as well as non-gated Transformers (11.2% inference speedup at L1 = 2e-5).
Induce Sparsity with ReLU + L1, No Hyperparameter Tweaks
Replace SiLU with ReLU in the gate so negative pre-activations become exact zeros. Add an L1 loss on the hidden activations (post up-projection, pre down-projection): L1 = 2×10⁻⁵ × mean(|h|) over tokens, dimensions, and layers, added to the CE loss.
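A minimal PyTorch sketch of the recipe, assuming a standard gated MLP; the penalty here is per layer (mean over tokens and hidden dims), with the caller averaging over layers and adding the result to the CE loss. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def gated_mlp_with_l1(x, W_gate, W_up, W_down, l1_coef=2e-5):
    """Gated MLP forward with a ReLU gate and the L1 sparsity penalty.

    x : (tokens, d_model); W_gate, W_up : (d_model, d_ff); W_down : (d_ff, d_model).
    """
    gate = F.relu(x @ W_gate)              # exact zeros where SiLU would leave small negatives
    h = gate * (x @ W_up)                  # hidden activations: post up-projection, pre down-projection
    y = h @ W_down
    l1_penalty = l1_coef * h.abs().mean()  # mean |h| over tokens and hidden dims
    return y, l1_penalty

# Schematic training step: average per-layer penalties, add to the CE loss.
# loss = F.cross_entropy(logits, targets) + torch.stack(l1_penalties).mean()
```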
Sparsity stabilizes within ~1000 steps (~1B tokens). At L1 = 2e-5, non-zero activations drop from 911 to 29 per layer (99.5% sparse) in a 1.5B model (d_ff = 5632); 30%+ of neurons die permanently, but accuracy holds (46.4% → 46.2% on tasks). Across 8 tested L1 values up to 3e-5, CE rises <2% relative with no task-accuracy drop (ARC, HellaSwag, etc.).
Dead neurons can be mitigated by reinitializing their gate weights: +19.1% speedup vs. +17.9% for the baseline, at the same sparsity and accuracy. Training runs on fineweb-edu (10-40B tokens, Chinchilla-optimal) with ctx = 2048 and 1M-token batches, with no changes to LR, optimizer, or weight decay.
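A hedged sketch of the dead-neuron mitigation, assuming "dead" means a hidden neuron whose gate never fired over a monitoring window; the exact criterion and reinit schedule are not spelled out above, so the names and the init scale here are illustrative only.

```python
import torch

@torch.no_grad()
def reinit_dead_gate_neurons(W_gate: torch.Tensor, fire_counts: torch.Tensor,
                             init_std: float = 0.02) -> int:
    """Reinitialize gate-projection columns of neurons that never fired.

    W_gate      : (d_model, d_ff) gate projection, so column j drives hidden neuron j
    fire_counts : (d_ff,) times each neuron was non-zero over the monitoring window
    init_std    : std of the fresh normal init (an assumption, not from the text)
    Returns the number of columns reinitialized.
    """
    dead = fire_counts == 0
    W_gate[:, dead] = torch.randn_like(W_gate[:, dead]) * init_std
    return int(dead.sum())
```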
Speedups Grow with Scale; Patterns Favor Early Layers
On 8x H100 PCIe (seq=2048):
| Model | Inference Speedup | Training Throughput Δ | Peak Memory Δ | Energy/Token Δ | Accuracy (dense → TwELL) |
|---|---|---|---|---|---|
| 0.5B | +17.0% | -1.5% | -19.2% | -11.8% | 40.4→40.4% |
| 1B | +18.1% | +7.1% | -25.5% | -14.6% | 44.6→44.7% |
| 1.5B | +18.8% | +11.6% | -28.1% | -15.0% | 46.4→46.2% |
| 2B | +20.5% | +21.9% | +22.3%* | -17.0% | 49.1→48.8% |
*The 2B run converts its memory savings into a larger micro-batch, so peak memory rises (46.7 → 57.1 GB). Non-zeros per layer fall from 39 to 24 going from 0.5B to 2B, amplifying the skipped work; a -0.996 Pearson correlation between non-zero count and speedup confirms that sparser layers yield bigger gains.
Patterns: layers 1-2 are the least active in the 28-layer 1.5B model; activity peaks in early-middle layers (associated with reasoning/knowledge). The first sequence position fires exponentially more neurons than later positions. Gains are larger on RTX PRO 6000 (188 SMs): sparse thrives where dense GEMM lags.
Kernels are open-sourced (H100 TMA/persistent CTAs; verified on RTX), along with code for Llama and other models. Future work: fine-tuning existing dense models.