Inherit Advanced Capabilities from Giant Teachers at Low Cost
LLM distillation trains smaller "student" models on outputs from powerful "teacher" LLMs, transferring reasoning, instruction-following, and structured generation more efficiently than training on raw text alone. This cuts inference costs while preserving most of the teacher's performance: Meta distilled Llama 4 Behemoth into Scout and Maverick; Google used Gemini to train Gemma 2/3; DeepSeek transferred reasoning from DeepSeek-R1 into Qwen- and Llama-based models. Distillation can be applied during pre-training (teacher and student trained jointly) or post-training (fixed teacher), enabling deployment of high-performing models on limited hardware.
Soft-Label Distillation Unlocks Richer Signals but Demands Resources
Train the student to replicate the teacher's full softmax probability distribution over the vocabulary, not just the top token. Example: the teacher assigns "cat" 70%, "dog" 20%, "animal" 10%; the student learns token relationships and uncertainty, capturing the "dark knowledge" embedded in the teacher's reasoning patterns. This yields more stable training and better inheritance of semantic understanding than hard labels alone. Trade-offs: it requires access to teacher logits or weights (impossible for closed models like GPT-4), and storing full distributions over 100k+-token vocabularies explodes memory on trillion-token datasets, limiting scalability.
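The core of soft-label distillation is a KL-divergence loss between temperature-scaled teacher and student distributions. A minimal sketch in plain Python (the three-token toy vocabulary, temperature value, and function names are illustrative, not from any specific system):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 flattens the distribution, exposing "dark knowledge"
    # in the low-probability tokens; max-subtraction keeps exp() stable.
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    # Forward KL divergence KL(p_teacher || q_student) over the vocabulary.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The T^2 factor is the standard scaling that keeps gradient magnitudes
    # comparable across temperature settings (Hinton et al.'s recipe).
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )

# Toy vocabulary ["cat", "dog", "animal"]: loss is zero when the student
# already matches the teacher, positive otherwise.
loss = soft_label_loss([3.0, 1.0, 0.0], [0.0, 0.0, 0.0])
```

In a real training loop these lists would be logit tensors over the full vocabulary, and the KL term would be backpropagated through the student only.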
Hard-Label Distillation Prioritizes Practicality with Black-Box Access
The student mimics the teacher's final generated tokens via standard supervised learning, treating the teacher as a synthetic-data annotator. DeepSeek used this approach to instill reasoning in smaller Qwen and Llama 3.1 models. Advantages: far cheaper (no probability storage) and compatible with API-only black-box teachers (e.g., GPT-4 text outputs). It works well for instruction tuning, synthetic data generation, and domain fine-tuning, though it discards the teacher's internal confidence and token relationships, so the transfer is less nuanced than with soft labels.
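Because only the teacher's chosen token survives, the student's objective collapses to ordinary cross-entropy against a one-hot target. A minimal sketch, assuming toy logit lists (the function name and values are illustrative):

```python
import math

def hard_label_loss(teacher_logits, student_logits):
    # The teacher acts as an annotator: only its argmax token becomes the
    # target, discarding the rest of its probability distribution.
    target = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    # Cross-entropy of the student's softmax against the one-hot target,
    # computed via the stable log-sum-exp identity.
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in student_logits))
    return log_z - student_logits[target]

# Teacher picks token 0; a uniform student pays log(vocab_size) nats.
loss = hard_label_loss([3.0, 1.0, 0.0], [0.0, 0.0, 0.0])
```

With a black-box API teacher, the same objective applies: the teacher's generated text is tokenized and used directly as supervised targets, so no teacher logits are ever needed.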
Co-Distillation Enables Collaborative Gains Over One-Way Transfer
Train teacher and student simultaneously on shared data: the teacher learns from ground-truth hard labels, while the student matches the teacher's evolving soft labels plus a hard-label loss for stability. Meta applied this for the Llama 4 family. Benefits: mutual improvement narrows the teacher-student gap and enhances reasoning transfer, while the hybrid loss mitigates noisy teacher predictions early in training. Drawback: added complexity from a non-fixed teacher. In short: use soft labels for maximum transfer (open models), hard labels for ease and scalability, and co-distillation for large joint training setups.
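The student's hybrid objective can be sketched as a weighted sum of a hard cross-entropy term on the ground-truth token and a soft KL term against the teacher's current (still-training) distribution. The `alpha` weight and temperature here are illustrative hyperparameters, not values reported for Llama 4:

```python
import math

def _softmax(logits, t=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / t) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def co_distill_loss(student_logits, teacher_logits, true_token,
                    alpha=0.5, temperature=2.0):
    # Soft term: KL(teacher || student) at temperature T, scaled by T^2.
    p = _softmax(teacher_logits, temperature)
    q = _softmax(student_logits, temperature)
    soft = temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )
    # Hard term: cross-entropy on the ground-truth label, which anchors the
    # student while the jointly-trained teacher is still noisy.
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in student_logits))
    hard = log_z - student_logits[true_token]
    return alpha * hard + (1 - alpha) * soft

loss = co_distill_loss([0.0, 0.0], [3.0, 1.0], true_token=0)
```

In the joint setup, the teacher is optimized on its own hard-label loss in the same step, so the soft targets the student chases improve as training progresses.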