Inherit Advanced Capabilities from Giant Teachers at Low Cost
LLM distillation trains smaller "student" models on outputs from powerful "teacher" LLMs, transferring reasoning, instruction-following, and structured generation more efficiently than training on raw text alone. This cuts inference costs while preserving most of the teacher's performance: Meta distilled Llama 4 Behemoth into Scout and Maverick; Google used Gemini to train Gemma 2/3; DeepSeek transferred reasoning from DeepSeek-R1 into Qwen- and Llama-based models. Distillation can be applied during pre-training (teacher and student trained jointly) or post-training (fixed teacher), enabling deployment of high-performing models on limited hardware.
Soft-Label Distillation Unlocks Richer Signals but Demands Resources
Train the student to replicate the teacher's full softmax probability distribution over the vocabulary, not just the top token. Example: the teacher assigns "cat" 70%, "dog" 20%, "animal" 10%; the student learns token relationships and uncertainty, capturing the "dark knowledge" embedded in the teacher's reasoning patterns. This yields more stable training and better inheritance of semantic understanding than hard labels alone. Trade-offs: it requires access to teacher logits or weights (impossible for closed models like GPT-4), and storing full distributions over 100k+-token vocabularies explodes memory on trillion-token datasets, limiting scalability.
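The core of soft-label distillation is a KL-divergence loss between temperature-scaled teacher and student distributions. A minimal sketch in plain Python (the three-token toy vocabulary, temperature value, and function names are illustrative, not from any specific system):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 flattens the distribution, exposing "dark knowledge"
    # in the low-probability tokens; max-subtraction keeps exp() stable.
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    # Forward KL divergence KL(p_teacher || q_student) over the vocabulary.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # The T^2 factor is the standard scaling that keeps gradient magnitudes
    # comparable across temperature settings (Hinton et al.'s recipe).
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )

# Toy vocabulary ["cat", "dog", "animal"]: loss is zero when the student
# already matches the teacher, positive otherwise.
loss = soft_label_loss([3.0, 1.0, 0.0], [0.0, 0.0, 0.0])
```

In a real training loop these lists would be logit tensors over the full vocabulary, and the KL term would be backpropagated through the student only.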
Hard-Label Distillation Prioritizes Practicality with Black-Box Access
The student mimics the teacher's final generated tokens via standard supervised learning, treating the teacher as a synthetic-data annotator. DeepSeek used this approach to instill reasoning in smaller Qwen and Llama 3.1 models. Advantages: far cheaper (no probability storage) and compatible with API-only black-box teachers (e.g., GPT-4 text outputs). It works well for instruction tuning, synthetic data generation, and domain fine-tuning, though it discards the teacher's internal confidence and token relationships, so the transfer is less nuanced than with soft labels.
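Because only the teacher's chosen token survives, the student's objective collapses to ordinary cross-entropy against a one-hot target. A minimal sketch, assuming toy logit lists (the function name and values are illustrative):

```python
import math

def hard_label_loss(teacher_logits, student_logits):
    # The teacher acts as an annotator: only its argmax token becomes the
    # target, discarding the rest of its probability distribution.
    target = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    # Cross-entropy of the student's softmax against the one-hot target,
    # computed via the stable log-sum-exp identity.
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in student_logits))
    return log_z - student_logits[target]

# Teacher picks token 0; a uniform student pays log(vocab_size) nats.
loss = hard_label_loss([3.0, 1.0, 0.0], [0.0, 0.0, 0.0])
```

With a black-box API teacher, the same objective applies: the teacher's generated text is tokenized and used directly as supervised targets, so no teacher logits are ever needed.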
Co-Distillation Enables Collaborative Gains Over One-Way Transfer
Train teacher and student simultaneously on shared data: the teacher learns from ground-truth hard labels, while the student matches the teacher's evolving soft labels plus a hard-label loss for stability. Meta applied this for the Llama 4 family. Benefits: mutual improvement narrows the teacher-student gap and enhances reasoning transfer, while the hybrid loss mitigates noisy teacher predictions early in training. Drawback: added complexity from a non-fixed teacher. In short: use soft labels for maximum transfer (open models), hard labels for ease and scalability, and co-distillation for large joint training setups.
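The student's hybrid objective can be sketched as a weighted sum of a hard cross-entropy term on the ground-truth token and a soft KL term against the teacher's current (still-training) distribution. The `alpha` weight and temperature here are illustrative hyperparameters, not values reported for Llama 4:

```python
import math

def _softmax(logits, t=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / t) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def co_distill_loss(student_logits, teacher_logits, true_token,
                    alpha=0.5, temperature=2.0):
    # Soft term: KL(teacher || student) at temperature T, scaled by T^2.
    p = _softmax(teacher_logits, temperature)
    q = _softmax(student_logits, temperature)
    soft = temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )
    # Hard term: cross-entropy on the ground-truth label, which anchors the
    # student while the jointly-trained teacher is still noisy.
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in student_logits))
    hard = log_z - student_logits[true_token]
    return alpha * hard + (1 - alpha) * soft

loss = co_distill_loss([0.0, 0.0], [3.0, 1.0], true_token=0)
```

In the joint setup, the teacher is optimized on its own hard-label loss in the same step, so the soft targets the student chases improve as training progresses.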