#deep-learning
Aurora Fixes Muon's Neuron Death in Tall MLPs
Aurora optimizer eliminates the >25% neuron death seen in Muon's tall matrices by jointly enforcing left semi-orthogonality and uniform row norms √(n/m), achieving SOTA on the nanoGPT speedrun with only 6% compute overhead.
NVIDIA Halves DSA Top-K Time via Decode Stability
NVIDIA exploits autoregressive decoding's temporal stability—similar queries and gradually evolving scores—to cut DeepSeek Sparse Attention's Top-K bottleneck by half using Guess-Verify-Refine.
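The guess-verify-refine idea can be sketched in a few lines. This is a toy illustration of the control flow only (function and variable names are mine, not NVIDIA's kernel): reuse last step's Top-K as the guess, verify it cheaply, and fall back to an exact Top-K only when the guess fails.

```python
import numpy as np

def guess_verify_refine_topk(scores, prev_topk, k):
    """Toy sketch of decode-stable Top-K (illustrative, not NVIDIA's kernel).

    Guess: reuse last step's Top-K indices (scores evolve gradually).
    Verify: check whether any index outside the guess beats the weakest
    guessed score; if none does, the guess is already correct.
    Refine: fall back to an exact Top-K only when verification fails.
    """
    guess = np.asarray(prev_topk)
    threshold = scores[guess].min()
    outside = np.setdiff1d(np.arange(len(scores)), guess)
    if (scores[outside] <= threshold).all():
        return np.sort(guess)  # verified: cheap path, no full scan
    # refine: exact Top-K over all scores
    return np.sort(np.argpartition(scores, -k)[-k:])
```

Because decode steps rarely shuffle the ranking, the cheap verified path dominates, which is where the claimed 2x savings comes from.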
Triple YOLO Recall with Adaptive Post-Processing
In crowded scenes, set YOLO confidence to 0.05, then filter dynamically by frame score distribution, box size (lower threshold for <5% height boxes), and pose keypoints (nose + shoulders) to detect 3x more people without retraining.
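The dynamic-threshold filter above can be sketched as a post-processing function (names and the exact threshold rules are my assumptions; the keypoint check is omitted):

```python
import numpy as np

def adaptive_filter(boxes, scores, frame_h, base_thr=0.25):
    """Sketch of the adaptive post-filter described above (names are mine).

    Detections come in at a very low confidence floor (e.g. 0.05); the
    per-frame threshold then adapts to this frame's score distribution,
    and small boxes (<5% of frame height) get a halved threshold.
    """
    scores = np.asarray(scores, dtype=float)
    boxes = np.asarray(boxes, dtype=float)  # rows of (x1, y1, x2, y2)
    # dynamic threshold: never stricter than this frame's median score
    dyn_thr = min(base_thr, float(np.median(scores)))
    heights = boxes[:, 3] - boxes[:, 1]
    small = heights < 0.05 * frame_h
    # small (distant) boxes keep a lower threshold so they survive
    keep = np.where(small, scores >= dyn_thr * 0.5, scores >= dyn_thr)
    return np.flatnonzero(keep)
```

A real pipeline would add the pose-keypoint check (nose + shoulders visible) as a final validation on the kept boxes.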
Build CLIP: 400M Images, Zero Labels via Contrastive Learning
CLIP trains vision models on 400 million scraped image-text pairs using a single contrastive objective—no manual labels needed—matching ResNet-101 zero-shot on ImageNet and powering DALL-E 2, Stable Diffusion, LLaVA.
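CLIP's single contrastive objective is compact enough to sketch directly: the only supervision is which image goes with which caption in the batch. A minimal NumPy version (temperature value illustrative):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Minimal sketch of CLIP's symmetric contrastive objective.

    Row i of img_emb should match row i of txt_emb; every other pairing
    in the batch acts as a negative. No labels beyond the pairing itself.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    targets = np.arange(len(logits))            # diagonal = correct pair

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[targets, targets].mean()

    # cross-entropy in both directions (image->text and text->image)
    return 0.5 * (xent(logits) + xent(logits.T))
```

Scaling this loss to 400M pairs is what yields the zero-shot transfer the summary describes.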
Generative AI: Prediction to Creation via Scale
Generative AI shifts machines from analyzing data (traditional AI's strength) to creating new content like text or images, powered by Markov chains, deep learning, and massive datasets and compute, drawing $33.9B of investment in 2024.
GPU Bandwidth Limits LLM Speed, Not FLOPS
Generating one token from a 70B model on H100 needs 140GB weight reads—one op per byte—making memory bandwidth the inference bottleneck, not compute throughput.
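The arithmetic behind this claim fits in one formula: each decode step streams all weights from HBM once, so tokens/s is bounded by bandwidth over model bytes. A back-of-envelope sketch (the ~3.35 TB/s figure assumes an H100 SXM's HBM3):

```python
def decode_tokens_per_sec(param_count, bytes_per_param, mem_bw_gbs):
    """Upper bound on decode speed from memory bandwidth alone.

    Every generated token must read all weights from HBM once, so
    tokens/s <= bandwidth / model_bytes. FLOPS are ignored because
    decode performs roughly one multiply-add per byte read.
    """
    model_bytes = param_count * bytes_per_param
    return mem_bw_gbs * 1e9 / model_bytes

# 70B params in fp16 (2 bytes) = 140 GB per token; at ~3.35 TB/s that
# caps single-stream decode at roughly 24 tokens/s, regardless of FLOPS.
```

This is why batching helps: the same 140 GB of weight reads can serve many sequences at once.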
Diffusion: Data-Efficient Framework Outshining Autoregressives on Scarce Data
Diffusion is a training framework—not an architecture—that manufactures extra training pairs by gradually noising clean data over ~1,000 steps, outperforming autoregressive models in the 25-100M-token regime where data is scarce but compute abundant; it still lags in text due to slow inference and immature infrastructure.
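The data-efficiency mechanism is the forward noising process: one clean sample yields T progressively noisier training pairs. A sketch with the standard linear beta schedule (schedule constants are the common defaults, an assumption on my part):

```python
import numpy as np

def noise_sample(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, rng=None):
    """Forward (noising) process with a linear beta schedule.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    Each clean x0 yields T noisy training pairs (x_t, eps), which is
    how diffusion squeezes extra supervision out of scarce data.
    """
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)          # cumulative signal kept
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

At t=0 the sample is nearly clean; by t=T-1 it is almost pure noise, and the model learns to predict eps at every level in between.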
PCL: Confidence RL for Dynamic LLM Environments (Caleb Writes Code)
PCL algorithm integrates predictive confidence scores into LLM RL rewards via ensembles and blended token/sequence signals, enabling adaptation to nonstationary changes without retraining.
Sentences Define Word Meanings via Self-Attention
Transformers ended 30 years of sequential processing flaws by using self-attention, where every word weighs relevance from the entire sentence context, powering GPT and all modern LLMs.
LLM Inference: mmap Loading & Quantization Deep Dive
Efficient LLM inference hinges on mmap for lazy memory loading (e.g., <10s startup on llama.cpp) and quantization like GGUF K-Quants or AWQ/EXL2 to shrink 15GB models while preserving quality via salient weights and mixed precision.
Preprocessing Swings CNN Accuracy from 65% to 87% on CIFAR-10 (Caleb Writes Code)
Raw CIFAR-10 pixels yield 65% test accuracy; normalization/standardization lift to 69%; geometric augmentation maintains ~67%; photometric brightness/contrast crashes to 20%; combined pipeline with deeper CNN hits 87%.
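The normalization/standardization step that drives the 65% to 69% jump is simple to sketch. One detail worth making explicit: the statistics must come from the training set only (computing them on test data would leak information).

```python
import numpy as np

def standardize(train_x, test_x):
    """Per-channel standardization for image batches (sketch).

    Mean/std are computed from the training split only, then applied
    to both splits, so the test set stays unseen.
    """
    # inputs shaped (N, H, W, 3), uint8 pixels in [0, 255]
    train = train_x.astype(np.float32) / 255.0
    test = test_x.astype(np.float32) / 255.0
    mean = train.mean(axis=(0, 1, 2))          # one mean per channel
    std = train.std(axis=(0, 1, 2)) + 1e-7     # avoid divide-by-zero
    return (train - mean) / std, (test - mean) / std
```

The photometric-augmentation collapse to 20% reported above is the cautionary counterpoint: not every preprocessing step helps, so each one should be benchmarked in isolation.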
53x AI Efficiency via Model Distillation by 2025
Train small 'student' models on large 'teacher' models' soft probabilities—not just labels—to match performance while slashing size, speed, and costs by 53x by 2025.
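The soft-probability training described above is Hinton-style knowledge distillation. A minimal NumPy sketch (temperature and mixing weight are common defaults, not values from the article):

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Sketch of distillation: match the teacher's softened probabilities
    (the 'dark knowledge' between classes), blended with hard-label loss."""
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    p_teacher = softmax(teacher_logits / T)
    logp_student = np.log(softmax(student_logits / T))
    # soft loss: cross-entropy against the teacher's softened distribution,
    # scaled by T^2 to keep gradient magnitudes comparable
    soft = -(p_teacher * logp_student).sum(axis=1).mean() * T * T
    # hard loss: ordinary cross-entropy on the true labels
    hard = -np.log(softmax(student_logits))[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard
```

The soft targets carry inter-class similarity information that one-hot labels discard, which is what lets a much smaller student approach the teacher's accuracy.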
Parcae Stabilizes Loops to Match 2x Transformer Quality
Parcae enforces looped transformer stability via negative diagonal matrices in a dynamical system, outperforming baselines and achieving 87.5% of a twice-sized Transformer's quality at half parameters.
Build FNO & PINN Surrogates for Darcy Flow with PhysicsNeMo
Step-by-step Colab guide: generate 2D Darcy datasets via GRF & finite differences, implement/train FNO operators and PINNs, add CNN baselines, benchmark inference speeds for fast physics surrogates.
Word2Vec: Turning Word Neighborhoods into Embeddings
Word2Vec learns dense word vectors by predicting local contexts with CBOW or Skip-gram, clustering similar words like 'cat' and 'dog' via repeated gradient updates from shared neighborhoods.
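The "shared neighborhoods" mechanism comes down to how Skip-gram builds its training pairs: each center word predicts every word within a window, so words appearing in similar contexts receive similar updates. A minimal pair-generation sketch:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram training pairs: each word predicts its neighbors.

    Words that share neighborhoods ('cat'/'dog') end up as targets of
    the same contexts, receive similar gradient updates, and so land
    near each other in embedding space.
    """
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

CBOW simply reverses the direction: the context words jointly predict the center word.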
Batch GEMMs for Fast LSTM in Torch
Fuse LSTM operations into a single nngraph module that batches the 4 GEMMs, slashing overhead vs the standard nn.LSTM (optimized by @jcjohnson).
Batched L2 Norm Layer for Torch Neural Nets
Custom Torch nn.Module normalizes each row of n x d input tensor to unit L2 norm, with efficient batched forward/backward passes for training.
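A NumPy analogue of that Torch layer makes the batched forward/backward explicit (this is my sketch of the math, not the original Lua code). The backward pass uses the row-wise Jacobian of x/||x||:

```python
import numpy as np

def l2norm_forward(x, eps=1e-12):
    """Normalize each row of an (n, d) matrix to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True) + eps
    return x / norms, norms

def l2norm_backward(x, norms, grad_out):
    """Batched backward pass for row-wise L2 normalization.

    With y = x/||x||:  grad_x = (grad_out - y * <y, grad_out>) / ||x||,
    i.e. project out the component of the gradient along y, then rescale.
    """
    y = x / norms
    dot = (y * grad_out).sum(axis=1, keepdims=True)
    return (grad_out - y * dot) / norms
```

Both passes are pure matrix ops over the whole batch, which is the efficiency point the original module makes.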
Minimal NumPy RNN for Char-Level Text Gen
Build a vanilla RNN language model from scratch in ~170 lines of NumPy: processes text chunks of 25 chars, trains with BPTT and Adagrad, generates samples after 100 iterations.
NumPy Batched LSTM Forward/Backward
Efficient pure NumPy LSTM processes batched sequences (n,b,input_size); init with Xavier + forget bias=3; verified via sequential match and numerical gradients.
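The core of such a batched LSTM is a single fused matrix multiply per timestep. A one-step sketch (the [bias, input, hidden] weight layout is my assumption, in the spirit of the gist, not its exact code):

```python
import numpy as np

def lstm_step(x, h, c, WLSTM):
    """One batched LSTM step (sketch).

    WLSTM maps the concatenation [1, x, h] to all four gate
    pre-activations at once, so each step is a single GEMM.
    Initializing the forget-gate bias high (~3) keeps gates open
    early in training so gradients flow through the cell state.
    """
    n = h.shape[1]
    inp = np.hstack([np.ones((x.shape[0], 1)), x, h])   # bias, input, hidden
    z = inp @ WLSTM                                      # (b, 4n) pre-activations
    i = 1 / (1 + np.exp(-z[:, :n]))        # input gate
    f = 1 / (1 + np.exp(-z[:, n:2 * n]))   # forget gate
    o = 1 / (1 + np.exp(-z[:, 2 * n:3 * n]))  # output gate
    g = np.tanh(z[:, 3 * n:])              # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Verifying such an implementation against a sequential reference and numerical gradients, as the summary notes, is the standard sanity check.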
Policy Gradients for Pong: 100-Line RL Agent
Train a 2-layer NN to play Atari Pong from raw pixels using REINFORCE policy gradients. Uses 80x80 binary diff frames, discounts rewards with gamma=0.99, standardizes advantages, RMSProp updates every 10 episodes. Converges on CPU in hours.
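The reward-shaping step is the part most worth seeing in code: discount backwards through time, reset at game boundaries (a Pong-specific trick, since a nonzero reward marks the end of a point), then standardize so gradient scale stays stable across episodes:

```python
import numpy as np

def discount_and_standardize(rewards, gamma=0.99):
    """Discounted returns for REINFORCE, then standardized as advantages."""
    out = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0   # Pong-specific: a point ended, reset the sum
        running = running * gamma + rewards[t]
        out[t] = running
    # standardize so the policy-gradient scale is episode-independent
    return (out - out.mean()) / (out.std() + 1e-8)
```

These advantages then multiply the log-probability gradients of the sampled actions before the RMSProp update.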
Karpathy's Pure Python AI From Scratch
Andrej Karpathy distills neural nets, LLMs, RL, and Bitcoin into 200-500 line pure Python scripts—no deps needed—to teach core mechanics hands-on.
Pause Before Trust: AI Fooled My Instincts
AI generates undetectable fakes that exploit human trust shortcuts—train yourself to pause and question realistic audio, video, or text instead of believing instantly.
TurboQuant: 6x KV Cache Compression Without Attention Loss
TurboQuant rotates KV vectors before quantizing to 3.5 bits/channel (quality-neutral) or 2.5 bits (minor degradation), plus error repair, yielding 6x memory savings and up to 8x speedups for long-context LLMs.
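The rotate-then-quantize idea can be illustrated with a toy version (this is not TurboQuant's actual scheme, just the underlying intuition): an orthogonal rotation spreads outlier coordinates across all channels, so a uniform quantizer wastes fewer levels on rare extremes.

```python
import numpy as np

def rotate_quantize(v, bits, rng=None):
    """Toy rotate-then-quantize round trip (illustrative only).

    Rotate with a random orthogonal matrix, uniformly quantize to
    2^bits levels, dequantize, and rotate back. Returns the
    reconstructed vector.
    """
    rng = rng or np.random.default_rng(0)
    d = len(v)
    # random orthogonal matrix via QR of a Gaussian
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    r = Q @ v
    levels = 2 ** bits - 1
    lo, hi = r.min(), r.max()
    q = np.round((r - lo) / (hi - lo) * levels)   # quantize to integers
    r_hat = q / levels * (hi - lo) + lo           # dequantize
    return Q.T @ r_hat                            # rotate back
```

TurboQuant's contribution is doing this at 2.5-3.5 bits per channel with an error-repair step, far below what this naive uniform quantizer could tolerate.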