#deep-learning
Aurora Fixes Muon's Neuron Death in Tall MLPs
Aurora optimizer eliminates the >25% neuron death seen in Muon's tall matrices by jointly enforcing left semi-orthogonality and uniform row norms √(n/m), achieving SOTA on the nanoGPT speedrun with only 6% compute overhead.
NVIDIA Halves DSA Top-K Time via Decode Stability
NVIDIA exploits autoregressive decoding's temporal stability—similar queries and gradually evolving scores—to cut DeepSeek Sparse Attention's Top-K bottleneck by half using Guess-Verify-Refine.
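The guess-verify-refine idea can be sketched in a few lines. This is a toy illustration of the control flow only (function and variable names are mine, not NVIDIA's kernel): reuse last step's Top-K as the guess, verify it cheaply, and fall back to an exact Top-K only when the guess fails.

```python
import numpy as np

def guess_verify_refine_topk(scores, prev_topk, k):
    """Toy sketch of decode-stable Top-K (illustrative, not NVIDIA's kernel).

    Guess: reuse last step's Top-K indices (scores evolve gradually).
    Verify: check whether any index outside the guess beats the weakest
    guessed score; if none does, the guess is already correct.
    Refine: fall back to an exact Top-K only when verification fails.
    """
    guess = np.asarray(prev_topk)
    threshold = scores[guess].min()
    outside = np.setdiff1d(np.arange(len(scores)), guess)
    if (scores[outside] <= threshold).all():
        return np.sort(guess)  # verified: cheap path, no full scan
    # refine: exact Top-K over all scores
    return np.sort(np.argpartition(scores, -k)[-k:])
```

Because decode steps rarely shuffle the ranking, the cheap verified path dominates, which is where the claimed 2x savings comes from.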
Triple YOLO Recall with Adaptive Post-Processing
In crowded scenes, set YOLO confidence to 0.05, then filter dynamically by frame score distribution, box size (lower threshold for <5% height boxes), and pose keypoints (nose + shoulders) to detect 3x more people without retraining.
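The dynamic-threshold filter above can be sketched as a post-processing function (names and the exact threshold rules are my assumptions; the keypoint check is omitted):

```python
import numpy as np

def adaptive_filter(boxes, scores, frame_h, base_thr=0.25):
    """Sketch of the adaptive post-filter described above (names are mine).

    Detections come in at a very low confidence floor (e.g. 0.05); the
    per-frame threshold then adapts to this frame's score distribution,
    and small boxes (<5% of frame height) get a halved threshold.
    """
    scores = np.asarray(scores, dtype=float)
    boxes = np.asarray(boxes, dtype=float)  # rows of (x1, y1, x2, y2)
    # dynamic threshold: never stricter than this frame's median score
    dyn_thr = min(base_thr, float(np.median(scores)))
    heights = boxes[:, 3] - boxes[:, 1]
    small = heights < 0.05 * frame_h
    # small (distant) boxes keep a lower threshold so they survive
    keep = np.where(small, scores >= dyn_thr * 0.5, scores >= dyn_thr)
    return np.flatnonzero(keep)
```

A real pipeline would add the pose-keypoint check (nose + shoulders visible) as a final validation on the kept boxes.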
Build CLIP: 400M Images, Zero Labels via Contrastive Learning
CLIP trains vision models on 400 million scraped image-text pairs using a single contrastive objective—no manual labels needed—matching ResNet-101 zero-shot on ImageNet and powering DALL-E 2, Stable Diffusion, LLaVA.
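CLIP's single contrastive objective is compact enough to sketch directly: the only supervision is which image goes with which caption in the batch. A minimal NumPy version (temperature value illustrative):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Minimal sketch of CLIP's symmetric contrastive objective.

    Row i of img_emb should match row i of txt_emb; every other pairing
    in the batch acts as a negative. No labels beyond the pairing itself.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    targets = np.arange(len(logits))            # diagonal = correct pair

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[targets, targets].mean()

    # cross-entropy in both directions (image->text and text->image)
    return 0.5 * (xent(logits) + xent(logits.T))
```

Scaling this loss to 400M pairs is what yields the zero-shot transfer the summary describes.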
Generative AI: Prediction to Creation via Scale
Generative AI shifts machines from analyzing data (traditional AI's strength) to creating new content like text or images, powered by Markov chains, deep learning, and massive datasets and compute, drawing $33.9B of investment in 2024.
GPU Bandwidth Limits LLM Speed, Not FLOPS
Generating one token from a 70B model on H100 needs 140GB weight reads—one op per byte—making memory bandwidth the inference bottleneck, not compute throughput.
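The arithmetic behind this claim fits in one formula: each decode step streams all weights from HBM once, so tokens/s is bounded by bandwidth over model bytes. A back-of-envelope sketch (the ~3.35 TB/s figure assumes an H100 SXM's HBM3):

```python
def decode_tokens_per_sec(param_count, bytes_per_param, mem_bw_gbs):
    """Upper bound on decode speed from memory bandwidth alone.

    Every generated token must read all weights from HBM once, so
    tokens/s <= bandwidth / model_bytes. FLOPS are ignored because
    decode performs roughly one multiply-add per byte read.
    """
    model_bytes = param_count * bytes_per_param
    return mem_bw_gbs * 1e9 / model_bytes

# 70B params in fp16 (2 bytes) = 140 GB per token; at ~3.35 TB/s that
# caps single-stream decode at roughly 24 tokens/s, regardless of FLOPS.
```

This is why batching helps: the same 140 GB of weight reads can serve many sequences at once.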
Diffusion: Data-Efficient Framework Outshining Autoregressives on Scarce Data
Diffusion is a training framework—not an architecture—that manufactures extra training pairs by gradually noising clean data over ~1,000 steps, outperforming autoregressive models in the 25-100M-token regime where data is scarce but compute abundant; it still lags in text due to slow inference and immature infrastructure.
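The data-efficiency mechanism is the forward noising process: one clean sample yields T progressively noisier training pairs. A sketch with the standard linear beta schedule (schedule constants are the common defaults, an assumption on my part):

```python
import numpy as np

def noise_sample(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, rng=None):
    """Forward (noising) process with a linear beta schedule.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    Each clean x0 yields T noisy training pairs (x_t, eps), which is
    how diffusion squeezes extra supervision out of scarce data.
    """
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)          # cumulative signal kept
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

At t=0 the sample is nearly clean; by t=T-1 it is almost pure noise, and the model learns to predict eps at every level in between.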
PCL: Confidence RL for Dynamic LLM Environments (Caleb Writes Code)
PCL algorithm integrates predictive confidence scores into LLM RL rewards via ensembles and blended token/sequence signals, enabling adaptation to nonstationary changes without retraining.
Sentences Define Word Meanings via Self-Attention
Transformers ended 30 years of sequential processing flaws by using self-attention, where every word weighs relevance from the entire sentence context, powering GPT and all modern LLMs.
LLM Inference: mmap Loading & Quantization Deep Dive
Efficient LLM inference hinges on mmap for lazy memory loading (e.g., <10s startup on llama.cpp) and quantization like GGUF K-Quants or AWQ/EXL2 to shrink 15GB models while preserving quality via salient weights and mixed precision.
Preprocessing Swings CNN Accuracy from 65% to 87% on CIFAR-10 (Caleb Writes Code)
Raw CIFAR-10 pixels yield 65% test accuracy; normalization/standardization lift to 69%; geometric augmentation maintains ~67%; photometric brightness/contrast crashes to 20%; combined pipeline with deeper CNN hits 87%.
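The normalization/standardization step that drives the 65% to 69% jump is simple to sketch. One detail worth making explicit: the statistics must come from the training set only (computing them on test data would leak information).

```python
import numpy as np

def standardize(train_x, test_x):
    """Per-channel standardization for image batches (sketch).

    Mean/std are computed from the training split only, then applied
    to both splits, so the test set stays unseen.
    """
    # inputs shaped (N, H, W, 3), uint8 pixels in [0, 255]
    train = train_x.astype(np.float32) / 255.0
    test = test_x.astype(np.float32) / 255.0
    mean = train.mean(axis=(0, 1, 2))          # one mean per channel
    std = train.std(axis=(0, 1, 2)) + 1e-7     # avoid divide-by-zero
    return (train - mean) / std, (test - mean) / std
```

The photometric-augmentation collapse to 20% reported above is the cautionary counterpoint: not every preprocessing step helps, so each one should be benchmarked in isolation.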
53x AI Efficiency via Model Distillation by 2025
Train small 'student' models on large 'teacher' models' soft probabilities—not just labels—to match performance while slashing size, speed, and costs by 53x by 2025.
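The soft-probability training described above is Hinton-style knowledge distillation. A minimal NumPy sketch (temperature and mixing weight are common defaults, not values from the article):

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Sketch of distillation: match the teacher's softened probabilities
    (the 'dark knowledge' between classes), blended with hard-label loss."""
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    p_teacher = softmax(teacher_logits / T)
    logp_student = np.log(softmax(student_logits / T))
    # soft loss: cross-entropy against the teacher's softened distribution,
    # scaled by T^2 to keep gradient magnitudes comparable
    soft = -(p_teacher * logp_student).sum(axis=1).mean() * T * T
    # hard loss: ordinary cross-entropy on the true labels
    hard = -np.log(softmax(student_logits))[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard
```

The soft targets carry inter-class similarity information that one-hot labels discard, which is what lets a much smaller student approach the teacher's accuracy.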
Parcae Stabilizes Loops to Match 2x Transformer Quality
Parcae enforces looped transformer stability via negative diagonal matrices in a dynamical system, outperforming baselines and achieving 87.5% of a twice-sized Transformer's quality at half parameters.
Build FNO & PINN Surrogates for Darcy Flow with PhysicsNeMo
Step-by-step Colab guide: generate 2D Darcy datasets via GRF & finite differences, implement/train FNO operators and PINNs, add CNN baselines, benchmark inference speeds for fast physics surrogates.
Word2Vec: Turning Word Neighborhoods into Embeddings
Word2Vec learns dense word vectors by predicting local contexts with CBOW or Skip-gram, clustering similar words like 'cat' and 'dog' via repeated gradient updates from shared neighborhoods.
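The "shared neighborhoods" mechanism comes down to how Skip-gram builds its training pairs: each center word predicts every word within a window, so words appearing in similar contexts receive similar updates. A minimal pair-generation sketch:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram training pairs: each word predicts its neighbors.

    Words that share neighborhoods ('cat'/'dog') end up as targets of
    the same contexts, receive similar gradient updates, and so land
    near each other in embedding space.
    """
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

CBOW simply reverses the direction: the context words jointly predict the center word.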
Batch GEMMs for Fast LSTM in Torch
Fuse LSTM operations into a single nngraph module that batches the 4 GEMMs, slashing overhead vs the standard nn.LSTM (optimized by @jcjohnson).
Batched L2 Norm Layer for Torch Neural Nets
Custom Torch nn.Module normalizes each row of n x d input tensor to unit L2 norm, with efficient batched forward/backward passes for training.
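A NumPy analogue of that Torch layer makes the batched forward/backward explicit (this is my sketch of the math, not the original Lua code). The backward pass uses the row-wise Jacobian of x/||x||:

```python
import numpy as np

def l2norm_forward(x, eps=1e-12):
    """Normalize each row of an (n, d) matrix to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True) + eps
    return x / norms, norms

def l2norm_backward(x, norms, grad_out):
    """Batched backward pass for row-wise L2 normalization.

    With y = x/||x||:  grad_x = (grad_out - y * <y, grad_out>) / ||x||,
    i.e. project out the component of the gradient along y, then rescale.
    """
    y = x / norms
    dot = (y * grad_out).sum(axis=1, keepdims=True)
    return (grad_out - y * dot) / norms
```

Both passes are pure matrix ops over the whole batch, which is the efficiency point the original module makes.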
Minimal NumPy RNN for Char-Level Text Gen
Build a vanilla RNN language model from scratch in ~170 lines of NumPy: processes text chunks of 25 chars, trains with BPTT and Adagrad, generates samples after 100 iterations.
NumPy Batched LSTM Forward/Backward
Efficient pure NumPy LSTM processes batched sequences (n,b,input_size); init with Xavier + forget bias=3; verified via sequential match and numerical gradients.
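The core of such a batched LSTM is a single fused matrix multiply per timestep. A one-step sketch (the [bias, input, hidden] weight layout is my assumption, in the spirit of the gist, not its exact code):

```python
import numpy as np

def lstm_step(x, h, c, WLSTM):
    """One batched LSTM step (sketch).

    WLSTM maps the concatenation [1, x, h] to all four gate
    pre-activations at once, so each step is a single GEMM.
    Initializing the forget-gate bias high (~3) keeps gates open
    early in training so gradients flow through the cell state.
    """
    n = h.shape[1]
    inp = np.hstack([np.ones((x.shape[0], 1)), x, h])   # bias, input, hidden
    z = inp @ WLSTM                                      # (b, 4n) pre-activations
    i = 1 / (1 + np.exp(-z[:, :n]))        # input gate
    f = 1 / (1 + np.exp(-z[:, n:2 * n]))   # forget gate
    o = 1 / (1 + np.exp(-z[:, 2 * n:3 * n]))  # output gate
    g = np.tanh(z[:, 3 * n:])              # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Verifying such an implementation against a sequential reference and numerical gradients, as the summary notes, is the standard sanity check.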
Policy Gradients for Pong: 100-Line RL Agent
Train a 2-layer NN to play Atari Pong from raw pixels using REINFORCE policy gradients. Uses 80x80 binary diff frames, discounts rewards with gamma=0.99, standardizes advantages, RMSProp updates every 10 episodes. Converges on CPU in hours.
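The reward-shaping step is the part most worth seeing in code: discount backwards through time, reset at game boundaries (a Pong-specific trick, since a nonzero reward marks the end of a point), then standardize so gradient scale stays stable across episodes:

```python
import numpy as np

def discount_and_standardize(rewards, gamma=0.99):
    """Discounted returns for REINFORCE, then standardized as advantages."""
    out = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0   # Pong-specific: a point ended, reset the sum
        running = running * gamma + rewards[t]
        out[t] = running
    # standardize so the policy-gradient scale is episode-independent
    return (out - out.mean()) / (out.std() + 1e-8)
```

These advantages then multiply the log-probability gradients of the sampled actions before the RMSProp update.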
Karpathy's Pure Python AI From Scratch
Andrej Karpathy distills neural nets, LLMs, RL, and Bitcoin into 200-500 line pure Python scripts—no deps needed—to teach core mechanics hands-on.
Pause Before Trust: AI Fooled My Instincts
AI generates undetectable fakes that exploit human trust shortcuts—train yourself to pause and question realistic audio, video, or text instead of believing instantly.
TurboQuant: 6x KV Cache Compression Without Attention Loss
TurboQuant rotates KV vectors before quantizing to 3.5 bits/channel (quality-neutral) or 2.5 bits (minor degradation), plus error repair, yielding 6x memory savings and up to 8x speedups for long-context LLMs.
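The rotate-then-quantize idea can be illustrated with a toy version (this is not TurboQuant's actual scheme, just the underlying intuition): an orthogonal rotation spreads outlier coordinates across all channels, so a uniform quantizer wastes fewer levels on rare extremes.

```python
import numpy as np

def rotate_quantize(v, bits, rng=None):
    """Toy rotate-then-quantize round trip (illustrative only).

    Rotate with a random orthogonal matrix, uniformly quantize to
    2^bits levels, dequantize, and rotate back. Returns the
    reconstructed vector.
    """
    rng = rng or np.random.default_rng(0)
    d = len(v)
    # random orthogonal matrix via QR of a Gaussian
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    r = Q @ v
    levels = 2 ** bits - 1
    lo, hi = r.min(), r.max()
    q = np.round((r - lo) / (hi - lo) * levels)   # quantize to integers
    r_hat = q / levels * (hi - lo) + lo           # dequantize
    return Q.T @ r_hat                            # rotate back
```

TurboQuant's contribution is doing this at 2.5-3.5 bits per channel with an error-repair step, far below what this naive uniform quantizer could tolerate.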