#llm
NVIDIA's 10x Workflows with Codex on GPT-5.5
NVIDIA's 40k engineers use Codex (GPT-5.5) to autonomously build production systems in hours and run full ML research cycles, delivering 10x speedups and 20x code efficiency gains.
Parameter Golf: Creativity in Tiny ML Models
OpenAI's 16MB/10-min ML challenge drew 1,000+ participants and 2,000+ submissions, showcasing optimizations, quantization, novel architectures, and AI agents' role in accelerating research while creating review challenges.
Interaction Models: Native Real-Time Multimodal AI
Replace turn-based AI harnesses with native interaction models using 200ms micro-turns for continuous audio/video/text processing, enabling proactive visuals and simultaneous speech—outperforming GPT/Gemini on interaction benchmarks.
Modular Hybrid-Memory Agent with OpenAI Tools
Build a production-ready autonomous agent in Python using hybrid vector+BM25 memory fused by RRF (K=60), modular tool dispatch, and a self-managing loop limited to 8 tool rounds for reliable reasoning and action.
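The fusion step above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion with K=60 as stated in the summary; the function name and document IDs are hypothetical, not from the described agent.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each retriever contributes
    1/(k + rank) per document, and documents are re-sorted
    by their summed score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers:
vector_hits = ["d3", "d1", "d7"]   # dense vector search
bm25_hits = ["d1", "d7", "d4"]     # lexical BM25 search
fused = rrf_fuse([vector_hits, bm25_hits])  # d1 ranks first
```

A document ranked moderately well by both retrievers (d1, d7) beats one ranked first by only a single retriever (d3), which is the point of the fusion.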
AntAngelMed: 103B MoE Medical LLM Matches 40B Dense at 7x Speed
103B-param open-source medical LLM activates only 6.1B params via 1/32 MoE, rivals 40B dense models with 7x efficiency, tops HealthBench/MedBench, runs 200+ tps on H20.
GPU-Orchestrated Multi-Agent Sustainability Intelligence Blueprint
Chelsie Czop and Mitesh Patel demo a serverless multi-agent app using Google ADK, Gemma 4 on NVIDIA RTX PRO 6000 GPUs via Cloud Run, and Milvus RAG for real-time environmental risk reports from satellite, telemetry, and policy data.
RL Industrializes GenAI Production via Feedback Loops
95% of GenAI pilots fail in production because instruction tuning and prompts can't systematically integrate defects and metrics. RL can, enabling smaller/cheaper/faster models at Fortune 500s like AT&T, where token costs run into the millions.
Gemini Enables Agentic Tasks and Prompt-Based Widgets on Android
Google's Gemini on Android now automates multi-app tasks like grocery shopping from notes to cart, browses web for bookings, fills forms, dictates naturally, and generates widgets from natural language descriptions—rolling out summer 2026 on Pixel/Samsung first.
Anthropic Bolsters Claude for Legal Automation Boom
Anthropic launches legal plugins and MCP connectors for Claude to automate law firm tasks like document review and drafting, entering a market where Harvey raised $200M at $11B valuation and Legora secured $600M Series D at $5.6B valuation.
Malleable Evals: Adaptive Testing for Changing AI Agents
Static benchmarks fail self-adapting agents; use production traces for agent-curated, always-on eval suites that self-optimize toward user intent.
ChatGPT Adoption Broadens Across Demographics, Geography in 2026Q1
Q1 2026 consumer data shows ChatGPT usage growing among feminine-named users (>50% share), over-35s gaining share, emerging markets (e.g., Haiti +9 per-capita rank), and specialized work tasks like health docs.
CoCoDA: Co-Evolve DAGs to Scale Tool-Augmented Agents
CoCoDA uses a compositional code DAG to jointly evolve tool libraries and planners, enabling efficient retrieval from growing libraries and letting an 8B model match or beat a 32B teacher on GSM8K and MATH benchmarks.
Aurora Fixes Muon's Neuron Death in Tall MLPs
Aurora optimizer eliminates >25% neuron death in Muon's tall matrices by jointly enforcing left semi-orthogonality and uniform row norms √(n/m), delivering SOTA on nanoGPT speedrun with 6% compute overhead.
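The row-norm condition from the summary can be written out concretely. For a tall m×n matrix, left semi-orthogonality makes the squared row norms sum to n, so a uniform distribution gives each row norm √(n/m). The sketch below shows only that rescaling, not Aurora's actual update rule, and the function name is hypothetical.

```python
import math

def uniform_row_norms(W):
    """Rescale each row of an m-by-n matrix (list of lists) to norm
    sqrt(n/m), so no output neuron's weight row collapses toward
    zero ("neuron death")."""
    m, n = len(W), len(W[0])
    target = math.sqrt(n / m)
    rescaled = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rescaled.append([v * target / norm for v in row])
    return rescaled

W = uniform_row_norms([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
```

After rescaling, every row of the 3×2 example has norm √(2/3), so the total squared mass still matches the semi-orthogonality budget of n = 2.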
Daybreak: AI Agents for Proactive Vuln Patching
OpenAI's Daybreak expands Codex Security (launched March 2026) to ingest repos, build threat models, validate patches in isolation, and propose fixes with human review—reducing analysis from hours to minutes via tiered GPT-5.5 models gated by Trusted Access for Cyber.
Full-Duplex AI Responds in 0.40s Like Human Speech
Thinking Machines Lab's interaction models enable simultaneous listening and responding in AI conversations at 0.40s latency, faster than OpenAI and Google rivals.
LLM Distillation: Soft-Label, Hard-Label, and Co-Distillation Explained
Distill large teacher LLMs into efficient students via soft-label (match probabilities for dark knowledge), hard-label (imitate outputs for cheap scalability), or co-distillation (joint training to minimize performance gaps).
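The soft-label objective can be sketched for a single token position. This is the standard temperature-softened KL formulation (a minimal pure-Python illustration, not any particular library's API); the temperature and logits are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2. The softened teacher probabilities carry the
    "dark knowledge" about relative similarity between wrong answers."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature * temperature * kl

loss = soft_label_kd_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3])
```

Hard-label distillation replaces the softened teacher distribution with the teacher's argmax as a one-hot target, which is cheaper (teacher outputs can be precomputed once) but discards the inter-class similarity signal.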
BLT Cuts Inference Bandwidth 50-92% via Diffusion & Speculation
Meta/Stanford researchers accelerate Byte Latent Transformer (BLT) inference with BLT-D (diffusion decoding), BLT-S (self-speculation), and BLT-DV (diffusion+verification), reducing memory bandwidth 50-92% at 3B params while nearing baseline performance on translation/coding tasks.
HTML Replaces Markdown for Interactive AI Outputs
Prompt AI agents for single-file HTML instead of long Markdown reports to create navigable, editable, interactive artifacts that humans can actually use, review, share, and act on.
GPT-5.5 Instant Cuts Hallucinations 52.5%, Adds Personalization
GPT-5.5 Instant replaces GPT-5.3 as ChatGPT default, slashing hallucinated claims by 52.5% on high-stakes prompts like medicine/law/finance, using 30% fewer words for concise answers, and personalizing via past chats/files/Gmail with new memory controls.
Uber's OpenAI-Powered Multi-Agent AI Optimizes Earnings and Booking
Uber deploys OpenAI models via a multi-agent architecture for Uber Assistant, delivering real-time driver guidance from marketplace data and voice-based ride booking, accelerating a new-driver ramp-up that previously took hundreds of trips of trial and error.
Singular Bank's AI Cuts Banker Prep by 90 Minutes/Day
Singular Bank's Singularity, powered by ChatGPT and Codex, delivers real-time portfolio analysis, action recommendations, and compliant comms, saving bankers 60-90 min/day on routine tasks.
Simplex Cuts Screen Dev Time 70% with Codex Agent
Simplex deploys OpenAI Codex as its primary coding agent across design, dev, and testing, yielding 70% less development time per screen, 40% less for design, and 17% less for integration testing on CRUD web apps.
ChatGPT Trains on Filtered Data with User Opt-Outs
OpenAI trains ChatGPT on public web data and opt-in user conversations, using Privacy Filter to mask PII before training; users control data via opt-out settings, 30-day Temporary Chats, and optional Memory.
OpenAI's Ad Principles for ChatGPT Free Tiers
OpenAI tests contextual ads in ChatGPT free/Go tiers to fund access without biasing answers, sharing chats, or limiting controls—ads match conversation topics using aggregate data only.
OpenAI's Realtime Voice Models Add Reasoning, Translation, Transcription
OpenAI's new API models—GPT-Realtime-2 for GPT-5-class voice reasoning with tools, GPT-Realtime-Translate for 70+ input to 13 output languages, and GPT-Realtime-Whisper for streaming transcription—enable natural voice agents that reason, act, and handle multilingual conversations in real time.
GPT-5.5's Trusted Access Scales Cyber Defenses Safely
OpenAI's Trusted Access for Cyber (TAC) tiers GPT-5.5 access for verified defenders: standard for general use, TAC-reduced refusals for workflows like vuln triage/malware analysis, GPT-5.5-Cyber preview for red-teaming, blocking offensive misuse while accelerating defenses.
Parloa's AMP: No-Code Voice Agents via Sims & Evals
Parloa’s AMP lets non-technical users define voice AI agents in natural language, simulates conversations with GPT models as caller/agent, evaluates via LLM judges + rules, and deploys reliably—cutting human escalations 80% in one travel firm.
TwELL Delivers 20% LLM Speedups via GPU-Optimized Sparsity
Use ReLU gate activations plus an L1 penalty (coefficient 2e-5) on hidden activations to induce 99.5% sparsity in feedforward layers; TwELL's CUDA kernels then yield 20.5% inference and 21.9% training speedups on H100s with no accuracy loss.
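The training-time penalty described above can be sketched without the GPU kernels. This is a pure-Python toy of one feedforward block, assuming (hypothetically) that the L1 term is simply added to the training loss; function and variable names are illustrative, and TwELL's CUDA kernels are what later exploit the resulting exact zeros.

```python
def relu(xs):
    return [max(0.0, v) for v in xs]

def ffn_forward_with_l1(x, W_in, W_out, l1=2e-5):
    """One feedforward block with ReLU-gated hidden activations.
    Returns the output plus the L1 sparsity penalty to be added to
    the training loss, which pushes most hidden activations to
    exact zero."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in W_in])
    out = [sum(w * h for w, h in zip(row, hidden)) for row in W_out]
    penalty = l1 * sum(hidden)  # hidden is non-negative after ReLU
    return out, penalty

out, penalty = ffn_forward_with_l1(
    [1.0, -1.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # 3 hidden neurons
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],    # 2 outputs
)
```

In the toy above only one of the three hidden activations is nonzero after the ReLU; at 99.5% sparsity a sparse kernel can skip almost all of the second matmul's columns.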
Memori: Persistent Memory for Multi-User LLM Agents
Register OpenAI clients with Memori to automatically store/retrieve scoped memories by user entity, agent process, and session, enabling context-aware agents across turns, users, and interactions without manual prompt management.
2026 Vector DBs: Match Scale, Cost, Stack for RAG Success
Leverage existing Postgres/MongoDB via pgvector (millions of vectors, free) or Atlas Flex (capped at $30/mo) to avoid database sprawl; self-host Qdrant ($30-50/mo for 50M vectors) for performance; pick Pinecone ($20/mo) or Milvus (100B+ vectors) for managed scale.