#agents
Every summary, chronological. Filter by category, tag, or source from the rail.
NVIDIA's 10x Workflows with Codex on GPT-5.5
NVIDIA's 40k engineers use Codex (GPT-5.5) to autonomously build production systems in hours and run full ML research cycles, delivering 10x speedups and 20x code efficiency gains.
Parameter Golf: Creativity in Tiny ML Models
OpenAI's 16MB/10-min ML challenge drew 1,000+ participants and 2,000+ submissions, showcasing optimizations, quantization, novel architectures, and AI agents' role in accelerating research while creating review challenges.
Interaction Models: Native Real-Time Multimodal AI
Replace turn-based AI harnesses with native interaction models using 200ms micro-turns for continuous audio/video/text processing, enabling proactive visuals and simultaneous speech—outperforming GPT/Gemini on interaction benchmarks.
Modular Hybrid-Memory Agent with OpenAI Tools
Build a production-ready autonomous agent in Python using hybrid vector+BM25 memory fused by RRF (K=60), modular tool dispatch, and a self-managing loop limited to 8 tool rounds for reliable reasoning and action.
Build Stateful Agents with File Systems & AI SDK v6
Give agents persistent sandboxes, bash tools, and memory files via AI SDK v6 to make them follow long tasks, build on prior work, and generate reusable Python scripts without manual context management.
GPU-Orchestrated Multi-Agent Sustainability Intelligence Blueprint
Chelsie Czop and Mitesh Patel demo a serverless multi-agent app using Google ADK, Gemma 4 on NVIDIA RTX PRO 6000 GPUs via Cloud Run, and Milvus RAG for real-time environmental risk reports from satellite, telemetry, and policy data.
RL Industrializes GenAI Production via Feedback Loops
95% of GenAI pilots fail production because instruction tuning and prompts can't systematically integrate defects and metrics. RL does, enabling smaller/cheaper/faster models that scale to millions in token costs at Fortune 500s like AT&T.
Gemini Enables Agentic Tasks and Prompt-Based Widgets on Android
Google's Gemini on Android now automates multi-app tasks like grocery shopping from notes to cart, browses web for bookings, fills forms, dictates naturally, and generates widgets from natural language descriptions—rolling out summer 2026 on Pixel/Samsung first.
Malleable Evals: Adaptive Testing for Changing AI Agents
Static benchmarks fail self-adapting agents; use production traces for agent-curated, always-on eval suites that self-optimize toward user intent.
CoCoDA: Co-Evolve DAGs to Scale Tool-Augmented Agents
CoCoDA uses a compositional code DAG to jointly evolve tool libraries and planners, enabling efficient retrieval from growing libraries and letting an 8B model match or beat a 32B teacher on GSM8K and MATH benchmarks.
Night Shift: Agents Run Recurring Jobs Automatically
Delegate repetitive tasks to AI agents using the Night Shift pattern—shared interface + scheduled skills + brief human reviews—so agents handle work overnight, surfacing only decisions needing your input.
Vapi's Control-Focused Voice AI Wins Ring, Hits $500M Val
Vapi beat 40 rivals to handle 100% of Amazon Ring's calls by giving engineers granular AI control, fueling $50M Series B at $500M valuation and 1B+ calls processed.
Agent OS Makes AI Agents Reliable and Scalable
Current AI agents are stateless 'goldfish' that forget tasks instantly. An Agent OS adds scheduling, memory, tools, identity, observability, and guardrails to manage them like a computer OS manages apps, enabling safe scaling.
Daybreak: AI Agents for Proactive Vuln Patching
OpenAI's Daybreak expands Codex Security (launched March 2026) to ingest repos, build threat models, validate patches in isolation, and propose fixes with human review—reducing analysis from hours to minutes via tiered GPT-5.5 models gated by Trusted Access for Cyber.
GM Cuts 600 IT Jobs to Hire AI-Native Engineers
GM laid off 600 IT workers (10% of department) to recruit specialists in agent/model development, prompt engineering, data pipelines—showing enterprises must rebuild teams for production AI, not just add tools.
Embed Pi Coding Agents via CLI Tools in Products
Pi's minimal TypeScript SDK powers LLM agents that loop tools; expose CRM/ERP data as secure CLIs for natural agent use, as in a B2B sales pipeline routing RFP emails to per-customer sessions that output inbox drafts.
Harness Engineering: Stack Rules, Skills & Agents for Reliable AI Dev
Harness Engineering builds reliable AI code generation by stacking Rules (guidelines), Skills (SOPs), Sub-Agents (roles), Workflows (handoffs), Scripts (gates), and MCP (external tools) into a verifiable system, demonstrated in a minimal Go CLI project.
HTML Replaces Markdown for Interactive AI Outputs
Prompt AI agents for single-file HTML instead of long Markdown reports to create navigable, editable, interactive artifacts that humans can actually use, review, share, and act on.
Frontier Firms Use 3.5x More AI Depth Per Worker
Frontier firms (95th percentile) now demand 3.5x more intelligence per worker than typical firms (up from 2x), driven by complex agentic workflows like 16x more Codex use, not just message volume.
Uber's OpenAI-Powered Multi-Agent AI Optimizes Earnings and Booking
Uber deploys OpenAI models via multi-agent architecture for Uber Assistant, delivering real-time driver guidance from marketplace data and voice-based ride booking, accelerating new driver ramp-up versus hundreds of trips via trial-and-error.
Simplex Cuts Screen Dev Time 70% with Codex Agent
Simplex deploys OpenAI Codex as primary coding agent across design, dev, and testing, yielding 70% less time per screen developed, 40% for design, and 17% for integration testing on CRUD web apps.
OpenAI's Realtime Voice Models Add Reasoning, Translation, Transcription
OpenAI's new API models—GPT-Realtime-2 for GPT-5-class voice reasoning with tools, GPT-Realtime-Translate for 70+ input to 13 output languages, and GPT-Realtime-Whisper for streaming transcription—enable natural voice agents that reason, act, and handle multilingual convos in real time.
Parloa's AMP: No-Code Voice Agents via Sims & Evals
Parloa’s AMP lets non-technical users define voice AI agents in natural language, simulates conversations with GPT models as caller/agent, evaluates via LLM judges + rules, and deploys reliably—cutting human escalations 80% in one travel firm.
OpenAI's Codex Controls: Sandbox, Rules, Telemetry
OpenAI deploys Codex coding agents with sandboxing for bounded execution, auto-approvals for low-risk actions, network/command restrictions, and OpenTelemetry logs to enable safe, auditable developer workflows without broad access.
Scaling AI Agents to Slack Company Coworkers
Viktor turns personal AI agents into company employees by living in Slack, inheriting one-time integrations for 3,000 tools, isolating memory across channels/DMs, and handling Slack's complex inputs like threads, edits, and drifts—while preserving model personality for user trust.
Memori: Persistent Memory for Multi-User LLM Agents
Register OpenAI clients with Memori to automatically store/retrieve scoped memories by user entity, agent process, and session, enabling context-aware agents across turns, users, and interactions without manual prompt management.
Replay Logs Fail Agents: Use VM Snapshots Instead
Replay durability constrains agent code with growing logs; split into context logs (DB durable) and execution snapshots (14MB Firecracker VMs, <1s save/100ms restore) for multi-day sessions.
AI EngineerFix Agent Context with Head/Tail + Memory, Not Summaries
Truncation breaks reasoning by forgetting history; summarization lacks control. Head/tail truncation preserves key context (first/last 100 chars), stores middle in retrievable memory, and offloads heavy tasks to sub-agents for reliable performance.
AI Agents Surge in Finance and Productivity Tools
Anthropic offers 10 finance agent templates for Claude; Perplexity launches finance workflows; Cursor spawns parallel subagents; Claude code limits double for faster dev workflows.
OpenClaw and Passion Beat Hierarchy in LLM Teams
Luo Fuli leads Xiaomi's 100-person MiMo LLM team with no titles or sub-teams, using OpenClaw agents to cut research from 30-40 weeks to 3-4 weeks, proving passion and frameworks outperform traditional management.
Showing 30 of 729