Services as AI's Enterprise Revenue Frontier

AI labs recognize that raw model intelligence requires custom integration to deliver business value, and dedicated services firms are emerging to close that gap. Anthropic's $1.5B joint venture with Blackstone, Hellman & Friedman, and Goldman Sachs ($300M each) deploys small teams to co-develop Claude-powered systems tailored to each client's operations, starting with impact assessments. OpenAI's Deployment Company, backed by 19 investors including TPG and Bain Capital, raised $4B at a $10B pre-money valuation under COO Brad Lightcap to sell enterprise software via PE partnerships. Finance leads the verticals: it is Anthropic's second-highest revenue segment, marked by an NYC event and templates for pitch generation, valuation review, KYC, and month-end close that integrate FactSet, S&P, and Morningstar data; Perplexity offers 35 finance workflows plus licensed data. Startups like Tessera compete on system integration, but with far less capital. Aaron Levie notes that agents demand IT upgrades, workflow modernization, context provisioning, human-agent handoffs, adoption, and change management; there are no shortcuts, which creates jobs and firms. Builders: target services for stable revenue over raw API calls, as labs prioritize 'last mile' deployment.

Agent Performance Hinges on Harness, Not Just Models

Model quality alone fails to predict agent success; Model–Harness–Task fit dominates. Harnesses a model was natively post-trained for outperform open or AGI-style generalization; productized agents rely on instructions, tools, context packing, and measurement loops, and base models expose this gap. The coding UX is fragmenting: Hermes beats DeepSeek-TUI/OpenCode on success rate, speed, and cost; Codex overtook Claude Code in downloads, while Claude feels stagnant. Benchmarks like Meta's ProgramBench (200 tasks building SQLite, FFmpeg, and a PHP compiler from specs, with no reference code or internet access) show 0% top accuracy even though models pass >50% of tests per task; the strict all-tests-pass criterion prevents gaming. Practical wins: Cursor auto-fixes CI failures; Cognition's Devin for Security remediates vulnerabilities and flagged the malicious axios package before public disclosure. Observability is evolving: attach feedback to traces to close learning loops (gather data → mine errors → localize → fix → test); tools like Raindrop Triage hunt for bad behavior. Builders: decouple harnesses (ACP-style) so frontends can be swapped; optimize cache hits, the main cost axis; design for long horizons via macro actions and horizon reduction to beat capacity limits.
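The learning loop above (gather data → mine errors → localize → fix → test) can be sketched in a few lines. The `Trace` shape, `ok` flag, and `score` field here are hypothetical stand-ins for illustration, not the actual schema of Raindrop or any other observability tool:

```python
# Sketch of a trace-feedback loop: mine negatively scored traces,
# then localize the first failing step. Schema is hypothetical.
from dataclasses import dataclass, field

@dataclass
class Trace:
    trace_id: str
    steps: list                                   # each: {"tool": str, "ok": bool}
    feedback: list = field(default_factory=list)  # each: {"score": int}

def mine_errors(traces):
    """Keep only traces a user or evaluator scored negatively."""
    return [t for t in traces if any(f["score"] < 0 for f in t.feedback)]

def localize(trace):
    """Return the index of the first failing step, or None if all succeeded."""
    for i, step in enumerate(trace.steps):
        if not step.get("ok", True):
            return i
    return None

# Usage: two traces, one with a failed tool call and a thumbs-down.
good = Trace("t1", [{"tool": "search", "ok": True}], [{"score": 1}])
bad = Trace("t2", [{"tool": "search", "ok": True},
                   {"tool": "edit", "ok": False}], [{"score": -1}])
suspects = mine_errors([good, bad])
print(suspects[0].trace_id, localize(suspects[0]))  # t2 1
```

The point of attaching feedback to traces rather than to whole sessions is exactly this localization step: a negative score becomes actionable once it points at a specific tool call to fix and regression-test.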

Inference Efficiency Leaps with Open MTP and Systems

Speculative decoding is maturing in the open: Google's Gemma 4 MTP drafters (e.g., gemma-4-31B-it-assistant, 78M E2B) yield up to 3x faster decoding without quality loss, integrated day-zero into Transformers, vLLM, MLX, SGLang, Ollama, and AI Edge. Llama.cpp's beta MTP (PR #22673) hits 75% acceptance with 3 draft tokens and >2x throughput on Qwen3 27B/35B-A3B, narrowing the gap with vLLM for dense models. RadixArk's $100M seed scales SGLang/Miles for frontier inference, RL, and orchestration across hardware, sparing teams per-team rebuilds of KV-cache management and scheduling. Provider economics vary wildly: SambaNova hits 435 tok/s, Fireworks leads on speed/price, and cache optimization cuts agent costs. Cold starts drop 60x by serving weights from GPU; DeepMind's Decoupled DiLoCo hits 88% goodput (vs. 27%) with 240x less bandwidth. OpenAI's GPT-5.5 Instant is now the default in ChatGPT and the API, with gains in factuality, intelligence, image handling, and tone, plus personalization across memories, chats, files, and Gmail, and a WebRTC rebuild (thin relay + stateful transceiver) for lower voice latency. The Agents SDK ships in TypeScript with a sandbox and open harness. Builders: prioritize MTP for throughput; compare providers on tok/s, cache behavior, and cost; build RL environments that scale to thousands (Forge/ROLL/Slime/Seer).
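A minimal sketch of why MTP drafters speed up decoding: a cheap drafter proposes several tokens, the target model verifies them, and the matching prefix is accepted so the expensive model advances multiple positions per verification. The function names and toy `draft_next`/`target_next` callables below are illustrative assumptions, not any library's API, and this greedy exact-match variant omits the probabilistic acceptance rule production systems use:

```python
def speculative_step(draft_next, target_next, prefix, k=3):
    """One simplified speculative-decoding step with exact-match acceptance.

    draft_next/target_next: callables mapping a token list to the next token
    (stand-ins for a cheap MTP drafter and the full target model).
    Returns the extended sequence and how many draft tokens were accepted.
    """
    # 1) Drafter proposes k tokens autoregressively (cheap).
    seq = list(prefix)
    drafts = []
    for _ in range(k):
        t = draft_next(seq)
        seq.append(t)
        drafts.append(t)

    # 2) Target verifies: accept the longest prefix it agrees with, then
    #    contribute one token of its own. Real systems check all k
    #    positions in a single batched forward pass of the target model,
    #    which is where the wall-clock speedup comes from.
    seq = list(prefix)
    accepted = 0
    for t in drafts:
        if target_next(seq) != t:
            break
        seq.append(t)
        accepted += 1
    seq.append(target_next(seq))
    return seq, accepted

# Toy models: the target counts up; the drafter agrees until position 5.
target = lambda s: len(s)
drafter = lambda s: len(s) if len(s) < 5 else 99
out, n = speculative_step(drafter, target, [0, 1, 2])
print(out, n)  # [0, 1, 2, 3, 4, 5] 2
```

The 75% acceptance rate reported for llama.cpp's MTP branch maps directly onto `accepted / k` here: the higher the drafter's agreement with the target, the more target forward passes are amortized per step.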