Summaries · #inference

DAY 01 · Yesterday · JUN 29, 2026 · 7 SUMMARIES

Latent Space (Newsletter) · Inference & Serving · Jun 29, 2026

SpaceX's Neocloud and the Rise of Owned Intelligence

SpaceX is emerging as a major compute provider with $28B in annualized GPU rental deals, while developers increasingly prioritize 'owned intelligence' via open-weight models like GLM-5.2 to gain control over their AI stacks.

Latent Space (Newsletter) · Models & Frontier Labs · Jun 29, 2026

OpenAI's GPT-5.6 Launch: Frontier Models as Managed Assets

OpenAI released the GPT-5.6 family (Sol, Terra, Luna) as a restricted, government-mediated preview, signaling a shift where release governance is now a core component of the model specification.

Latent Space (Newsletter) · Agents & Orchestration · Jun 29, 2026

The Rise of Meta-Harnesses and Vertical AI Integration

The AI industry is shifting toward 'meta-harnesses'—standardized agent orchestration layers—while frontier labs move toward vertical integration of custom silicon and agent-native UX.

Latent Space (Newsletter) · Agents & Orchestration · Jun 29, 2026

Internal AI Adoption & The Rise of Agentic Workflows

OpenAI reports massive internal token growth across all departments, signaling that agentic workflows—supported by review loops and persistent infrastructure—are moving from experimental to core production patterns.

Together AI Blog · Inference & Serving · Jun 29, 2026

ParallelKernelBench: Frontier LLMs Struggle with Multi-GPU Kernels

While LLMs excel at single-GPU kernel generation, they currently struggle with multi-GPU tasks where communication bottlenecks and complex rank coordination dominate performance.

Hugging Face Blog · Inference & Serving · Jun 29, 2026

Deploying vLLM Endpoints on Hugging Face Jobs

Hugging Face Jobs allows engineers to spin up private, OpenAI-compatible vLLM endpoints on demand using a single command, providing a pay-per-second alternative for testing and experimentation.

AI Engineer · Inference & Serving · Jun 29, 2026

Prototype Big, Deploy Small: A Framework for Local LLM Adoption

Stop overpaying for frontier models. With a 'prototype big, deploy small' framework and rigorous capability evals, you can identify 'Sage' (Small and Good Enough) models that deliver production-grade performance on-device while cutting costs and latency.

DAY 02 · Sunday · JUN 28, 2026 · 3 SUMMARIES

Machine Learning Street Talk · Inference & Serving · Jun 28, 2026

Thermodynamic Computing and the Future of AI-Driven Chip Design

Thomas Ahle of Normal Computing discusses using AI agents to automate chip design, the risks of 'understanding debt' in agentic code, and the development of thermodynamic chips that use physical noise to perform stochastic computations.

AI Engineer · Agents & Orchestration · Jun 28, 2026

Building Low-Latency Voice-In, Visuals-Out AI Agents

To achieve a seamless AI UX, shift from voice-in/voice-out to voice-in/visuals-out. This leverages the human brain's visual processing capacity and allows a more forgiving 1-second latency budget, compared with the strict 200 ms required for fluid speech.

OpenAI News · Inference & Serving · Jun 28, 2026

OpenAI and Broadcom Unveil Jalapeño Inference Chip

OpenAI and Broadcom have developed 'Jalapeño,' a custom ASIC designed specifically for LLM inference, aiming to improve performance-per-watt and reduce latency through hardware-software co-design.

DAY 03 · Friday · JUN 26, 2026 · 3 SUMMARIES

TechCrunch: AI · Models & Frontier Labs · Jun 26, 2026

OpenAI Limits GPT-5.6 Rollout Amid Government Oversight

OpenAI is restricting the release of its new GPT-5.6 model lineup to a select group of partners following U.S. government intervention, highlighting growing friction between frontier AI development and emerging regulatory oversight.

TechCrunch: AI · Inference & Serving · Jun 26, 2026

The Strategic Shift Toward Custom AI Silicon

Major tech players are developing custom chips to mitigate single-supplier risk, optimize hardware for specific workloads, and achieve performance gains similar to Apple's transition away from Intel.

IBM Technology · Inference & Serving · Jun 26, 2026

Scaling Beyond 2D: IBM’s Nano Stack and the Rise of Orchestration

IBM introduces a 0.7nm 'nano stack' chip architecture to overcome 2D scaling limits, while the panel debates the shift from monolithic model development to multi-model orchestration as the new frontier for AI performance.

DAY 04 · Thursday · JUN 25, 2026 · 1 SUMMARY

Google Cloud Tech · Inference & Serving · Jun 25, 2026

Scaling AI Agents and Inference on Google Cloud Run

Google Cloud Run is evolving from a web-service platform into a comprehensive runtime for AI agents, inference, and background tasks, introducing features like GPU support, sandboxed code execution, and custom scaling controls.

DAY 05 · June 4, 2026 · 1 SUMMARY

AI Engineer · AI & LLMs · Jun 4, 2026

Text Diffusion: Low-Latency Generation and Bidirectional Reasoning

Text diffusion models offer significantly lower latency than autoregressive models by generating text in parallel blocks, enabling bidirectional reasoning, self-correction, and dynamic computation.

DAY 06 · May 22, 2026 · 1 SUMMARY

Google Cloud Tech · AI & LLMs · May 22, 2026

Mastering the AI Stack: From Agents to Energy

Understanding the full AI stack—from agentic frameworks down to data center energy requirements—is essential for developers to optimize model performance, hardware constraints, and inference efficiency.

DAY 07 · March 15, 2026 · 1 SUMMARY

Hugging Face Blog · Models & Frontier Labs · Mar 15, 2026

Accelerating MoE Fine-Tuning with NVIDIA NeMo AutoModel

NVIDIA NeMo AutoModel extends Hugging Face Transformers v5 to provide 3.4-3.7x higher training throughput and 29-32% lower memory usage for MoE models by integrating Expert Parallelism, DeepEP, and TransformerEngine kernels.

DAY 08 · June 30, 2025 · 1 SUMMARY

Hugging Face Blog · Inference & Serving · Jun 30, 2025

Optimizing Browser AI with Cross-Origin Storage

The proposed Cross-Origin Storage (COS) API allows web apps to share large AI model and Wasm files across different origins using cryptographic hashes, eliminating redundant downloads and storage.
