Multi-Plane Topologies Slash Switch Tiers and Power for Massive Clusters
Traditional 800Gb/s networks require three or four tiers of switches to connect more than 100,000 GPUs, which adds power draw, failure points, and cost. MRC splits each 800Gb/s interface into eight 100Gb/s links, creating eight parallel 'planes' that connect to separate switches. A 64-port 800Gb/s switch then presents 512 logical ports at 100Gb/s, enough radix to connect roughly 131,000 GPUs with only two switch tiers (in a two-tier Clos where half of each Tier 0 switch's ports face GPUs, 512 x 512 / 2 = 131,072 endpoints). The design also increases path diversity, keeping more traffic local to Tier 0 switches, while cutting component count, power, and cost relative to single-plane builds. On its own, though, the topology does not solve congestion: single-path flows (as in classic RoCE) still collide on shared links, and in AI collective communications the worst-case packet latency is what stalls synchronous training.
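A minimal sizing sketch makes that arithmetic concrete. It assumes the port counts and plane split described above; the constants and the two-tier Clos formula are illustrative, not a published MRC specification.

```python
# Illustrative sizing only; the constants mirror the ratios in the text,
# not any published MRC specification.

PLANES = 8                     # one 800G interface split into 8 x 100G links
SWITCH_PORTS_800G = 64         # physical 800G ports per switch
SWITCH_PORTS_100G = SWITCH_PORTS_800G * PLANES   # 512 logical 100G ports

# In a two-tier Clos built from k-port switches, half of each Tier 0 switch's
# ports face GPUs and half face Tier 1, so the fabric scales to about
# k^2 / 2 endpoints.
k = SWITCH_PORTS_100G
max_gpus_two_tier = k * k // 2

print(f"logical radix per switch: {k}")                   # 512
print(f"max GPUs with two tiers : {max_gpus_two_tier:,}") # 131,072
```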
Packet Spraying and SRv6 Eliminate Congestion and Dynamic Routing
MRC sprays the packets of a single transfer across hundreds of paths spanning all planes, tagging each packet with its final destination memory address so packets can arrive out of order and be written directly into place. Adaptive load balancing monitors every path: congestion triggers a swap to a less-loaded path, packet loss retires the path (with periodic probes to restore it), and 'packet trimming' has switches forward only the packet header when a destination is congested, prompting a retransmit without raising a false failure alarm. The result is failure detection and rerouting in microseconds, versus seconds for traditional fabrics. MRC also replaces BGP dynamic routing with static SRv6 source routing: the sender embeds the full sequence of switch IDs for a path in the IPv6 address, and each switch simply shifts the address and forwards according to a small pre-configured static table, never recomputing routes. Failures are handled by retiring paths at the endpoints, which simplifies the control plane and removes a whole class of routing bugs from switch software.
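The sketch below is a hypothetical Python model, not MRC's actual wire format. It shows the two ideas together: a sender sprays packets round-robin across per-plane paths, tags each with its destination memory offset for out-of-order placement, encodes each path's hop list into an IPv6 destination address in SRv6 style, and retires a path when loss is reported. The `Path` and `Sprayer` names, the one-byte-per-hop encoding, and the fd00::/16 prefix are all assumptions for illustration.

```python
# Hypothetical sketch of per-packet spraying with endpoint-driven path
# retirement. The address encoding, field names, and constants are
# illustrative assumptions, not MRC's wire format.
import ipaddress
from dataclasses import dataclass


@dataclass
class Path:
    plane: int
    segments: tuple              # ordered switch IDs the packet must traverse
    healthy: bool = True

    def srv6_dst(self) -> ipaddress.IPv6Address:
        """Pack the hop list into the low bytes of an IPv6 address (one byte
        per hop, toy encoding). Each switch would shift one hop off and
        forward per its small static table, never recomputing routes."""
        low = 0
        for hop in self.segments:
            low = (low << 8) | (hop & 0xFF)
        return ipaddress.IPv6Address((0xFD00 << 112) | low)  # fd00::/16 prefix


@dataclass
class Sprayer:
    paths: list
    _seq: int = 0

    def send(self, payload: bytes, dest_offset: int) -> dict:
        """Spray one packet on the next healthy path, round-robin across
        planes. dest_offset is the final destination memory address, so the
        receiver can write payloads into place in any arrival order."""
        live = [p for p in self.paths if p.healthy]
        path = live[self._seq % len(live)]
        self._seq += 1
        return {
            "dst": str(path.srv6_dst()),   # source route embedded in the address
            "plane": path.plane,
            "offset": dest_offset,
            "len": len(payload),
        }

    def on_loss(self, path: Path) -> None:
        """Packet loss retires the path at the endpoint; background probes
        (not modelled here) would restore it once it passes again."""
        path.healthy = False


# One 4 KiB transfer sprayed as four packets across eight plane paths.
paths = [Path(plane=i, segments=(0x10 + i, 0x20 + i)) for i in range(8)]
tx = Sprayer(paths)
data = bytes(4096)
packets = [tx.send(data[o:o + 1024], dest_offset=o) for o in range(0, 4096, 1024)]
tx.on_loss(paths[3])                                     # a drop retires plane 3...
packets.append(tx.send(data[:1024], dest_offset=3072))   # ...and the retransmit avoids it
```

A real implementation would also run the recovery probes and congestion-triggered path swaps described above; they are omitted here for brevity.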
Production Impact: No Measurable Downtime Amid Constant Failures
In OpenAI's NVIDIA GB200 supercomputers (including OCI's Abilene Stargate site and Microsoft's Fairwater), MRC manages millions of links that flap frequently, multiple times per minute between tiers, yet synchronous pretraining jobs show no measurable impact, so repairs can be deferred. Rebooting four Tier-1 switches or repairing links mid-job requires no coordination with the workload; MRC routes around bad paths automatically. Production training data shows quick recovery even from the loss of a full Tier-1 switch, with temporary slowdowns far smaller than the physical capacity lost: one failed port on an eight-port interface cuts the peak rate by 1/8th, but path recalculation sustains most of the effective throughput. Because congestion is eliminated in the core, jobs sharing a cluster do not interfere with one another, maximizing GPU utilization for frontier models such as those powering ChatGPT (900M weekly users).
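A back-of-envelope sketch of that port-failure example follows; the demand level used below is an assumed utilization figure for illustration, not a measured value from these deployments.

```python
# Illustrative arithmetic only; the demand fraction is an assumption,
# not a measurement from the deployments described above.
PLANES = 8

def slowdown(failed_ports: int, demand_fraction: float) -> float:
    """Fractional throughput loss for one interface when failed_ports of its
    PLANES links are down and traffic is re-sprayed over the survivors.
    demand_fraction is how much of full line rate the job actually drives."""
    capacity = (PLANES - failed_ports) / PLANES      # e.g. 7/8 after one failure
    achieved = min(capacity, demand_fraction)
    return 1.0 - achieved / demand_fraction

# One failed port removes 12.5% of peak capacity...
print(f"{slowdown(1, demand_fraction=1.0):.1%}")   # 12.5% if the job needed full line rate
# ...but a job driving 70% of line rate sees no effective slowdown at all.
print(f"{slowdown(1, demand_fraction=0.7):.1%}")   # 0.0%
```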
Strategic Wins: Simpler Stacks for Stargate-Scale Compute
MRC delivers three advantages: a two-tier multi-plane topology with built-in redundancy and lower power; the elimination of core congestion, giving consistent per-flow throughput in synchronous training; and static SRv6 source routing across independent planes, letting endpoints bypass failures instantly. Deployed on hardware from AMD, Broadcom, Intel, Microsoft, and NVIDIA, it has been released through the Open Compute Project for industry adoption, supporting OpenAI's strategy of scaling AI infrastructure efficiently on shared standards.