#devops
Every summary, chronological. Filter by category, tag, or source from the rail.
Custom Elevated Sandbox Enables Safe Codex on Windows
OpenAI built a custom Windows sandbox for Codex using dedicated users, restricted tokens, firewall rules, and multi-binary setup to limit writes to workspace, block outbound network by default, and grant user-like reads without constant approvals.
CI/CD Breaks for Agents: Use Continuous Compute Loops
Traditional CI/CD chokes on thousands of agent PRs with cache thrash and merge bottlenecks; replace with intent-driven agent loops featuring inline validation, premerge reconciliation, and stateful continuous compute for sub-minute iterations.
MRC: Resilient Networking for 100K+ GPU AI Training
OpenAI's MRC protocol uses multi-plane topologies and packet spraying across hundreds of paths with SRv6 source routing to eliminate congestion, route around failures in microseconds, and connect 131k GPUs with just two switch tiers, enabling non-stop frontier model training.
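The spraying-plus-rerouting idea can be sketched in miniature. This is purely illustrative: the real protocol does this per packet in hardware using SRv6 source routes, while here a failed path is just dropped from a round-robin rotation.

```python
import itertools

class Sprayer:
    """Toy sketch of packet spraying: traffic rotates across many paths,
    and a failed path is simply removed from the rotation (illustrating
    the reroute-around-failure behavior described above)."""

    def __init__(self, paths):
        self.paths = list(paths)

    def fail(self, path):
        self.paths.remove(path)  # reroute: stop offering the dead path

    def spray(self, n_packets):
        cycle = itertools.cycle(self.paths)
        return [next(cycle) for _ in range(n_packets)]

s = Sprayer(paths=[f"path-{i}" for i in range(4)])
print(s.spray(8))   # each of the 4 paths carries 2 of 8 packets
s.fail("path-2")
print(s.spray(6))   # the remaining 3 paths absorb the load evenly
```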
OpenAI's Codex Controls: Sandbox, Rules, Telemetry
OpenAI deploys Codex coding agents with sandboxing for bounded execution, auto-approvals for low-risk actions, network/command restrictions, and OpenTelemetry logs to enable safe, auditable developer workflows without broad access.
AWS KMS Envelope Encryption Secures Data at Scale
Encrypt data efficiently with the AWS KMS envelope pattern: use master keys to generate ephemeral AES-256 DEKs for fast local encryption/decryption, storing only the encrypted DEKs alongside the ciphertext for auditable, revocable access.
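The pattern can be sketched locally with no AWS dependency. A toy XOR keystream stands in for AES-256, and a local `master_key` stands in for the KMS master key (which in reality never leaves KMS; boto3's `generate_data_key` and `decrypt` calls would replace the marked lines):

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (NOT AES; illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

# Stand-in for the KMS master key: in real AWS KMS this never leaves the service.
master_key = secrets.token_bytes(32)

def generate_data_key():
    """Mimics kms.generate_data_key: returns (plaintext DEK, wrapped DEK)."""
    dek = secrets.token_bytes(32)
    return dek, keystream_xor(master_key, dek)  # DEK wrapped under master key

# Encrypt: use the plaintext DEK locally, keep only the wrapped DEK.
dek, wrapped_dek = generate_data_key()
ciphertext = keystream_xor(dek, b"customer record")
del dek  # plaintext DEK is discarded after use

# Decrypt: unwrap the DEK (kms.decrypt in real life), then decrypt locally.
dek = keystream_xor(master_key, wrapped_dek)
plaintext = keystream_xor(dek, ciphertext)
print(plaintext)  # b'customer record'
```

Only `wrapped_dek` and `ciphertext` are persisted, so revoking or auditing access happens at the master key, not per object.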
MRC: OpenAI's Protocol for Resilient AI Training Networks
OpenAI's MRC extends RoCE with multipath spraying, microsecond failure recovery via SRv6, and multi-plane designs to deliver predictable performance in 131k-GPU clusters, using two-thirds fewer optics and three-fifths fewer switches than traditional setups.
MRC Enables 100k+ GPU Clusters with Resilient Multipath Networking
OpenAI's MRC protocol spreads packets across hundreds of paths for microsecond failure recovery, connecting 100,000+ GPUs via just 2 switch tiers—cutting power, cost, and downtime in AI training supercomputers.
Ditch preferred_username for Azure AD Guest Auth
Using preferred_username as identity anchor worked for employees but failed silently for all B2B guests, causing 403 errors post-launch. Anchor on oid instead for reliable identification.
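A minimal sketch of the fix, using hypothetical decoded token claims (the claim values are invented; the `#EXT#`-mangled UPN is the shape that breaks guest lookups):

```python
# Hypothetical claim payloads, already validated and decoded from Azure AD tokens.
employee_claims = {
    "preferred_username": "alice@contoso.com",
    "oid": "11111111-aaaa-4bbb-8ccc-222222222222",
}
guest_claims = {
    # B2B guests carry their home-tenant UPN, often with #EXT# mangling,
    # so it won't match any user record keyed on preferred_username.
    "preferred_username": "bob_fabrikam.com#EXT#@contoso.onmicrosoft.com",
    "oid": "33333333-dddd-4eee-8fff-444444444444",
}

def user_key(claims: dict) -> str:
    """Anchor identity on the immutable object ID, not the mutable UPN."""
    return claims["oid"]

print(user_key(guest_claims))  # a stable GUID, same shape for guests and employees
```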
SIE: Dynamic Inference for Small Models on Shared GPUs
Open-source SIE engine from Superlinked enables hot-swapping small embedding models (e.g., Stella, ColBERT) on one GPU via LRU eviction, cutting costs and solving context rot in agents by preprocessing data.
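The hot-swapping can be sketched as a plain LRU pool. This is illustrative, not SIE's actual API; `load_fn` stands in for loading model weights onto the GPU, and `capacity` for how many models fit in memory at once:

```python
from collections import OrderedDict

class ModelPool:
    """LRU pool: keeps at most `capacity` models 'loaded', evicting the
    least recently used one to make room for a newly requested model."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn
        self.loaded = OrderedDict()

    def get(self, name: str):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
        else:
            if len(self.loaded) >= self.capacity:
                evicted, _ = self.loaded.popitem(last=False)  # drop the LRU model
                print(f"evicting {evicted}")
            self.loaded[name] = self.load_fn(name)
        return self.loaded[name]

pool = ModelPool(capacity=2, load_fn=lambda name: f"<{name} weights>")
pool.get("stella")
pool.get("colbert")
pool.get("stella")       # refreshes stella's recency
pool.get("bge")          # evicts colbert, the least recently used
print(list(pool.loaded))  # ['stella', 'bge']
```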
Secure AI Agents via MCP Toolbox Custom Tools
MCP Toolbox prevents confused deputy attacks by letting developers pre-write constrained SQL tools with bound parameters, separating agent flexibility from app-controlled security for runtime agents.
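The bound-parameter idea can be illustrated with stdlib `sqlite3` (a sketch of the principle, not MCP Toolbox's actual API): the developer fixes the SQL shape ahead of time, and the agent may only supply values, which are bound rather than interpolated.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "acme", 120.0), (2, "globex", 75.5)])

# Developer-authored tool: the query text is constant; agent input is a
# bound parameter and can never change what the query does.
ORDERS_BY_CUSTOMER = "SELECT id, total FROM orders WHERE customer = ?"

def orders_by_customer(customer: str):
    return con.execute(ORDERS_BY_CUSTOMER, (customer,)).fetchall()

print(orders_by_customer("acme"))             # [(1, 120.0)]
print(orders_by_customer("acme' OR '1'='1"))  # [] -- injection attempt is inert
```

Because the agent never composes SQL, a confused-deputy prompt can at worst query with a strange value, never escalate to a different query.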
Replace Cron with Temporal for Reliable Data Jobs
Cron offers no retries, overlap control, atomic writes, or observability. Temporal workflows add retries (3s initial interval, 2x backoff, max 8 attempts), atomic writes, unique output files per run ID, a SKIP overlap policy, and full execution history via the UI, surviving crashes because state lives in Temporal.
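In Temporal's Python SDK that policy would be declared as a `RetryPolicy`; here is just the arithmetic, showing the delay schedule the stated parameters (3s initial, 2x backoff, 8 max attempts) imply:

```python
def retry_delays(initial: float, backoff: float, max_attempts: int):
    """Delays (seconds) before each retry: 8 attempts total means 7 retries,
    each waiting `initial * backoff**i` seconds."""
    return [initial * backoff ** i for i in range(max_attempts - 1)]

delays = retry_delays(initial=3.0, backoff=2.0, max_attempts=8)
print(delays)  # [3.0, 6.0, 12.0, 24.0, 48.0, 96.0, 192.0]
```

A transient failure thus gets close to five and a half minutes of cumulative backoff before the run is marked failed.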
Self-Host Vane + Ollama for Private AI Web Research
Install Vane in Docker on Windows 11 with local Ollama and Qwen3.5:9b to run citation-backed searches privately, bypassing cloud services like OpenAI.
Proactive Synthetic Monitoring Catches DevOps Failures Early
Simulate user actions like logins, searches, and API calls to detect regressions, availability issues, and performance degradation before production traffic, integrating tests into CI/CD for consistent validation.
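A minimal synthetic-check runner, with stub probes standing in for real logins, searches, and API calls (a sketch; production monitors would issue real HTTP requests and emit metrics instead of dicts):

```python
import time

def synthetic_check(name, probe, max_latency_s=1.0):
    """Run one synthetic probe and classify the outcome: availability
    (exceptions), regressions (bad status), and performance (latency)."""
    start = time.monotonic()
    try:
        status = probe()  # stand-in for e.g. an HTTP request returning a status
    except Exception as exc:
        return {"check": name, "ok": False, "reason": f"error: {exc}"}
    latency = time.monotonic() - start
    if status != 200:
        return {"check": name, "ok": False, "reason": f"status {status}"}
    if latency > max_latency_s:
        return {"check": name, "ok": False, "reason": "too slow"}
    return {"check": name, "ok": True, "reason": "healthy"}

print(synthetic_check("login", lambda: 200))   # healthy
print(synthetic_check("search", lambda: 503))  # status 503
```

Running the same checks from CI/CD gives every deploy the validation described above before real traffic arrives.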
SageMaker Fine-Tuning: LoRA Beats QLoRA on Cost-Perf Balance
LoRA cuts trainable params by 96% vs full fine-tuning, balancing cost savings and accuracy on Llama2-7B/Mistral7B; QLoRA saves 8x memory but trains slower due to dequantization overhead.
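The parameter-count arithmetic behind that saving, for a single illustrative 4096x4096 projection (rank 8 and the layer shape are assumptions for illustration; the summary's 96% figure is model-wide, not per layer):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: A is (d_in x r), B is (r x d_out),
    so the adapter trains r * (d_in + d_out) weights instead of d_in * d_out."""
    return rank * (d_in + d_out)

# One 4096x4096 attention projection, roughly Llama2-7B-sized.
full = 4096 * 4096
lora = lora_params(4096, 4096, rank=8)
print(f"LoRA trains {lora / full:.2%} of this layer's weights")  # 0.39%
```

QLoRA keeps the same adapter math but stores the frozen base weights in 4-bit form, which is where both the 8x memory saving and the dequantization slowdown come from.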
Composable Specialists Beat Monoliths for Enterprise AI
Panel agrees enterprises need Granite 4.1's task-specific models and Bob's orchestration for cost control, with DiLoCo enabling distributed training to sidestep grid limits.
Bigtable Scales Petabytes for Real-Time NoSQL Workloads
Bigtable auto-scales to hundreds of petabytes and millions of ops/sec with low latency, powering Google Search/YouTube/Maps; ideal for time series, ML features, and streaming via Flink/Kafka integrations.
Scale PyTorch DDP Multi-Node on AWS EC2: Infra-First Guide
Multi-node DDP demands identical environments, data access, and open security groups across EC2 instances; use torchrun launcher with DDPManager for minimal code changes and reliable gradient sync via NCCL.
TPUs Dominate at Infrastructure Scale Over Per-Chip GPU Wins
Google's TPU v8t (training) and v8i (inference) lag Nvidia GPUs per chip but deliver superior performance at scale—9600-chip superpods hit 121 exaFLOPS FP4—via cube topology and Virgo networking, optimizing for AI's bandwidth-heavy workloads.
Batch Size Unlocks 1000x LLM Inference Efficiency
Reiner Pope deduces frontier LLM training and serving mechanics from roofline analysis, revealing batch size as the core driver of latency-cost tradeoffs, with optimal batches of ~2000 tokens amortizing weights for massive gains.
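A hedged roofline sketch of that tradeoff. The hardware and model numbers (70B params, ~1 PFLOP/s, 3 TB/s) are illustrative assumptions, not Pope's figures; the point is only the shape: per-token time is the max of a weight-loading term that shrinks with batch size and a compute term that does not.

```python
def time_per_token_s(params: float, flops: float, mem_bw: float, batch: int) -> float:
    """Roofline sketch: each step streams all weights once (fp16, 2 bytes/param,
    amortized over the batch) and does ~2*params FLOPs per token."""
    load = 2 * params / mem_bw / batch  # weight-read time, shared by the batch
    compute = 2 * params / flops        # per-token arithmetic time
    return max(load, compute)

P, F, BW = 70e9, 1e15, 3e12  # hypothetical 70B model on a ~1 PFLOP/s, 3 TB/s chip
for b in (1, 128, 2048):
    print(b, time_per_token_s(P, F, BW, b))
```

With these numbers the crossover from bandwidth-bound to compute-bound lands in the low thousands of tokens per batch, consistent with the ~2000-token optimum in the summary.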
Claude Code's DIY-Heavy Tech Stack Picks
Claude Code prefers custom/DIY solutions in 12 of 20 tooling categories but defaults to Vercel for JS deploys (100%), Stripe for payments (91%), Shadcn for UI (90%), and GitHub Actions for CI/CD (94%), revealing AI's influence on new dev stacks.
GitHub RCE via Single Git Push X-Stat Injection
Authenticated users exploited X-Stat field injection in GitHub's internal git protocol for RCE on GitHub.com and GHES using a standard git push, enabling access to millions of repos (CVE-2026-3854, High severity).
Scaffold AI Agent Prod Infra in 60s with Google Starter Pack
Google's Agent Starter Pack CLI generates full production-ready AI agent stack—FastAPI backend, Terraform IaC, CI/CD, Vertex AI eval, observability—in 60 seconds, cutting typical 3-9 month infra setup to minutes across 6 templates.
Gemma 4 Prod Stack: Model Armor, ADK Agents, Tracing
Deploy secure, observable Gemma 4 agents on Cloud Run using load balancers for Model Armor integration, ADK for model-agnostic agents with vLLM, and Prometheus/Cloud Trace for metrics like GPU util and latency.
Mount S3 Buckets as File Systems with AWS S3 Files
AWS S3 Files mounts buckets directly as file systems on EC2, containers, and Lambda—eliminating FUSE hacks and sync scripts for AI/ML workflows, but misconfigurations risk exposing, corrupting, or losing data.
Self-Host Gemma 4 on Cloud Run GPUs: Ollama vs vLLM
Deploy open Gemma 4 LLM on serverless Cloud Run GPUs two ways: Ollama bakes model into container for instant cold starts; vLLM mounts from GCS FUSE for model swaps without rebuilds. Full CI/CD via Cloud Build.
AI Drafts Code Fast But Misses Context and Silent Bugs
Fully delegating dev workflow to AI sped up drafting but caused production issues like hollow tests, context-blind pipelines, AI self-reviews, and 34% webhook drop from unmodeled behavioral changes. Humans must supply context, break review loops, and validate impacts.
Zero Leak Debt: Kill 100+ Leaked Secrets Platform-Wide
Leaked secrets from 2022 still process payments as 'leak debt'; ruthlessly audit across local dev, CI/CD, and production to reach zero static secrets that never leak, expire unexpectedly, or need manual rotation.
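An audit like this usually starts with pattern scanning across repos, CI logs, and configs. A toy sketch with two illustrative rules (real scanners such as gitleaks or trufflehog ship far richer rule sets, plus entropy checks):

```python
import re

# Illustrative patterns for two common credential shapes.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"]\w{16,}['\"]"),
}

def scan(text: str):
    """Return the names of every rule that matches the given text."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]

sample = 'aws_key = "AKIAABCDEFGHIJKLMNOP"\napi_key: "sk_live_abcdef1234567890"'
print(scan(sample))  # ['aws_access_key', 'generic_api_key']
```

Finding the secrets is the easy half; reaching zero leak debt also means replacing each hit with short-lived, automatically rotated credentials so nothing static is left to leak again.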
8 AI Agents Turn Terminal into Free Cyber Audit Lab
One command spawns 8 specialist AI agents in Claude Code to audit codebases for vulnerabilities across OWASP Top 10, CWE Top 25, and more—boosted Claude Ads score from 62/100 (C) to 90/100 after fixes.
Scaling TPUs on GKE for Massive AI Workloads
GKE treats TPU slices as atomic units for seamless scaling up to 9k+ chips, with flexible capacity like DWS Flex/Calendar and custom fallbacks for cost-efficient ML training/inference.
Self-Host Archon v3 on Hetzner VPS with Docker
Provision a Hetzner VPS, apply cloud-init YAML to auto-set up Archon v3 behind a Caddy HTTPS reverse proxy with a Postgres DB, then configure .env secrets and optional form auth for secure 24/7 access via a subdomain.
Showing 30 of 73