№ 02 / SUMMARIES

#gpu

Every summary, chronological.

DAY 01 · Today · JUN 30, 2026 · 1 SUMMARY
IBM Technology · AI & LLMs

Optimizing LLM Inference: KV Cache and Paged Attention

LLM inference latency and throughput bottlenecks are often caused by inefficient GPU memory management. Using KV caching, paged attention, and specific tuning techniques like chunked prefill can drastically improve performance.
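The KV caching idea mentioned above can be illustrated with a minimal sketch. It assumes a single attention head, plain Python lists instead of GPU tensors, and hypothetical names (`KVCache`, `decode_step`, `project`) that do not come from the summarized video. Each decode step projects only the newest token and appends its key and value to the cache, rather than recomputing keys and values for the whole prefix.

```python
import math

def project(x, W):
    # Compute x @ W for a vector x (length d_in) and matrix W (d_in x d_out).
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over all cached positions.
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

class KVCache:
    # Hypothetical container: keys and values of every token seen so far.
    def __init__(self):
        self.keys = []
        self.values = []

def decode_step(x, Wq, Wk, Wv, cache):
    # Only the new token is projected; earlier keys/values are reused as-is.
    cache.keys.append(project(x, Wk))
    cache.values.append(project(x, Wv))
    return attend(project(x, Wq), cache.keys, cache.values)
```

The cache grows by one key and one value per generated token, which is why its memory footprint becomes the thing to manage. Paged attention, as described in the summary, addresses how that memory is laid out; this sketch does not model paging.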

DAY 02 · Thursday · JUN 25, 2026 · 1 SUMMARY
Google Cloud Tech · Inference & Serving

Scaling AI Agents and Inference on Google Cloud Run

Google Cloud Run is evolving from a web-service platform into a comprehensive runtime for AI agents, inference, and background tasks, introducing features like GPU support, sandboxed code execution, and custom scaling controls.

DAY 03 · June 15, 2026 · 1 SUMMARY
MarkTechPost · Software Engineering

Flash-KMeans: Accelerating Exact Clustering on GPUs

Flash-KMeans optimizes Lloyd's k-means algorithm for GPUs by restructuring dataflow to eliminate HBM bottlenecks, achieving up to 200x speedups over FAISS without sacrificing mathematical accuracy.

DAY 04 · June 9, 2026 · 2 SUMMARIES
AI Engineer · AI Automation

Deploying GPU Workloads Directly from Your IDE with RunPod Flash

RunPod's Flash SDK allows developers to deploy and iterate on GPU-accelerated Python functions directly from their IDE using a simple decorator, eliminating the need for manual Docker builds and container registry management.

MarkTechPost · Software Engineering

Building Tiled GPU Kernels with NVIDIA cuTile Python

NVIDIA cuTile allows developers to write efficient, tile-based GPU kernels directly in Python, providing a structured way to handle memory access and computation that can be benchmarked against standard PyTorch operations.

DAY 05 · May 30, 2026 · 1 SUMMARY
MarkTechPost · Software Engineering

mKernel: Fusing Compute and Communication for GPU-Driven Scaling

mKernel eliminates host-driven communication bottlenecks by fusing intra-node NVLink, inter-node RDMA, and compute into persistent CUDA kernels, enabling fine-grained overlap at the tile level.

