#gpu
Optimizing LLM Inference: KV Cache and Paged Attention
LLM inference latency and throughput bottlenecks are often caused by inefficient GPU memory management. Using KV caching, paged attention, and specific tuning techniques like chunked prefill can drastically improve performance.
IBM Technology
Scaling AI Agents and Inference on Google Cloud Run
Google Cloud Run is evolving from a web-service platform into a comprehensive runtime for AI agents, inference, and background tasks, introducing features like GPU support, sandboxed code execution, and custom scaling controls.
Google Cloud Tech
Flash-KMeans: Accelerating Exact Clustering on GPUs
Flash-KMeans optimizes Lloyd's k-means algorithm for GPUs by restructuring dataflow to eliminate HBM bottlenecks, achieving up to 200x speedups over FAISS without sacrificing mathematical accuracy.
Deploying GPU Workloads Directly from Your IDE with RunPod Flash
RunPod's Flash SDK allows developers to deploy and iterate on GPU-accelerated Python functions directly from their IDE using a simple decorator, eliminating the need for manual Docker builds and container registry management.
AI Engineer
Building Tiled GPU Kernels with NVIDIA cuTile Python
NVIDIA cuTile allows developers to write efficient, tile-based GPU kernels directly in Python, providing a structured way to handle memory access and computation that can be benchmarked against standard PyTorch operations.
mKernel: Fusing Compute and Communication for GPU-Driven Scaling
mKernel eliminates host-driven communication bottlenecks by fusing intra-node NVLink, inter-node RDMA, and compute into persistent CUDA kernels, enabling fine-grained overlap at the tile level.