№ 02 / SUMMARIES

#llm

DAY 01 · Today · JUN 30 · 2026 · 11 SUMMARIES
arXiv cs.AI · AI & LLMs

Steering LLM Personality via Latent Feature Interventions

Researchers have developed a mechanistic method to steer LLM personality traits by identifying and modifying latent features in the model's residual stream using sparse autoencoders, enabling precise behavioral control without retraining.

arXiv cs.AI · AI & LLMs

HyphaeDB: Moving From Passive Storage to Agent-Native Memory

HyphaeDB reinterprets HNSW graph topology as a communication fabric for multi-agent systems, enabling knowledge propagation and emergent consensus rather than just passive retrieval.

arXiv cs.AI · AI & LLMs

ComMem: Dual-Memory Systems for VLM Test-Time Adaptation

ComMem improves VLM robustness by mimicking biological memory, using a fast-adapting visual cache and a slow-integrating textual prototype system to maintain cross-modal consistency during test-time adaptation.

arXiv cs.AI · AI & LLMs

Agentic Abstention: Improving When LLM Agents Should Stop

LLM agents often fail to stop when a task is impossible, leading to unnecessary tool use. The CONVOLVE method improves timely abstention by distilling interaction trajectories into reusable stopping rules.

arXiv cs.AI · AI & LLMs

Agent Safety Is Action Alignment, Not Content Refusal

Treating agent safety like chatbot content moderation is a category error. True agent security requires enforcing least privilege at the action boundary, not training models to refuse requests.

arXiv cs.AI · AI & LLMs

Making LLM Self-Evolution Safe with Held-Out Selection

RSEA improves LLM agent performance by recursively evolving natural-language artifacts while using a strict held-out validation gate to prevent performance regression.

arXiv cs.AI · AI & LLMs

Stabilizing Critic-Free RL with BV-Blend

BV-Blend improves reinforcement learning stability by blending prompt-local statistics with historical cluster-based moments, preventing training stalls when reward variance is zero.

arXiv cs.AI · AI & LLMs

Closing the Loop Between Model Evaluation and Data Intervention

By introducing 'capability slices'—groups of evaluation samples categorized by task and operation—engineers can transform benchmark failures into precise, actionable data interventions rather than relying on intuition.

arXiv cs.AI · AI & LLMs

GPTNT: A Real-Time Collaborative Benchmark for AI Agents

GPTNT uses the game 'Keep Talking and Nobody Explodes' to test AI agent collaboration under time pressure, revealing critical failures in state tracking and real-time communication.

IBM Technology · AI & LLMs

Optimizing LLM Inference: KV Cache and Paged Attention

LLM inference latency and throughput bottlenecks are often caused by inefficient GPU memory management. Using KV caching, paged attention, and specific tuning techniques like chunked prefill can drastically improve performance.

TechCrunch — AI · AI & LLMs

Why Vibe Coding Platform Base44 is Building Its Own AI Model

Base44 is transitioning to a vertically integrated stack by training its own LLM to gain control over latency, costs, and performance, signaling a shift toward defensibility for AI-native startups.

DAY 02 · Yesterday · JUN 29 · 2026 · 19 SUMMARIES
Level Up Coding · AI & LLMs

Stop Blaming Your RAG Pipeline: 16 Production Techniques

Most RAG failures are pipeline issues, not model limitations. Improving retrieval precision through hybrid search, reranking, and rigorous evaluation is more effective than simply swapping models.

Level Up Coding · AI & LLMs

Ornith-1.0: Coding Models That Learn Their Own Harness

Ornith-1.0 achieves state-of-the-art performance for its size by incorporating the coding harness into the model's training gradient, allowing the model to dynamically generate its own execution scaffolds rather than relying on static, human-written ones.

Level Up Coding · AI & LLMs

Optimizing RAG Retrieval with Hierarchical Search

Hierarchical RAG improves precision and reduces computational costs by replacing flat, corpus-wide similarity searches with a two-stage process: document-level filtering followed by targeted chunk retrieval.

AI Engineer · AI & LLMs

Building Great Agent Skills: The Missing Manual

To escape 'skill hell,' developers must treat agent skills as structured, maintainable code by optimizing triggers, minimizing context bloat, using 'leading words' for steering, and aggressively pruning irrelevant instructions.

Google Cloud Tech · AI & LLMs

Building Production-Grade Multi-Agent Systems with ADK

Learn to build robust, state-aware multi-agent systems using Google's Agent Development Kit (ADK) and the Model Context Protocol (MCP) to handle orchestration, security, and persistence.

arXiv cs.AI · MLOps & Infrastructure

Scaling Item Knowledge with JD's Oxygen AIIC Platform

JD.com's Oxygen AIIC uses a hybrid LLM/VLM architecture to automate item-knowledge production at scale, achieving 94.2% precision and 82.8% recall across tens of billions of SKUs.

arXiv cs.AI · Agents & Orchestration

Agent-Native Immune System (ANIS): Architecture for Runtime Defense

The Agent-Native Immune System (ANIS) shifts AI security from static training-time alignment to dynamic, runtime defense, using a six-layer 'Immune Tower' to protect autonomous agents against memory poisoning and tool-chain manipulation.

arXiv cs.AI · RAG & Retrieval

DysLexLens: Analyzing Dyslexic AI User Experiences via LLMs

DysLexLens is an end-to-end framework that extracts, structures, and validates insights from noisy online forum data to understand how dyslexic learners interact with AI tools.

arXiv cs.AI · Agents & Orchestration

ToE: Hierarchical Claim Verification Against Adversarial Misinformation

Tree of Evidence (ToE) is a fact-checking framework that uses a reinforcement learning-driven agent to decompose claims into hierarchical argument trees, significantly improving verification accuracy against adversarially poisoned inputs.

arXiv cs.AI · Agents & Orchestration

Improving Long-Horizon LLM Planning via Symbolic Feedback

This framework enhances LLM planning reliability by using a symbolic verifier to identify errors and provide corrective, interpretable instructions for iterative self-refinement.

arXiv cs.AI · Agents & Orchestration

Personality Prompting in Multi-Agent Teams: Task-Dependent Impact

Personality manipulation in LLM agents significantly alters communication style but only degrades task performance in open-ended or collaborative domains, while remaining largely neutral in structured coding tasks.

The Pragmatic Engineer (Gergely Orosz) · Coding Agents & Dev Productivity

The Shift in Software Engineering: AI Agents and Production Risk

AI agents have fundamentally transformed software development in six months, enabling massive increases in code output. However, this shift risks quality and security when organizations prioritize AI adoption over core engineering rigor, as evidenced by recent high-profile outages.

Ahead of AI (Sebastian Raschka) · Agents & Orchestration

Building and Auditing Local Coding Agents

A practical guide to setting up a local coding agent stack using Ollama and open-weight models, emphasizing performance benchmarking, secure auditing of agent harnesses, and the trade-offs of running local vs. proprietary infrastructure.

Latent Space (Newsletter) · Agents & Orchestration

Claude Tag: Moving AI from Chat to Team-Based Delegation

Claude Tag shifts LLM interaction from synchronous chat to asynchronous, team-wide delegation within Slack, positioning Claude as a persistent, proactive coworker rather than a standalone tool.

Latent Space (Newsletter) · Models & Frontier Labs

OpenAI's GPT-5.6 Launch: Frontier Models as Managed Assets

OpenAI released the GPT-5.6 family (Sol, Terra, Luna) as a restricted, government-mediated preview, signaling a shift where release governance is now a core component of the model specification.

Latent Space (Newsletter) · Agents & Orchestration

Internal AI Adoption & The Rise of Agentic Workflows

OpenAI reports massive internal token growth across all departments, signaling that agentic workflows—supported by review loops and persistent infrastructure—are moving from experimental to core production patterns.

Together AI Blog · Inference & Serving

ParallelKernelBench: Frontier LLMs Struggle with Multi-GPU Kernels

While LLMs excel at single-GPU kernel generation, they currently struggle with multi-GPU tasks where communication bottlenecks and complex rank coordination dominate performance.

Hugging Face Blog · Inference & Serving

Deploying vLLM Endpoints on Hugging Face Jobs

Hugging Face Jobs allows engineers to spin up private, OpenAI-compatible vLLM endpoints on demand using a single command, providing a pay-per-second alternative for testing and experimentation.

arXiv cs.AI · AI & LLMs

Tandem Reinforcement Learning: Aligning AI Reasoning with Humans

Tandem Reinforcement Learning (TRL) forces stronger models to co-generate reasoning with weaker models, resulting in more legible, robust, and human-compatible chains of thought without sacrificing performance.
