#machine-learning
Steering LLM Personality via Latent Feature Interventions
Researchers have developed a mechanistic method to steer LLM personality traits by identifying and modifying latent features in the model's residual stream using sparse autoencoders, enabling precise behavioral control without retraining.
MedEvoEval: A Longitudinal Framework for Evaluating Doctor Agents
MedEvoEval is a new evaluation framework that moves beyond static medical QA by testing how doctor agents learn, retain, and adapt clinical decision-making skills across sequences of simulated outpatient episodes.
HyphaeDB: Moving From Passive Storage to Agent-Native Memory
HyphaeDB reinterprets HNSW graph topology as a communication fabric for multi-agent systems, enabling knowledge propagation and emergent consensus rather than just passive retrieval.
ComMem: Dual-Memory Systems for VLM Test-Time Adaptation
ComMem improves VLM robustness by mimicking biological memory, using a fast-adapting visual cache and a slow-integrating textual prototype system to maintain cross-modal consistency during test-time adaptation.
Making LLM Self-Evolution Safe with Held-Out Selection
RSEA improves LLM agent performance by recursively evolving natural-language artifacts while using a strict held-out validation gate to prevent performance regression.
IMCBench: Evaluating Multimodal LLMs in Clinical Conversations
IMCBench is a new multi-turn, image-grounded benchmark for medical AI that reveals a critical gap: accurate clinical descriptions do not guarantee safe patient guidance.
ATHENA-R1: An AI Agent for Iterative Biomedical Treatment Reasoning
ATHENA-R1 is an AI agent that performs iterative treatment reasoning by dynamically querying a universe of 212 biomedical tools, outperforming GPT-5 by significant margins on clinical benchmarks.
Closing the Loop Between Model Evaluation and Data Intervention
By introducing 'capability slices'—groups of evaluation samples categorized by task and operation—engineers can transform benchmark failures into precise, actionable data interventions rather than relying on intuition.
GPTNT: A Real-Time Collaborative Benchmark for AI Agents
GPTNT uses the game 'Keep Talking and Nobody Explodes' to test AI agent collaboration under time pressure, revealing critical failures in state tracking and real-time communication.
Building Real-Time Industrial Digital Twins with AI
Modern digital twins must move beyond static dashboards to active, predictive systems that simulate and anticipate factory operations using real-time streaming data.
Building a Text-JEPA Model from Scratch
Text-JEPA moves away from autoregressive token prediction by learning world-model representations in latent space, offering a potential path toward more efficient, non-generative intelligence.
Ornith-1.0: Coding Models That Learn Their Own Harness
Ornith-1.0 achieves state-of-the-art performance for its size by incorporating the coding harness into the model's training gradient, allowing the model to dynamically generate its own execution scaffolds rather than relying on static, human-written ones.
How Arena Scaled AI Evaluation to $100M ARR
Arena, the crowdsourced AI leaderboard, reached $100M in annualized revenue by pivoting from a research project to a commercial platform providing deep-dive performance analytics to model labs.
Building Private Legal AI Infrastructure with Knowledge Graphs
Stephen Costigan argues that law firms should shift from renting generic AI tools to building private, firm-owned knowledge graphs to secure privileged data and create durable, differentiated legal intelligence.
AI Infrastructure, Robotics, and the Future of Human Agency
Recent developments in robotics, large-scale training diagnostics, and legal informatics highlight the rapid maturation of AI infrastructure, while historical and philosophical perspectives caution against overconfidence in predicting AI's societal trajectory.
Architecting an Agent-Native Immune System (ANIS) for AI Security
The Agent-Native Immune System (ANIS) moves security from external training-time alignment to an endogenous, runtime defense architecture that protects autonomous agents from hijacking and manipulation.
ATOD: Hybrid Training for High-Performance AI Agents
ATOD combines on-policy distillation with reinforcement learning to overcome the performance ceiling of imitation learning, using an annealed schedule and turn-level reweighting to improve long-horizon agent training.
Tree of Evidence: Hierarchical Fact-Checking Against AI Misinformation
ToE (Tree of Evidence) is a hierarchical framework that combats AI-generated misinformation by decomposing claims into dynamic argument trees, using reinforcement learning to retrieve and verify evidence across multiple sources.
Mitigating Rollout Error in Graph World Models
Graph World Models (GWMs) face unique long-horizon errors where local inaccuracies propagate through topology. The Error-Aware GWM framework uses spectral regularization and critical-node weighting to maintain stability during dynamic-edge rollouts.
Improving LLM Planning with Symbolic Feedback Loops
To solve LLM planning errors in long-horizon tasks, this framework uses symbolic verification to provide corrective, interpretable feedback, forcing the model to iteratively refine its plans.
Reducing LLM Agent Hallucinations with Grounded Iterative Planning
Grounded Iterative Language Planning (GILP) combines LLM reasoning with a lightweight, trained transition predictor to catch and correct hallucinated state changes, significantly improving planning accuracy.
ODYSSEY: A Categorical Framework for Verifiable AI Models
ODYSSEY introduces a categorical framework using 'foundries'—modular, verifiable building blocks—to construct foundation models that maintain local truth and allow for rigorous, queryable knowledge management.
Internalizing Future-Aware Planning in LLM Agents
Standard LLM agents are reactive; this research introduces a three-stage training pipeline to enable genuine 'what-if' reasoning by internalizing world models within autoregressive policies.
Mastering Probability Distributions for Machine Learning
Probability distributions are maps of data behavior. Understanding them allows you to select better models, engineer features effectively, and quantify uncertainty in production pipelines.
Why R-Squared Misleads and How to Properly Evaluate Regression
R-squared measures explained variance, but it does not penalize model complexity and can be distorted by outliers. To truly understand model performance, you must use a suite of metrics (MAE, MSE, RMSE, and Adjusted R-squared) to identify where your model fails and why.
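As a rough illustration of the metrics named in this summary, the sketch below computes them with their standard textbook definitions in plain Python. The function name `regression_metrics` and the `n_features` parameter are illustrative assumptions, not taken from the summarized article.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R-squared and adjusted R-squared.

    Hypothetical helper for illustration; `n_features` is the number of
    predictors used by the model and is only needed for adjusted R-squared.
    """
    n = len(y_true)
    if n != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if n - n_features - 1 <= 0:
        raise ValueError("need more samples than n_features + 1")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n      # average absolute error
    mse = sum(e * e for e in errors) / n       # average squared error
    rmse = math.sqrt(mse)                      # same units as the target

    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)    # total sum of squares
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for a constant target")
    r2 = 1 - ss_res / ss_tot

    # Adjusted R-squared shrinks R-squared as predictors are added.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


if __name__ == "__main__":
    metrics = regression_metrics(
        [1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_features=1
    )
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

Reading the values side by side is the point: MAE and RMSE report error in the target's own units, and adjusted R-squared falls below R-squared whenever extra predictors do not pay for themselves.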
Improving Uncertainty Estimation for Classifier Performance
Standard confidence interval methods often fail for small datasets or high-performance models; using Agresti-Coull, Wilson, or regularized bootstrap methods significantly improves the accuracy of the resulting uncertainty estimates.
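To make the contrast concrete, here is a minimal sketch of the textbook Wald, Wilson, and Agresti-Coull intervals for a classifier's accuracy, using only the Python standard library. The function names are illustrative assumptions, the regularized bootstrap variant mentioned in the summary is not shown, and nothing here is taken from the summarized article's own code.

```python
import math
from statistics import NormalDist


def _z(confidence):
    # Two-sided critical value of the standard normal distribution.
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def wald_interval(correct, n, confidence=0.95):
    """Standard normal-approximation interval (the one that breaks down)."""
    z = _z(confidence)
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(correct, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = _z(confidence)
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def agresti_coull_interval(correct, n, confidence=0.95):
    """Agresti-Coull interval: Wald applied to adjusted counts."""
    z = _z(confidence)
    n_adj = n + z * z
    p_adj = (correct + z * z / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


if __name__ == "__main__":
    # A small test set on which the classifier got every example right.
    correct, n = 20, 20
    print("Wald:         ", wald_interval(correct, n))  # (1.0, 1.0)
    print("Wilson:       ", wilson_interval(correct, n))
    print("Agresti-Coull:", agresti_coull_interval(correct, n))
```

With 20 of 20 correct, the Wald interval collapses to a single point and reports no uncertainty at all, while the other two still return a lower bound well below 1. This is the small-sample, high-accuracy failure the summary describes.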
The Critical Gaps in Multimodal LLM Evaluation
Current MLLM benchmarks rely on isolated tasks that fail to measure true cross-modal integration, missing key capabilities like temporal-spatial coherence and physical world reasoning.
The Verification Horizon: Why Coding Agents Need Evolving Rewards
As AI coding agents improve, generating code becomes easier than verifying it. Because no static reward function can perfectly capture human intent, verification must co-evolve with model capabilities to prevent reward hacking.
COrigami: AI-Driven Design for Flat-Foldable Origami
COrigami is an end-to-end AI pipeline that translates natural language into mathematically valid, flat-foldable origami crease patterns by combining geometric optimization with reinforcement learning-based aesthetic refinement.
Refusal in LLMs is Gated by Persona
Refusal behavior in chat models is not an isolated mechanism; it is downstream of the model's persona. Steering a model toward a compliant persona can cut refusal rates from 97% to 2%.