#research
Every summary, chronological.
MedEvoEval: A Longitudinal Framework for Evaluating Doctor Agents
MedEvoEval is a new evaluation framework that moves beyond static medical QA by testing how doctor agents learn, retain, and adapt clinical decision-making skills across sequences of simulated outpatient episodes.
Specialized Clinical AI Outperforms General Models in Real-World Use
A study of 620 real-world clinical queries shows that specialized AI tools significantly outperform general-purpose models across accuracy, utility, and verifiability, highlighting the need for domain-specific evaluation.
ComMem: Dual-Memory Systems for VLM Test-Time Adaptation
ComMem improves VLM robustness by mimicking biological memory, using a fast-adapting visual cache and a slow-integrating textual prototype system to maintain cross-modal consistency during test-time adaptation.
IMCBench: Evaluating Multimodal LLMs in Clinical Conversations
IMCBench is a new multi-turn, image-grounded benchmark for medical AI that reveals a critical gap: accurate clinical descriptions do not guarantee safe patient guidance.
GPTNT: A Real-Time Collaborative Benchmark for AI Agents
GPTNT uses the game 'Keep Talking and Nobody Explodes' to test AI agent collaboration under time pressure, revealing critical failures in state tracking and real-time communication.
Building a Text-JEPA Model from Scratch
Text-JEPA moves away from autoregressive token prediction by learning world-model representations in latent space, offering a potential path toward more efficient, non-generative intelligence.
Agentic Robotics, Large-Scale Infra, and Future Uncertainty
Recent developments in agentic robot self-improvement, large-scale GPU cluster telemetry, and legal data infrastructure highlight the rapid maturation of AI systems, even as experts debate the long-term implications for human autonomy.
Internalizing Future-Aware Planning in LLM Agents
Standard LLM agents are reactive; this research introduces a three-stage training pipeline to enable genuine 'what-if' reasoning by internalizing world models within autoregressive policies.
Personality Prompting in Multi-Agent Teams: Impact vs. Task Structure
Personality manipulation in LLM agents significantly alters communication style but only degrades performance in open-ended or competitive tasks, while having negligible impact on structured coding tasks.
The Shift from Chatbots to Agentic Workflows
OpenAI's internal data shows a transition from short-horizon chatbot interactions to long-horizon agentic tasks, with non-technical departments adopting agents faster than engineers to perform cross-functional work.
Improving LLM Ethical Reasoning with Narration-of-Thought
Narration-of-Thought (NoT) is an inference-time prompting scaffold that forces LLMs to explicitly identify stakeholders and uncertainties before committing to a decision, significantly reducing common ethical reasoning failures.
The Critical Gaps in Multimodal LLM Evaluation
Current MLLM benchmarks rely on isolated tasks that fail to measure true cross-modal integration, missing key capabilities like temporal-spatial coherence and physical world reasoning.
Refusal in LLMs is Gated by Persona
Refusal behavior in chat models is not an isolated mechanism; it is downstream of the model's persona. Steering a model toward a compliant persona can suppress refusal rates from 97% to 2%.
Beyond Accuracy: Evaluating AI Agents After Benchmark Saturation
When AI benchmarks saturate, accuracy becomes a poor metric. Researchers should instead evaluate agents across six dimensions: construct validity, generalizability, efficiency, reliability, model/scaffold performance, and human-agent collaboration.
Hybrid vs. Transformer: Token-Level Performance Analysis
Hybrid models outperform transformers on meaning-bearing content words due to superior state-tracking, while transformers retain a distinct advantage in verbatim token repetition and exact recall tasks.
The Miranda Hypothesis: Why Persona Evals Fail
Current persona-based AI benchmarks measure 'convincingness' rather than historical fidelity, leading to 'Miranda distortion' where models prioritize culturally dominant narratives (like the Hamilton musical) over primary documentary records.
Why Static Word Embeddings Fail at Contextual Meaning
Early NLP systems treated words as fixed, singular vectors, ignoring polysemy. This design flaw caused systemic errors by failing to distinguish between different meanings of the same word based on context.
T2D-Bench: Evidence-Gated Evaluation for Clinical LLM Accuracy
T2D-Bench uses a multi-layer knowledge graph to detect and correct unsupported clinical omissions in LLM outputs, revealing that even top-tier models fail to meet evidence-based constraints in over 30% of cases.
Automating Mechanistic Interpretability with Agentic Loops
The HyVE agentic framework automates circuit explanation by iterating through observation, hypothesis generation, and causal validation, though reliable validation remains the primary bottleneck.
Defining True Agency: Agentic vs. Agentive Systems
Current 'AI agents' are merely engineered workflows. True agency requires internalizing goal-setting, identity, and self-regulation within the system, rather than relying on external scaffolding.
Mapping AI’s Impact on the European Labor Market
OpenAI’s new framework categorizes EU jobs into four transition archetypes to help policymakers and firms anticipate AI-driven labor shifts before they appear in aggregate statistics.
Memory Caching: Bridging RNN Efficiency with Transformer Recall
Google's 'Memory Caching' architecture proposes a hybrid approach that allows recurrent models to maintain a growing memory, potentially overcoming the quadratic scaling costs of Transformers while retaining long-context retrieval capabilities.
GLARE: Natural Language Interfaces for Global Model Explanations
GLARE provides a natural language interface for querying global model explanations, allowing users to interpret complex AI behavior through conversational prompts rather than static visualizations.
Moving Beyond Static Leaderboards for LLM Agent Evaluation
Static benchmarks often fail to predict real-world performance for LLM agents; the authors propose a framework focused on predictive validity to better align evaluation with practical utility.
The Symbiotic Evolution of AI and Software Engineering
The intersection of AI and Software Engineering (AI4SE and SE4AI) has matured over the last decade, shifting from experimental research to essential production-grade methodologies for building, testing, and maintaining complex systems.
Optimizing LLM Post-Training Through Pairwise Comparison Selection
The paper investigates how the selection of response pairs in preference-based post-training (such as DPO or PPO-based RLHF) impacts model performance, suggesting that strategic pair selection is as critical as the training algorithm itself.
Detecting LLM Epistemic Blind Spots via Cross-Model Attribution
LLMs often express unwarranted confidence in clinical settings. This paper introduces a method using Cross-Model Attribution Divergence (CMAD) to identify when models rely on unreliable features, flagging epistemic uncertainty in tabular data.
Deontic Policies for Runtime Governance of Agentic AI
The paper proposes using deontic logic—a system of formal rules defining obligations, permissions, and prohibitions—to govern the runtime behavior of autonomous AI agents.
Accelerating Scientific Discovery with LLM-Driven Hypothesis Testing
Immunologist Derya Unutmaz demonstrates how LLMs act as research collaborators by identifying hidden biological mechanisms in experimental data and simulating outcomes to prioritize high-value lab work.
SciRisk-Bench: Evaluating Safety in AI for Science
SciRisk-Bench is a new benchmark designed to evaluate the safety risks of AI models specifically applied to scientific research, focusing on multi-dimensional risk assessment.