#benchmarking
WorldLines: Benchmarking Long-Horizon Stateful Embodied Agents
WorldLines introduces a benchmark and modeling framework for evaluating how embodied AI agents maintain state while executing complex, long-horizon tasks.
arXiv cs.AI
SentinelBench: Evaluating Long-Running AI Monitoring Agents
SentinelBench provides a standardized framework for evaluating AI agents that perform continuous, long-running monitoring, addressing a gap in testing agent reliability over extended time horizons.
arXiv cs.AI
Why Micro-Benchmarks Often Fail to Predict Production Performance
Micro-benchmarks often report false improvements because they measure performance under ideal conditions, such as warm caches, that rarely exist in production environments.
Level Up Coding