Static Benchmarks Fail Malleable AI Systems

Traditional software evals rely on unit tests, regression suites, CI/CD, and chaos engineering to measure static code. AI agents break this model: they adapt to users, rewrite their own harnesses (OpenClaw, where Vincent Koc is a core contributor, is one example), and exhibit behavioral drift over time. Handcrafted datasets miss the 20% of edge cases that break products, test suites go stale quickly, and production traces surface issues benchmarks never see. The result: conferences hyperfocus on static benchmarks while systems ship with unmeasured chaos. The trade-off: offline evals ensure compliance (e.g., no illegal financial advice) but skip real-world stress, leaving gaps that stay hidden until failures hit production.
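A minimal sketch of catching that behavioral drift, assuming you log a quality score per response; the scores below are synthetic, and the KS test is one illustrative choice, not a prescribed method:

```python
# Minimal sketch: detect behavioral drift on logged eval scores.
# Scores are synthetic here; in practice they come from production traces.
import random
from scipy.stats import ks_2samp

last_month = [random.gauss(0.80, 0.05) for _ in range(500)]
this_month = [random.gauss(0.72, 0.08) for _ in range(500)]  # drifted

# A two-sample KS test flags when the score distribution has shifted.
stat, p = ks_2samp(last_month, this_month)
if p < 0.01:
    print(f"behavioral drift detected (KS={stat:.2f}, p={p:.1e})")
```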

Chaos engineering, randomly breaking systems to find their limits, applies here but is rarely practiced in AI. Software itself is now malleable and ships at lightning speed; benchmarks can't keep up without adapting in kind.
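What chaos engineering looks like for agents, as a minimal sketch: wrap a tool so it randomly fails, then measure whether the agent degrades gracefully. The `chaotic` wrapper and failure rate are illustrative assumptions:

```python
# Minimal sketch: chaos-style fault injection for an agent's tools.
import random

def chaotic(tool, failure_rate: float = 0.1):
    # Wrap a tool so it fails at random, like chaos engineering for agents.
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise TimeoutError("injected fault: tool unavailable")
        return tool(*args, **kwargs)
    return wrapped

search = chaotic(lambda q: f"results for {q}")
# The eval then scores how the agent behaves when `search` raises.
```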

Shift from Prompt to Intent Engineering Compounds Eval Challenges

AI has evolved in stages. Prompt engineering (trial-and-error word-bashing for outputs, akin to stumbling onto a painkiller while testing a liver drug) was dead by 2023. Context engineering added RAG, tools, and search, enabling modular testing of agent parts (e.g., a sales MCP tool in isolation; see the sketch after this paragraph). Now comes 2025's intent engineering: cheap, fast tokens fuel self-optimizing agents that understand user intent through harnesses like OpenClaw, Claude Code, or Codex. Models already solve ARC-AGI-2/3 puzzles that are hard for humans via pattern recognition.
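A minimal sketch of that modular testing: exercise one tool contract in isolation rather than the full agent loop. `lookup_account` is a hypothetical MCP-style sales tool, and the schema check is illustrative:

```python
# Minimal sketch: test one agent component, not the whole agent.

def lookup_account(customer_id: str) -> dict:
    # Stand-in for a real MCP tool call.
    return {"id": customer_id, "tier": "enterprise", "arr_usd": 120_000}

def test_lookup_account_schema():
    result = lookup_account("acct_42")
    # Contract checks: downstream agent steps depend on these fields.
    assert set(result) >= {"id", "tier", "arr_usd"}
    assert result["id"] == "acct_42"
    assert result["arr_usd"] >= 0

if __name__ == "__main__":
    test_lookup_account_schema()
    print("tool contract holds")
```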

The problem: personalized experiences vary by user, which makes evals harder. Agents feel insecure when you have no insight into their layers. What's needed: ways to measure ambiguity and grade against personality rubrics (closer to grading art than verifying that 1 + 1 = 2).
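A minimal sketch of rubric-based grading, assuming a hypothetical `judge` callable (e.g., an LLM-as-judge) that scores one criterion from 1 to 5; the criteria and weights are illustrative:

```python
# Minimal sketch: weighted rubric grading, like art, not arithmetic.

RUBRIC = {
    "follows_user_intent": 0.4,
    "tone_matches_persona": 0.3,
    "handles_ambiguity": 0.3,
}

def judge(criterion: str, response: str) -> int:
    # Stand-in for an LLM-as-judge call; returns a 1-5 score.
    return 4

def grade(response: str) -> float:
    # Weighted rubric score in [1, 5].
    return sum(w * judge(c, response) for c, w in RUBRIC.items())

print(grade("Sure, here's a plan tailored to your timeline..."))
```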

Build Living Evals as Self-Optimizing Agents

Define the end-state intent (e.g., a user reward signal) and let agents curate the suite from production traces: roughly 80% of traces repeat, but shifts in customer behavior trigger changes; the agents detect the shift, alert the owners, and update the tests. Run evals online as always-on optimization, and integrate telemetry (errors, costs) for self-correction, so the system heals issues rather than trying to predict them.
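A minimal sketch of that trace-driven curation, assuming traces arrive as (query, passed) tuples and a hypothetical `alert_owner` hook standing in for paging or Slack:

```python
# Minimal sketch: let the eval suite grow itself from production traces.
from collections import Counter

def alert_owner(msg: str) -> None:
    print("ALERT:", msg)  # stand-in for a real notification hook

def curate(suite: set[str], traces: list[tuple[str, bool]],
           min_count: int = 3) -> set[str]:
    # Most traffic repeats known patterns; count recurring queries.
    freq = Counter(q for q, _ in traces)
    for query, n in freq.items():
        if n >= min_count and query not in suite:
            # A new recurring pattern signals a customer shift worth testing.
            alert_owner(f"new recurring query: {query!r} ({n}x)")
            suite.add(query)
    return suite

suite = curate({"refund policy?"}, [("cancel my plan", True)] * 5)
```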

This applies broadly: auto-optimization via reward loops (a Python loop scoring candidates against a reward signal) can tune almost anything, even a BBQ spice mix. Evals become code and living agents, not datasets: roughly 80% static and intent-defined, 20% adaptive for the weird queries. At Comet, they're implementing this now; the mindset shift is to treat evals agentically, evolving as the problem and the data shift.
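A minimal sketch of such a reward loop, hill-climbing a single parameter; the `reward` function here is a toy stand-in (pretend it's a taste-test score for a spice ratio), not anyone's production objective:

```python
# Minimal sketch: a Python reward loop that tunes one parameter.
import random

def reward(ratio: float) -> float:
    # Toy reward: peaks at ratio=0.3 (stand-in for a taste-test score).
    return -(ratio - 0.3) ** 2

best, best_r = 0.5, reward(0.5)
for _ in range(200):
    # Propose a nearby candidate, clipped to [0, 1].
    cand = min(1.0, max(0.0, best + random.gauss(0, 0.05)))
    r = reward(cand)
    if r > best_r:  # keep only improvements
        best, best_r = cand, r

print(f"tuned ratio: {best:.2f}")
```

Swap the toy `reward` for any measurable signal and the same loop optimizes prompts, tool configs, or, as the analogy goes, BBQ mixes.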