The Shift from Static QA to Longitudinal Evolution
Traditional medical AI benchmarks rely on fixed-input question answering, which fails to capture the reality of clinical practice. In practice, a doctor agent must gather evidence interactively, order examinations, and consult with others before committing to a diagnosis. MedEvoEval introduces an executable, longitudinal evaluation framework designed to measure how such agents evolve through memory, retrieval, and reflection over time.
Core Components of the MedEvoEval Framework
MedEvoEval uses action-gated simulated outpatient episodes to create a realistic testing environment. Key features include:
- Role-Specific Views: Each case is segmented into distinct patient, examination, and manager perspectives, so the agent has to request information from the relevant view instead of reading the whole case at once.
- Action-Gated Evidence: Information is revealed only when the agent performs a valid action, preventing the model from "cheating" by accessing the full case file prematurely.
- Structured Traces: Every episode generates a detailed log linking observations, actions, final outputs, and manager scores. This allows developers to analyze process costs, such as unnecessary testing, that are often hidden by simple final-answer scoring.
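The three features above can be illustrated with a small sketch. This is not MedEvoEval's actual schema or API: the class name `OutpatientEpisode`, the action names `ask_patient` and `order_exam`, the `needed_exams` field, and the toy case data are all assumptions made for illustration. It shows evidence being released only in response to a valid action, and a trace that exposes unnecessary exams even when the final answer is correct.

```python
from dataclasses import dataclass, field


@dataclass
class OutpatientEpisode:
    """Hypothetical episode with role-specific views (not MedEvoEval's real schema)."""
    patient_view: dict       # question topic -> patient's answer
    examination_view: dict   # exam name -> result
    manager_view: dict       # ground truth, used only for scoring
    trace: list = field(default_factory=list)

    def step(self, action: str, arg: str) -> str:
        """Reveal evidence only in response to a valid action, and log the step."""
        if action == "ask_patient" and arg in self.patient_view:
            observation = self.patient_view[arg]
        elif action == "order_exam" and arg in self.examination_view:
            observation = self.examination_view[arg]
        else:
            observation = "no evidence revealed"
        self.trace.append({"action": action, "arg": arg, "observation": observation})
        return observation

    def finalize(self, diagnosis: str) -> dict:
        """Score the final answer and list exams outside the manager's needed set."""
        ordered = [t["arg"] for t in self.trace if t["action"] == "order_exam"]
        unnecessary = [e for e in ordered if e not in self.manager_view["needed_exams"]]
        return {
            "final_output": diagnosis,
            "correct": diagnosis == self.manager_view["diagnosis"],
            "unnecessary_exams": unnecessary,
            "trace": self.trace,
        }


# Toy case data, invented for this example.
episode = OutpatientEpisode(
    patient_view={"chief_complaint": "cough and fever for three days"},
    examination_view={"chest_xray": "right lower lobe infiltrate", "head_ct": "normal"},
    manager_view={"diagnosis": "pneumonia", "needed_exams": {"chest_xray"}},
)
episode.step("ask_patient", "chief_complaint")
episode.step("order_exam", "chest_xray")
episode.step("order_exam", "head_ct")  # not needed for this case
report = episode.finalize("pneumonia")
print(report["correct"], report["unnecessary_exams"])  # prints: True ['head_ct']
```

Final-answer scoring alone would report only `correct`; the trace is what makes the extra `head_ct` order visible as a process cost.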
Evaluating Agent Maturation
The framework provides 700 processed episodes for longitudinal analysis of agent behavior. It targets four areas of agent development:
- Memory Maturation: How well the agent stores and utilizes information from past encounters.
- Held-out Transfer: The ability to apply learned clinical reasoning to new, unseen patient cases.
- Update-Stage Response: How the agent adjusts its internal state or strategy based on feedback or new experience.
- Backward Retention: Whether the agent, as it learns new information, keeps its ability to perform previously mastered tasks.
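The text does not give formulas for these measures, so the sketch below is an assumption and not MedEvoEval's scoring code. It borrows the backward-transfer definition commonly used in continual learning for backward retention, and uses a simple first-to-last difference for held-out transfer. The function names, the `scores[s][t]` layout, and the numbers are hypothetical.

```python
from statistics import mean


def backward_retention(scores: list) -> float:
    """Mean change on earlier case blocks after all later update stages.

    Assumed layout: scores[s][t] is the score on case block t, measured
    after update stage s. Negative values indicate forgetting.
    """
    last = len(scores) - 1
    if last < 1:
        raise ValueError("need at least two update stages")
    return mean(scores[last][t] - scores[t][t] for t in range(last))


def held_out_gain(held_out_by_stage: list) -> float:
    """Change in held-out score from the first to the last update stage."""
    if len(held_out_by_stage) < 2:
        raise ValueError("need at least two update stages")
    return held_out_by_stage[-1] - held_out_by_stage[0]


# Invented numbers: three update stages, three case blocks.
scores = [
    [0.60, 0.00, 0.00],
    [0.50, 0.70, 0.00],
    [0.55, 0.60, 0.80],
]
retention = backward_retention(scores)       # about -0.075: mild forgetting
gain = held_out_gain([0.40, 0.50, 0.65])     # about 0.25: improved transfer
print(round(retention, 3), round(gain, 3))
```

Read together, a positive held-out gain with a retention value near zero would suggest the agent is improving through experience without degrading on earlier cases, which is the distinction the four areas are meant to capture.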
This framework provides a concrete, reproducible way to determine if an agent is truly improving through experience or merely memorizing data, offering a more rigorous standard for clinical AI deployment.