The Shift from Static QA to Longitudinal Evolution

Traditional medical AI benchmarks rely on fixed-input question answering, which fails to capture the reality of clinical practice. In a real-world setting, a doctor agent must interactively gather evidence, order examinations, and consult with others before finalizing a diagnosis. MedEvoEval introduces an executable, longitudinal evaluation framework designed to measure how such agents evolve through memory, retrieval, and reflection over time.

Core Components of the MedEvoEval Framework

MedEvoEval uses action-gated simulated outpatient episodes to create a realistic testing environment. Key features include:

  • Role-Specific Views: Each case is segmented into distinct patient, examination, and manager perspectives, so the agent has to obtain each piece of information through the appropriate role rather than reading a single combined record.
  • Action-Gated Evidence: Information is only revealed when the agent performs a valid action, preventing the model from "cheating" by accessing the full case file prematurely.
  • Structured Traces: Every episode generates a detailed log linking observations, actions, final outputs, and manager scores. This lets developers analyze process costs, such as unnecessary testing, that final-answer scoring alone often hides.
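
To make action gating and structured traces concrete, here is a minimal Python sketch. It is not MedEvoEval's actual code or API: the Episode class, its field names, the action names, and the sample findings are all hypothetical, chosen only to show evidence staying hidden until a valid action is taken and every step being logged.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Episode:
    """Hypothetical action-gated episode (illustrative, not MedEvoEval's API)."""

    # Assumed layout: each valid action name maps to the evidence it unlocks.
    gated_evidence: Dict[str, str]
    # Assumed annotation: the actions actually needed to reach the diagnosis.
    required_actions: Set[str]
    # Structured trace: one record per attempted action.
    trace: List[dict] = field(default_factory=list)

    def act(self, action: str) -> Optional[str]:
        """Reveal evidence only for a valid action; log every attempt."""
        observation = self.gated_evidence.get(action)
        self.trace.append(
            {
                "action": action,
                "observation": observation,
                "valid": observation is not None,
            }
        )
        return observation

    def unnecessary_actions(self) -> List[str]:
        """Valid actions that were not required: a process cost visible in the trace."""
        return [
            step["action"]
            for step in self.trace
            if step["valid"] and step["action"] not in self.required_actions
        ]


episode = Episode(
    gated_evidence={
        "ask_symptoms": "fever and cough for three days",
        "order_cbc": "white cell count elevated",
        "order_mri": "no abnormality",
    },
    required_actions={"ask_symptoms", "order_cbc"},
)

episode.act("read_full_chart")  # not a valid action, so nothing is revealed
episode.act("ask_symptoms")
episode.act("order_mri")        # valid, but not needed for this case
print(episode.unnecessary_actions())
```

Because the trace records every attempt, an agent that reaches the right diagnosis after ordering an unneeded test can be told apart from one that reaches it efficiently, which final-answer scoring alone would not show.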

Evaluating Agent Maturation

By providing 700 processed episodes, the framework enables researchers to perform longitudinal analysis on agent behavior. It specifically targets four critical areas of agent development:

  • Memory Maturation: How well the agent stores and utilizes information from past encounters.
  • Held-out Transfer: The ability to apply learned clinical reasoning to new, unseen patient cases.
  • Update-Stage Response: How the agent adjusts its internal state or strategy based on feedback or new experience.
  • Backward Retention: Whether the agent, as it learns new information, keeps its ability to perform previously mastered tasks.
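
One way to turn three of these areas into numbers is sketched below in Python. The function names and formulas are assumptions made for illustration (simple mean score differences); the source does not state how MedEvoEval defines its metrics, so treat this as one possible reading rather than the framework's scoring method.

```python
from typing import Dict, List


def held_out_transfer(baseline: List[float], evolved: List[float]) -> float:
    """Assumed metric: mean score gain on unseen cases, evolved agent minus baseline."""
    if not baseline or not evolved:
        raise ValueError("both score lists must be non-empty")
    return sum(evolved) / len(evolved) - sum(baseline) / len(baseline)


def update_stage_deltas(stage_scores: List[float]) -> List[float]:
    """Assumed metric: score change between consecutive update stages."""
    return [later - earlier for earlier, later in zip(stage_scores, stage_scores[1:])]


def backward_retention(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Assumed metric: mean score change on previously seen cases after an update.

    A negative value means the agent got worse on cases it had already handled.
    """
    shared = before.keys() & after.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    return sum(after[case] - before[case] for case in shared) / len(shared)


# Hypothetical scores in [0, 1], keyed by made-up case IDs.
print(held_out_transfer(baseline=[0.5, 0.5], evolved=[0.75, 0.75]))
print(update_stage_deltas([0.5, 0.75, 0.75]))
print(backward_retention({"case_a": 1.0, "case_b": 0.5}, {"case_a": 0.5, "case_b": 0.5}))
```

Reading the numbers together is what separates real improvement from memorization: an agent whose held-out transfer stays near zero while its scores on seen cases rise is likely memorizing, and a negative backward retention signals forgetting.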

This framework provides a concrete, reproducible way to determine if an agent is truly improving through experience or merely memorizing data, offering a more rigorous standard for clinical AI deployment.