The Need for Multi-Dimensional Medical Benchmarking
Existing medical AI benchmarks are fragmented, often focusing on single-turn QA tasks or text-only dialogues. IMCBench addresses this by introducing an image-grounded, multi-turn conversation framework that simulates realistic patient-clinician interactions. It uses real clinical images paired with synthetic patient profiles to evaluate models across three critical clinical dimensions: safety, accuracy, and the appropriate use of uncertainty in diagnosis.
Performance and Safety Trade-offs
The benchmark evaluated eight frontier models from the Claude, GPT, Nova, and Llama families using an LLM-as-Jury scoring system calibrated by expert clinicians. Claude Opus 4.6 led with a score of 3.61, followed by Claude Sonnet 4.6 (3.30) and GPT-5.2 (3.29).
Crucially, the study identifies a significant safety degradation when models encounter malignant or rare conditions, with safety scores dropping by 0.27 in these scenarios. The research highlights that high performance in clinical description does not inherently translate to safe patient guidance, suggesting that current models struggle to maintain safety guardrails as the complexity of the medical case increases.
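To make the reported safety gap concrete, the sketch below shows one way jury scores could be aggregated and compared across condition types. It is a minimal illustration, not the benchmark's actual pipeline: the field names, the "common" baseline group, the simple mean across jurors, and all score values are assumptions made for the example.

```python
from statistics import mean

# Hypothetical jury output: one record per conversation, holding each
# juror's safety score. Field names and values are illustrative only.
conversations = [
    {"condition": "common", "juror_safety": [4.0, 3.5, 4.0]},
    {"condition": "common", "juror_safety": [3.5, 3.5, 3.0]},
    {"condition": "malignant_or_rare", "juror_safety": [3.0, 3.5, 3.5]},
    {"condition": "malignant_or_rare", "juror_safety": [3.5, 3.0, 3.0]},
]


def jury_score(juror_scores):
    """Combine one conversation's juror scores by simple averaging (an assumption)."""
    return mean(juror_scores)


def mean_safety_by_condition(rows):
    """Group conversations by condition type and average their jury scores."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["condition"], []).append(
            jury_score(row["juror_safety"])
        )
    return {cond: mean(scores) for cond, scores in grouped.items()}


by_condition = mean_safety_by_condition(conversations)

# Positive gap means safety is lower on malignant or rare conditions.
safety_gap = by_condition["common"] - by_condition["malignant_or_rare"]
print(f"Safety gap: {safety_gap:.2f}")
```

A gap computed this way corresponds in spirit to the 0.27 drop reported above, although the study's own grouping and aggregation may differ.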
The Role of Multimodal Context
Ablation studies demonstrate that both visual inputs and Electronic Health Record (EHR) context are essential for safe clinical guidance. Removing visual input resulted in an average safety drop of 0.18, while removing EHR context caused a drop of 0.23. The findings indicate that stronger models are more effective at integrating these visual features, reinforcing the necessity of multi-modal, context-rich evaluation frameworks for clinical AI deployment.
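The ablation comparison can be sketched as a per-model difference between the full-input setting and each reduced setting, averaged over models. The example below is a hedged illustration: the model names, the setting labels (`full`, `no_image`, `no_ehr`), and the scores are hypothetical and do not come from the study.

```python
from statistics import mean

# Hypothetical safety scores per model under three input settings.
# Names and numbers are placeholders, not benchmark results.
safety_scores = {
    "model_a": {"full": 3.6, "no_image": 3.3, "no_ehr": 3.4},
    "model_b": {"full": 3.2, "no_image": 3.1, "no_ehr": 2.9},
}


def average_drop(scores, ablation):
    """Average, over models, of the full-input score minus the ablated score."""
    return mean(s["full"] - s[ablation] for s in scores.values())


def per_model_drop(scores, ablation):
    """Per-model drop, useful for checking whether stronger models lose more."""
    return {name: s["full"] - s[ablation] for name, s in scores.items()}


image_drop = average_drop(safety_scores, "no_image")
ehr_drop = average_drop(safety_scores, "no_ehr")
print(f"Average drop without image: {image_drop:.2f}")
print(f"Average drop without EHR:   {ehr_drop:.2f}")
```

Looking at `per_model_drop` rather than only the average is one way to examine the claim that stronger models rely more heavily on visual input, since a model that integrates images well should lose more when they are removed.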