Reframing Treatment Reasoning as an Iterative Process
Treatment reasoning is inherently iterative: it requires integrating disease context, comorbidities, and evolving biomedical knowledge. Traditional LLMs often struggle with this because they cannot verify information against external, grounded sources. ATHENA-R1 addresses this by reframing treatment reasoning as a learnable process of iterative evidence gathering. The agent works over a "universe" of 212 biomedical tools: it identifies what information is missing, executes the relevant tools, and synthesizes the returned evidence before reaching a conclusion.
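The loop described above (identify missing information, execute a tool, synthesize) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ATHENA-R1's implementation: the `Tool` interface, `AgentState`, `pick_next_tool`, `run_agent`, and the stub tool names are all hypothetical, and the tool-selection policy here is a trivial placeholder for what the trained model would decide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical tool interface: a tool takes the question and returns evidence text.
Tool = Callable[[str], str]


@dataclass
class AgentState:
    question: str
    evidence: Dict[str, str] = field(default_factory=dict)  # tool name -> result


def pick_next_tool(state: AgentState, tools: Dict[str, Tool]) -> Optional[str]:
    """Toy stand-in for the learned policy: pick the first unused tool.

    Returns None when every tool has been consulted, i.e. nothing is missing.
    """
    for name in tools:
        if name not in state.evidence:
            return name
    return None


def run_agent(question: str, tools: Dict[str, Tool], max_steps: int = 5) -> str:
    """Gather evidence step by step, then synthesize it into one report."""
    state = AgentState(question)
    for _ in range(max_steps):
        name = pick_next_tool(state, tools)  # 1. identify missing information
        if name is None:
            break
        state.evidence[name] = tools[name](question)  # 2. execute the tool
    # 3. synthesize: here just a concatenation of the gathered evidence
    return "\n".join(f"[{name}] {text}" for name, text in state.evidence.items())


if __name__ == "__main__":
    # Stub tools with placeholder output; the names are illustrative only.
    stub_tools = {
        "label_lookup": lambda q: f"stub label text for: {q}",
        "interaction_check": lambda q: f"stub interaction report for: {q}",
    }
    print(run_agent("example treatment question", stub_tools))
```

The `max_steps` cap is a design choice of this sketch: it bounds the number of tool calls so the loop always terminates, even if the selection policy never reports that the evidence is complete.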
A Two-Level Self-Learning Framework
The authors trained ATHENA-R1 without human-annotated traces by employing a two-level self-learning framework:
- Multi-Agent Construction: A system of agents generates the necessary tools, tasks, and reasoning trajectories to create a dataset for supervised fine-tuning.
- Reinforcement Learning (RL): The model is further refined using RL with scientific feedback. The reward function specifically targets reasoning quality, including evidence gathering, grounded tool use, and logical non-redundancy.
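To make the three reward targets concrete, here is one way such a reward could be composed. The summary does not give the actual reward formula, so everything below is an assumption for illustration: the `ToolCall` and `Trajectory` types, the three scoring functions, the `min_calls` threshold, and the equal default weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str


@dataclass
class Trajectory:
    calls: List[ToolCall]    # tool calls made during reasoning, in order
    cited_tools: List[str]   # tools the final answer cites as evidence


def evidence_score(traj: Trajectory, min_calls: int = 2) -> float:
    """Evidence gathering: credit for distinct calls, capped at 1.0 (assumed threshold)."""
    return min(len(set(traj.calls)) / min_calls, 1.0)


def grounding_score(traj: Trajectory) -> float:
    """Grounded tool use: fraction of cited tools that were actually called."""
    if not traj.cited_tools:
        return 0.0
    called = {c.tool for c in traj.calls}
    return sum(t in called for t in traj.cited_tools) / len(traj.cited_tools)


def non_redundancy_score(traj: Trajectory) -> float:
    """Non-redundancy: fraction of calls that are not exact repeats."""
    if not traj.calls:
        return 1.0
    return len(set(traj.calls)) / len(traj.calls)


def reward(traj: Trajectory, weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted mean of the three component scores (weights are an assumption)."""
    parts = (evidence_score(traj), grounding_score(traj), non_redundancy_score(traj))
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)


if __name__ == "__main__":
    clean = Trajectory(
        calls=[ToolCall("tool_a", "x"), ToolCall("tool_b", "y")],
        cited_tools=["tool_a", "tool_b"],
    )
    sloppy = Trajectory(
        calls=[ToolCall("tool_a", "x"), ToolCall("tool_a", "x"), ToolCall("tool_b", "y")],
        cited_tools=["tool_a", "tool_c"],  # tool_c was never called
    )
    print(reward(clean), reward(sloppy))
```

In this sketch a trajectory that repeats a call or cites a tool it never ran scores lower than a clean one, which is the qualitative behavior the reward description implies.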
Performance and Clinical Validation
ATHENA-R1 is evaluated on benchmarks, in blinded expert review, and against real-world patient records:
- Benchmark Performance: It achieved 94.7% accuracy on open-ended drug reasoning and 82.9% on patient treatment cases, outperforming GPT-5 by 17.8 and 10.7 percentage points, respectively.
- Expert Preference: In blinded evaluations by experts from 28 rare disease organizations, ATHENA-R1 was preferred over reference models across all criteria.
- Real-World Impact: The agent generated adverse-event hypotheses that were tested against electronic health records from 5.4 million patients. The tested associations had adjusted odds ratios of 1.48–1.84 (values above 1 indicate higher odds of the adverse event), demonstrating the model's ability to produce clinically actionable insights.
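For readers less familiar with odds ratios, the sketch below computes a crude (unadjusted) odds ratio and a Wald 95% confidence interval from a 2x2 table. It is not the analysis behind the reported figures: adjusted odds ratios come from a regression model that controls for covariates, which this example omits. The function name `odds_ratio` and the counts in the usage example are hypothetical and chosen only to show the arithmetic.

```python
import math
from typing import Tuple


def odds_ratio(a: int, b: int, c: int, d: int) -> Tuple[float, float, float]:
    """Crude odds ratio with a Wald 95% confidence interval.

    a: exposed patients with the event      b: exposed patients without it
    c: unexposed patients with the event    d: unexposed patients without it
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    est, lo, hi = odds_ratio(a=30, b=70, c=20, d=80)
    print(f"OR = {est:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

An interval that lies entirely above 1 suggests the exposure is associated with higher odds of the event; an interval that includes 1 is compatible with no association.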