COMPASS: Improving Compositional Control in Multimodal Models

Bridging Perception and Generation with Intent Anchors

Current unified multimodal models often struggle to translate high-level visual intent, such as specific object placement or scene organization, into controllable generation. COMPASS addresses this by grounding composition-intent control in a single system that covers both perception and generation. The core innovation is a shared expert token, denoted $\tau_c$, which acts as a central intent anchor.

On the perception side, the system injects composition expertise into a Mixture-of-Experts (MoE) backbone. The injection is designed to be minimally invasive: it distills complex visual composition analysis into the $\tau_c$ token. On the generation side, the same token is reused as a global conditioning signal to steer the denoising trajectory. Reusing the token converts passive analysis into explicit, actionable layout control, allowing the model to follow compositional instructions more reliably.
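
The two roles of the shared token can be illustrated with a deliberately tiny sketch. It is not the COMPASS architecture: the function names (`composition_token`, `denoise_step`), the softmax gate, and the "nudge toward the token" update are all assumptions chosen only to show one vector being produced on the perception side and then reused as a global condition on the generation side.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def composition_token(expert_outputs, gate_logits):
    """Toy perception side (hypothetical): mix expert outputs into one
    intent vector, standing in for tau_c."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


def denoise_step(latent, tau_c, step_size=0.1):
    """Toy generation side (hypothetical): every latent position is
    nudged toward the same tau_c, i.e. tau_c acts as a global condition."""
    return [
        [x + step_size * (c - x) for x, c in zip(position, tau_c)]
        for position in latent
    ]


# Two toy "experts" with equal gate weight give tau_c = [0.5, 0.5].
tau_c = composition_token([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

# The same tau_c then steers an iterative update over two latent positions.
latent = [[0.0, 0.0], [1.0, 1.0]]
for _ in range(50):
    latent = denoise_step(latent, tau_c)
```

In a real system the gate, the experts, and the denoiser would be learned networks; the sketch only shows the data flow in which a single vector links analysis to generation.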

Standardizing Compositional Evaluation

To address the lack of systematic benchmarks for this task, the authors introduce Comp-11, a large-scale dataset designed for instruction-following composition learning. Comp-11 features an 11-class taxonomy and includes reasoning-augmented annotations, which allows for more rigorous evaluation of how well models understand and execute compositional requirements. Experimental results indicate that COMPASS significantly outperforms existing baselines in both category-level composition understanding and prompt-faithful image generation. The authors take this as evidence that grounding intent in a shared token architecture is a viable path toward more controllable multimodal systems.
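
As a rough illustration of what category-level evaluation over an 11-class taxonomy might involve, the sketch below computes overall and per-class accuracy from gold and predicted labels. The class names are placeholders, since the taxonomy's labels are not given here, and `per_class_accuracy` is a hypothetical helper rather than the benchmark's official metric.

```python
from collections import defaultdict

# Placeholder labels: the text states there are 11 classes but does not name them.
CLASSES = [f"class_{i}" for i in range(11)]


def per_class_accuracy(gold, pred):
    """Return (overall_accuracy, {class: accuracy}) for parallel label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    total = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    overall = sum(correct.values()) / len(gold)
    # Only report classes that actually occur in the gold labels.
    by_class = {c: correct[c] / total[c] for c in CLASSES if total[c] > 0}
    return overall, by_class


gold = ["class_0", "class_0", "class_1", "class_2"]
pred = ["class_0", "class_1", "class_1", "class_2"]
overall, by_class = per_class_accuracy(gold, pred)
# overall is 0.75; class_0 scores 0.5, class_1 and class_2 score 1.0
```

Reporting accuracy per class alongside the overall figure makes it visible when a model does well on common composition types but poorly on rare ones.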
