Turn-Based AI Harnesses Limit Collaboration—Native Interactivity Scales Better

Current AI relies on turn-based loops in which the model processes a complete input before responding, freezing perception while it generates or while the user speaks. This narrow channel discards rich cues like mid-sentence pauses, visual changes, or unstated intent. Workarounds such as voice-activity detection (VAD) bolt on less intelligent components outside the model, which prevents proactive reactions or simultaneous speaking and listening.
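
To make the contrast concrete, here is a minimal sketch with hypothetical callables (get_user_turn, generate_reply, step_model): the turn-based harness cannot perceive anything while it generates, while the continuous loop keeps ingesting input at a fixed cadence.

```python
import queue
import threading
import time

def turn_based_loop(get_user_turn, generate_reply):
    """Classic harness: perception is frozen while the model generates."""
    while True:
        user_turn = get_user_turn()        # blocks until the user finishes speaking/typing
        reply = generate_reply(user_turn)  # no new input is perceived during generation
        print(reply)

def continuous_loop(input_stream, step_model):
    """Continuous alternative: input keeps arriving while output is produced."""
    inbox = queue.Queue()

    def pump():
        for chunk in input_stream:         # audio frames, video frames, text, or silence
            inbox.put(chunk)

    threading.Thread(target=pump, daemon=True).start()
    while True:
        chunk = inbox.get() if not inbox.empty() else None
        for piece in step_model(chunk):    # the model may speak, backchannel, or stay quiet
            print(piece, end="", flush=True)
        time.sleep(0.2)                    # roughly the 200 ms cadence described below
```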

Interaction models fix this by baking continuous interactivity into the model core, following the 'bitter lesson': general, learned capabilities outpace hand-crafted systems. TML-Interaction-Small, a 276B-parameter Mixture-of-Experts (MoE) model with 12B active parameters, processes audio, video, and text in parallel streams, reacting to what it sees without being prompted (e.g., counting pushups on camera) and interjecting verbally mid-sentence.
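
One way to picture the parallel streams, as an illustrative sketch rather than the released architecture: each modality emits time-stamped chunks that are merged into a single time-ordered sequence for the model to attend over.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Chunk:
    t_ms: int                              # timestamp used for time alignment
    modality: str = field(compare=False)   # "audio" | "video" | "text"
    payload: bytes = field(compare=False)

def merge_streams(*streams):
    """Merge per-modality chunk streams into one time-ordered sequence."""
    return heapq.merge(*streams, key=lambda c: c.t_ms)

# Usage: the model consumes one interleaved sequence instead of whole turns.
audio = [Chunk(0, "audio", b"..."), Chunk(200, "audio", b"...")]
video = [Chunk(0, "video", b"..."), Chunk(200, "video", b"...")]
text  = [Chunk(150, "text", b"hi")]
for chunk in merge_streams(audio, video, text):
    pass  # feed each chunk's embedding into the decoder as it arrives
```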

Dual-Model Micro-Turns Enable Seamless Real-Time Flow

The workload is split: an always-on interaction model handles live conversation (interruptions, backchanneling) by interleaving time-aligned 200 ms chunks of input and output. For deeper tasks (tool calls, web search), it hands the full conversation context to a background model and interleaves the streamed results back into the dialogue without abrupt switches, like a colleague passing notes mid-chat.
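
A rough sketch of that split, with hypothetical interfaces (next_input_chunk, respond, needs_deep_work): the interaction model runs a ~200 ms micro-turn loop and hands long-running work to a background task whose partial results are folded back into the conversation as they arrive.

```python
import asyncio

CHUNK_MS = 200  # micro-turn cadence described above

def needs_deep_work(chunk: str) -> bool:
    # Hypothetical trigger: delegate when the user asks for a search or tool call.
    return "search" in chunk.lower()

async def background_task(context: list[str], request: str, results: asyncio.Queue):
    """Deep work (tool calls, web search) done off the hot path, given the full context."""
    # ... call tools or a larger model here; push partial results as they become available
    await results.put(f"[partial result for: {request}]")

async def interaction_loop(next_input_chunk, respond):
    """Always-on loop: every ~200 ms, read input, fold in background results, emit output."""
    context: list[str] = []
    results: asyncio.Queue = asyncio.Queue()
    while True:
        chunk = await next_input_chunk()          # speech, a visual change, text, or silence
        context.append(chunk)
        if needs_deep_work(chunk):
            asyncio.create_task(background_task(context.copy(), chunk, results))
        while not results.empty():
            context.append(results.get_nowait())  # interleave streamed results, no hard switch
        await respond(context)                    # may speak, backchannel, or stay silent
        await asyncio.sleep(CHUNK_MS / 1000)
```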

Multimodality uses encoder-free early fusion: audio enters as dMel features through lightweight embeddings, video as 40x40 patches through an hMLP stem, and speech output through a flow head, all co-trained with the transformer rather than relying on heavy pretrained encoders like Whisper. Inference is optimized with streaming sessions (upstreamed to SGLang) that append chunks to persistent GPU sequences, plus gather+gemv MoE kernels for low-latency bidirectional serving.
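
A sketch of what encoder-free early fusion can look like; the widths, patch-stem layers, and module names below are assumptions, not the released design. Raw audio features and raw video patches pass through lightweight projections straight into the transformer's embedding space, so everything trains together and no frozen Whisper-style encoder sits in front.

```python
import torch
import torch.nn as nn

D_MODEL = 2048   # assumed transformer width
N_MEL = 80       # assumed dMel-style log-mel bins per audio frame
PATCH = 40       # 40x40 video patches, as described

class EarlyFusionEmbedder(nn.Module):
    """Lightweight, co-trained embeddings instead of heavy pretrained encoders."""
    def __init__(self, vocab_size: int = 32000):
        super().__init__()
        self.text_embed  = nn.Embedding(vocab_size, D_MODEL)
        self.audio_embed = nn.Linear(N_MEL, D_MODEL)        # per dMel frame
        self.video_embed = nn.Sequential(                   # simple MLP stand-in for an hMLP patch stem
            nn.Linear(3 * PATCH * PATCH, D_MODEL), nn.GELU(),
            nn.Linear(D_MODEL, D_MODEL),
        )

    def forward(self, text_ids, audio_frames, video_patches):
        # Each modality is projected, then interleaved in time order for the transformer.
        return torch.cat([
            self.text_embed(text_ids),        # (T_text, D)
            self.audio_embed(audio_frames),   # (T_audio, D)
            self.video_embed(video_patches),  # (T_video, D)
        ], dim=0)

# Usage: one decoder consumes the fused sequence; speech output comes from a separate flow head.
emb = EarlyFusionEmbedder()
fused = emb(torch.tensor([1, 2, 3]),
            torch.randn(10, N_MEL),
            torch.randn(4, 3 * PATCH * PATCH))
```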

This yields built-in capabilities: simultaneous speech (e.g., live translation), time-aware initiation (speak at specified times), concurrent tools/UI generation, and visual proactivity (flag code bugs on-screen).
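
Time-aware initiation, for example, falls out of the same chunked loop. A toy sketch (hypothetical speak callable and a hard-coded schedule) of speaking at a requested time without any new user input:

```python
import time

CHUNK_MS = 200

def timed_initiation_loop(speak, now=time.monotonic):
    """Toy loop: track elapsed time and start speaking at the requested moment."""
    scheduled = [(5.0, "Five seconds are up. Time for the next set.")]  # (delay in s, utterance)
    start = now()
    while scheduled:
        elapsed = now() - start
        due = [u for t, u in scheduled if elapsed >= t]
        scheduled = [(t, u) for t, u in scheduled if elapsed < t]
        for utterance in due:
            speak(utterance)           # the model initiates; no user input triggered this
        time.sleep(CHUNK_MS / 1000)    # same ~200 ms cadence as the interaction loop

# Usage:
# timed_initiation_loop(print)
```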

Leads Benchmarks on Interaction and Proactive Tasks

TML-Interaction-Small tops instant models: 43.4% Audio MultiChallenge APR (vs. GPT-realtime-2.0 minimal 37.6%, Gemini-3.1-flash-live-preview minimal 26.8%); 77.8 average FD-bench v1.5 interaction quality (vs. Gemini 54.3, GPT-realtime-2.0 xhigh 47.8); 0.40s FD-bench v1 turn latency (vs. Gemini 0.57s); 82.8% FD-bench v3 response quality / 68.0% Pass@1 with background agent.

New benchmarks expose gaps where rivals score near zero: TimeSpeak 64.7 macro-accuracy (vs. GPT 4.3); CueSpeak 81.7 (vs. 2.9); RepCount-A 35.4 off-by-one accuracy on visual counting (vs. 1.3); ProactiveVideoQA 33.5 PAUC@0.5 (vs. 25.0 baseline); Charades 32.4 mIoU temporal localization (vs. 0).

Preview Access and Trade-offs for Builders

A limited research preview is available via thinkingmachines.ai (apply for access; grants are offered for benchmarks). It is not production-ready: long sessions bloat the context, the 200ms streams need reliable networks, and larger models are not expected until 2026. Safety: 99.0% refusal rate on Harmbench. Use it for human-AI collaboration research, and contribute benchmarks to advance interactivity evals.