Cascaded Pipelines Deliver Intelligence but Fail at Human Latency

Cascaded systems (speech-to-text, then an LLM, then text-to-speech) power today's useful voice agents, but they can't mimic human conversation. Humans respond within roughly 200ms, covering understanding, reasoning, and starting to speak; even a fast TTS stage alone exceeds 200ms, and tool calls add another 500ms to 4s, making interactions feel laggy and robotic. Fillers help mask this: while awaiting a tool result, the LLM generates chit-chat (e.g., praising Tokyo's shrines during a hotel search) to maintain conversational flow, hiding unpredictable latencies without fixing the root architectural flaw.
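A minimal sketch of the filler trick, assuming a hypothetical async pipeline; every function here (transcribe, call_llm, speak, search_hotels) is a made-up stub standing in for a pipeline stage, not a real API:

```python
import asyncio
import random

# Hypothetical stubs for the three cascaded stages plus a slow tool.
async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.3)                       # STT latency
    return "find me a hotel in Tokyo"

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.4)                       # LLM latency
    return f"[llm reply to: {prompt[:40]}]"

async def speak(text: str) -> None:
    await asyncio.sleep(0.25)                      # TTS alone already > 200ms
    print("agent:", text)

async def search_hotels(city: str) -> str:
    await asyncio.sleep(random.uniform(0.5, 4.0))  # unpredictable tool call
    return f"3 hotels found in {city}"

async def handle_turn(audio: bytes) -> None:
    user_text = await transcribe(audio)

    # Start the slow tool call without blocking the conversation.
    tool_task = asyncio.create_task(search_hotels("Tokyo"))

    # Mask the wait with filler chit-chat (e.g., praising Tokyo's
    # shrines) so the user never hears dead air.
    while not tool_task.done():
        filler = await call_llm(f"brief small talk related to: {user_text}")
        await speak(filler)

    # Tool result is in; generate and speak the real answer.
    answer = await call_llm(f"answer '{user_text}' using: {tool_task.result()}")
    await speak(answer)

asyncio.run(handle_turn(b""))
```

The filler buys time but only hides the latency; the 200ms human budget is still blown on every stage boundary.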

Even demos from top players like ElevenLabs show smooth voices but high latency and no handling of overlapping speech, so conversations feel scripted. Voice agents are gaining traction as LLMs get smarter, but voice today merely wraps text intelligence and ignores the nuances only audio carries.

Full-Duplex Unlocks Natural Overlap, but Stupidity Persists

Speech-to-speech models cut latency by skipping transcription, but most remain half-duplex (they listen or speak, never both), so they break on backchanneling: short overlaps like 'uh-huh' that fill roughly 20% of human talk and signal active listening (a deeply cultural habit in Japanese). Half-duplex demos frustrate: the model halts on any interruption, effectively demanding that you not interrupt, even though humans interrupt constantly.
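To see why, here is a caricature of half-duplex turn-taking (a toy illustration, not any product's actual logic): while the agent holds the floor, user audio can only be dropped or treated as a hard interruption.

```python
from enum import Enum

class State(Enum):
    LISTENING = 1
    SPEAKING = 2

class HalfDuplexAgent:
    """Caricature of half-duplex turn-taking: one channel at a time."""

    def __init__(self) -> None:
        self.state = State.LISTENING

    def respond(self, utterance: str) -> str:
        self.state = State.SPEAKING
        return f"<long monologue about {utterance}>"

    def on_user_audio(self, utterance: str) -> str:
        if self.state == State.SPEAKING:
            # A backchannel like "uh-huh" can't be absorbed mid-speech:
            # the only options are to ignore it or abort the response.
            self.state = State.LISTENING
            return "<response aborted: user interrupted>"
        return self.respond(utterance)

agent = HalfDuplexAgent()
agent.respond("ship supplies")        # agent takes the floor
print(agent.on_user_audio("uh-huh"))  # a mere backchannel kills the turn
```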

Moshi, the first full-duplex model (~2 years old), handles overlaps robustly: users can talk over it mid-response (e.g., asking about ship supplies while it plots a trajectory to Sirius), and it adapts without breaking. Nvidia's recent Personal Voice builds on it. Yet Moshi remains 'stupid': no tools, no agents, no tasks, so it becomes pointless after a few minutes. A full-duplex model trained on real dialogue data can flow indistinguishably from a human, but it lacks the reliability, observability, and intelligence of cascaded systems. The path forward: infuse full-duplex models with agent smarts.
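Architecturally, a full-duplex model folds listening and speaking into one frame-synchronous loop: every timestep it consumes a frame of user audio and emits a frame of its own, so overlap is the default rather than an error case. The sketch below is illustrative only; DummyModel, the frame strings, and the loop shape are assumptions, not Moshi's real interface:

```python
class DummyModel:
    """Stand-in for a full-duplex speech model. Real models exchange
    audio tokens every ~80ms frame; here we pass strings around."""

    def initial_state(self):
        return None

    def step(self, user_frame: str, state):
        # One forward pass consumes a user frame AND emits an agent
        # frame: "listening" and "speaking" are the same computation.
        agent_frame = "<speech>" if user_frame == "<silence>" else "<ack>"
        return agent_frame, state

def full_duplex_loop(model, user_frames):
    """No turn-taking state machine: both streams advance together."""
    state = model.initial_state()
    for user_frame in user_frames:
        agent_frame, state = model.step(user_frame, state)
        yield user_frame, agent_frame

# The user backchannels mid-response; the model keeps talking and
# folds the overlap into its state instead of halting.
frames = ["<silence>", "<silence>", "uh-huh", "<silence>"]
for u, a in full_duplex_loop(DummyModel(), frames):
    print(f"user: {u:10s} | agent: {a}")
```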

Paralinguistics and Economics Block Emotional Depth

Voice encodes tone, hesitation, and discomfort: paralinguistic cues that transcription throws away. 'Her' nails this; Samantha infers unease from tone alone. Speech-to-speech models retain these cues but ignore them unless trained to respond to them (factual QA datasets teach nothing emotional). A true 'Her' demands models that exploit voice for empathy, a science challenge that prompting alone won't solve.
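For cascaded stacks, a partial workaround is to extract prosodic features alongside the transcript so the language model at least sees what transcription dropped. A sketch using librosa; the feature choices and thresholds are illustrative, not a validated recipe:

```python
import numpy as np
import librosa

def prosody_features(wav_path: str) -> dict:
    """Extract rough paralinguistic cues that transcription discards."""
    y, sr = librosa.load(wav_path, sr=16000)

    # Pitch contour: rising pitch and high variance can signal unease.
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0 = f0[~np.isnan(f0)]

    # Energy and pausing: hesitation shows up as low-energy gaps.
    rms = librosa.feature.rms(y=y)[0]
    pause_ratio = float(np.mean(rms < 0.1 * rms.max()))

    return {
        "pitch_mean_hz": float(f0.mean()) if f0.size else 0.0,
        "pitch_std_hz": float(f0.std()) if f0.size else 0.0,
        "pause_ratio": pause_ratio,
    }

# These features could then condition the reply, e.g. prepended to the
# prompt: "speaker sounds hesitant (pause_ratio=0.4); respond gently."
```

This recovers some signal, but as the paragraph above argues, models that natively exploit voice for empathy remain a research problem, not a prompting trick.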

Scaling is the other killer: hours-long sessions burn cash on TTS (hyperscalers run it at a loss; startups exhaust funding before they can grow). LLM costs are negligible by comparison; TTS dominates the bill (rough arithmetic below). Privacy pushes synthesis on-device: Gradium's Phonon (<1B parameters) runs on smartphone CPUs with voice cloning, outperforming earlier open models like Coqui, which lacked cloning. That enables consumer apps with no API fees, total privacy, and no cloud latency, demoed with a Rick & Morty voice synthesized smoothly on-device.
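Back-of-envelope arithmetic behind 'TTS dominates'; every price below is an assumption for the sketch, not a quote from any provider:

```python
# Illustrative cost model for one hour of conversation.
SPEECH_SHARE = 0.5              # agent speaks ~half of each hour
WORDS_PER_MIN = 150             # typical speaking rate
CHARS_PER_WORD = 6              # rough average, incl. spaces

tts_price_per_1k_chars = 0.05   # assumed hosted-TTS price (USD)
llm_price_per_1m_tokens = 0.50  # assumed LLM output price (USD)

agent_minutes = 60 * SPEECH_SHARE
chars = agent_minutes * WORDS_PER_MIN * CHARS_PER_WORD   # 27,000 chars
tokens = chars / 4                                       # ~4 chars/token

tts_cost = chars / 1000 * tts_price_per_1k_chars
llm_cost = tokens / 1e6 * llm_price_per_1m_tokens

print(f"TTS: ${tts_cost:.2f}/hour")    # ~$1.35/hour
print(f"LLM: ${llm_cost:.4f}/hour")    # ~$0.0034/hour
```

Under these assumptions TTS costs several hundred times more than the LLM per hour, which is why on-device synthesis changes the economics.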