Production Pitfalls of One-Shot AI Features

Generic chatbots crumble under real use: users demand to-do lists, follow-up emails in their own style, or meeting coaching, and LLMs misfire (e.g., interpreting "coach me on meetings" as a question about football). Web search, a single tool call in a playground, explodes in production: complex queries balloon token usage and context, costing roughly 10p per chat at scale. Provider updates can silently degrade results overnight, with little visibility into what failed; the fact that billion-dollar companies exist just to do search well is proof it isn't trivial.
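The cost blow-up is easy to see with a back-of-envelope model. The sketch below is illustrative only; the token counts and per-million-token prices are assumptions, not measured figures from the source.

```typescript
// Back-of-envelope cost model for a search-augmented chat turn.
// All numbers passed in are illustrative assumptions.
function chatTurnCost(
  searchResults: number,
  tokensPerResult: number,
  promptTokens: number,
  outputTokens: number,
  inputPricePerMTok: number, // $ per 1M input tokens (assumed)
  outputPricePerMTok: number, // $ per 1M output tokens (assumed)
): number {
  // Every search result gets stuffed into the context window.
  const inputTokens = promptTokens + searchResults * tokensPerResult;
  return (inputTokens * inputPricePerMTok + outputTokens * outputPricePerMTok) / 1_000_000;
}

// A complex query pulling 10 results of ~1,500 tokens each dwarfs the prompt itself.
const cost = chatTurnCost(10, 1500, 800, 500, 3, 15);
console.log(cost.toFixed(4)); // dollars per turn
```

With numbers in that range, a single searchy turn lands in the few-pence region, which multiplied across every chat is where the "10p per chat" pain comes from.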

A single prompt can't adapt to roles: sales users want deal summaries, engineers need action items, blockers, and Linear tickets, and HR expects different outputs again. LLMs behave like black boxes, resisting molding without deep insight into what they're doing.
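One way to escape the single-prompt trap is to route each request through a role-specific template. A minimal sketch, assuming hypothetical role names and prompt wording (none of these templates are from the source):

```typescript
// Sketch: route one meeting transcript through role-specific prompts
// instead of one generic prompt. Roles and templates are illustrative.
type Role = "sales" | "engineering" | "hr";

const rolePrompts: Record<Role, string> = {
  sales: "Summarize this meeting as a deal update: stage, objections, next steps.",
  engineering: "Extract action items, blockers, and candidate Linear tickets.",
  hr: "Summarize interpersonal themes and follow-ups, omitting deal specifics.",
};

function buildPrompt(role: Role, transcript: string): string {
  return `${rolePrompts[role]}\n\nTranscript:\n${transcript}`;
}
```

The point is not the templates themselves but that role becomes an explicit, testable input rather than something the model guesses.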

Custom Tracing Unlocks Iteration for All Teams

Build internal tracing that wraps LLM calls (e.g., via LangChain) to log tool calls, search trails, reasoning traces, and costs to a database. Structure the data precisely, then surface it in a simple UI accessible to engineering, product, data, and CX teams, rather than burying it in CloudWatch queries.
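A minimal sketch of such a wrapper, assuming an in-memory event list as a stand-in for the database and made-up field names (`TraceEvent`, `Tracer.traced` are illustrative, not any library's API):

```typescript
// One structured row per step in an agent loop: LLM call, tool call, or search.
interface TraceEvent {
  traceId: string;
  step: number;
  kind: "llm" | "tool" | "search";
  name: string;
  input: string;
  output: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  durationMs: number;
}

class Tracer {
  readonly events: TraceEvent[] = []; // stand-in for a DB table
  private step = 0;
  constructor(private traceId: string) {}

  // Wrap any step; the wrapped fn reports its own output and usage.
  async traced<T>(
    kind: TraceEvent["kind"],
    name: string,
    input: string,
    fn: () => Promise<{ output: T; inputTokens: number; outputTokens: number; costUsd: number }>,
  ): Promise<T> {
    const start = Date.now();
    const result = await fn();
    this.events.push({
      traceId: this.traceId,
      step: this.step++,
      kind,
      name,
      input,
      output: String(result.output),
      inputTokens: result.inputTokens,
      outputTokens: result.outputTokens,
      costUsd: result.costUsd,
      durationMs: Date.now() - start,
    });
    return result.output;
  }
}
```

Because each row carries the step index, kind, tokens, and cost, a simple UI can replay an entire agent loop in order instead of making teams grep unstructured logs.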

This exposes exact failure points: follow an agent loop front to back to pinpoint why an output feels off (e.g., a bad tool call or degraded search). Founders and non-engineers can now trace issues directly, enabling targeted fixes. Off-the-shelf SaaS observability was previously too rigid; now you can one-shot these internal tools yourself for tailored control, outperforming a generic OpenTelemetry setup.

Electron-to-Web Refactor Accelerates Testing

Desktop apps like Granola (an Electron-based meeting-notes app with real-time transcription) limit parallel testing: one instance at a time, and PR reviews mean manual local builds and installs. Refactor by separating the main process (system APIs) from the renderer (frontend): abstract IPC behind web standards (e.g., routers, sessions, and data queries via React APIs).
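The IPC abstraction might look like the following sketch. The channel names, the `window.electronAPI` bridge shape, and the route layout are all assumptions for illustration, not Granola's actual code:

```typescript
// Abstract Electron IPC behind a transport so the renderer runs unchanged
// in a browser. All names here are illustrative assumptions.
interface Transport {
  invoke(channel: string, payload: unknown): Promise<unknown>;
}

// Desktop build: delegate to a contextBridge-exposed IPC bridge.
class IpcTransport implements Transport {
  invoke(channel: string, payload: unknown): Promise<unknown> {
    // assumes preload exposed `window.electronAPI.invoke`
    return (window as any).electronAPI.invoke(channel, payload);
  }
}

// Web build: the same channels become HTTP routes with cookie sessions.
class HttpTransport implements Transport {
  constructor(private baseUrl: string) {}
  async invoke(channel: string, payload: unknown): Promise<unknown> {
    const res = await fetch(`${this.baseUrl}/${channel}`, {
      method: "POST",
      credentials: "include", // session cookie replaces main-process trust
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });
    return res.json();
  }
}

// Renderer code depends only on Transport, so the same React hooks and
// queries work in both the desktop and web builds.
function makeClient(transport: Transport) {
  return {
    getTranscript: (meetingId: string) =>
      transport.invoke("transcript.get", { meetingId }),
  };
}
```

Swapping `IpcTransport` for `HttpTransport` at build time is what lets the renderer deploy as a plain web app without touching feature code.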

Deploy the renderer as a web app; CI then generates PR preview links for instant testing of variants. LLMs can self-verify: Cursor auto-tests PRs and uploads screenshots, slashing manual effort. The result: experiment with UI/UX variants in the real product (not Figma) and ship polished features with conviction. Tauri was evaluated but skipped; it offered no massive performance gains over the stability of Electron's APIs.

Feedback loops turn AI shipping into "tennis with LLMs": you iterate with them like peers, making products feel magical rather than hoping a one-shot connects with users.