Separate Evaluation, Detection, and Translation to Ground LLM Outputs

LLMs hallucinate in chess because they predict tokens from language data rather than calculating positions; they falter once the opening is over, as seen in Kaggle's LLM chess tournament, where models like GPT lost badly. Bridge this with a pipeline: run Stockfish (the top classical engine) across the full game for best-move evaluations and lines; extract structured context via detectors for tactics (forks, pins, skewers) and positional themes (doubled pawns, structural weaknesses); and add Maia (the University of Toronto neural engine) for human-like move probabilities at a given rating (e.g., 1500 Elo), revealing whether a Stockfish-optimal move is hard for a human to spot (low probability). Feed this JSON to an LLM that acts solely as a translator, preventing reasoning errors. The result is nuanced commentary like "f5 threatens to trap your queen (Bg5 line), but capturing the pawn defends and escapes," explaining threats, defenses, and plans instead of just "bad move."
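A minimal sketch of the JSON context such a pipeline might hand to the LLM-as-translator. The field names, schema, and values here are illustrative assumptions, not the talk's actual format; in practice the engine line, evaluation, detector hits, and Maia probabilities would come from the upstream tools.

```python
import json

def build_move_context(fen, played, best_line, eval_cp, tactics, themes, maia_probs):
    """Assemble structured context for one move. Inputs are assumed to come
    from upstream tools: Stockfish (best_line, eval_cp), rule-based
    detectors (tactics, themes), and Maia (maia_probs keyed by rating)."""
    return {
        "position_fen": fen,
        "played_move": played,
        "engine": {"best_line": best_line, "eval_centipawns": eval_cp},
        "tactics": tactics,              # e.g. ["fork", "pin"]
        "themes": themes,                # e.g. ["doubled_pawns"]
        "human_likelihood": maia_probs,  # e.g. {"1500": 0.04} = hard to spot
    }

context = build_move_context(
    fen="rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2",
    played="f4",
    best_line=["Nf3", "Nc6", "Bc4"],
    eval_cp=35,
    tactics=[],
    themes=[],
    maia_probs={"1500": 0.12},
)
prompt_context = json.dumps(context)  # passed verbatim to the LLM as translator
```

Keeping the LLM's job to translating this payload into prose, rather than computing any of it, is what prevents the reasoning errors described above.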

This grounds explanations in facts, enabling brilliant-move detection (e.g., a knight sacrifice that forces checkmate) with details on why it works, plus accurate game-phase labeling, rating-aware commentary, and opening-depth insights for coaching.
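One way such grounding could surface brilliancies is to cross-reference the two engines: a move Stockfish rates best but Maia assigns low probability at the player's rating is a strong candidate, especially when it gives up material for a decisive gain. A hedged sketch; the thresholds and the heuristic itself are invented for illustration, not the talk's criteria:

```python
def is_brilliant(is_engine_best: bool,
                 eval_gain_cp: int,
                 maia_prob: float,
                 is_sacrifice: bool,
                 prob_threshold: float = 0.05,
                 gain_threshold_cp: int = 150) -> bool:
    """Flag a move as 'brilliant': engine-best, a material sacrifice,
    hard for a human at this rating to find (low Maia probability),
    and decisively better than the alternatives (large eval gain)."""
    return (is_engine_best
            and is_sacrifice
            and maia_prob < prob_threshold
            and eval_gain_cp >= gain_threshold_cp)

# A knight sac forcing mate: best move, huge eval swing, rarely found at 1500.
print(is_brilliant(True, 900, 0.02, True))   # True
# A routine best move everyone plays is not brilliant.
print(is_brilliant(True, 40, 0.30, False))   # False
```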

Close Feedback Loops with Autonomous Agents for Rapid Iteration

When a user flags bad commentary in-app, the system posts the position, PGN, and generated output to Slack while also injecting the event into a running Claude Code session via Channels (Anthropic's research-preview MCP for event injection, similar to OpenClaw). Claude invokes a custom "commentary triage skill": it analyzes the position with Stockfish and the detectors, modifies prompts or adds detectors (e.g., a new tactic), regenerates the commentary, self-verifies, and can query Slack ("What specifically feels wrong?"). The operator approves from a phone, triggering a PR merge. A live demo showed Claude autonomously dismissing a false flag ("nothing wrong"). Humans stay in the loop at mobile scale, and the massive initial JSON context is pruned iteratively for quality gains.
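The flag itself can be a small structured event that serves both destinations: the Slack channel for humans and the agent session for triage. A sketch of what the app might emit; the field names and the `requested_action` key are assumptions for illustration:

```python
import json

def build_flag_event(game_pgn: str, fen: str, commentary: str, user_note: str) -> str:
    """Package a user's 'bad commentary' flag for posting to Slack and
    injecting into the agent session. Schema is illustrative only."""
    event = {
        "type": "commentary_flag",
        "position_fen": fen,
        "pgn": game_pgn,
        "generated_commentary": commentary,
        "user_note": user_note,
        "requested_action": "triage",  # the agent's triage skill keys off this
    }
    return json.dumps(event)

payload = build_flag_event(
    game_pgn="1. e4 e5 2. Nf3 Nc6",
    fen="r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3",
    commentary="Nc6 is a serious mistake.",
    user_note="Nc6 looks like a normal developing move here.",
)
```

Because the event carries the full position and game, the triage skill can re-run Stockfish and the detectors from scratch rather than trusting the original commentary's context.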

Balance Sub-3s Latency and Quality via Model Evals and Trade-offs

Consumer apps demand instant post-game reviews (users cycle through moves one by one), so target sub-3s end-to-end latency: Gemini 1.5 Flash delivers ~1s time-to-first-token and ~3s average total. Avoid reasoning models' unpredictable delays (no spinning "coach is thinking" screens). Quality is measured with 16 thematic evals (tactics, blunders, anti-hallucination) built from real games, using an LLM-as-judge on OpenRouter for quick model swaps (Gemini Flash: 75% pass; Claude with thinking: 60% but slower; GPT-4o mini: lower accuracy, moderate latency). Human experts (the speakers, strong players) final-check outputs against manual calculation. Future: a deeper chat coach can tolerate the latency of reasoning models.
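The model-selection trade-off above can be sketched as a simple harness: score each candidate on eval pass rate and average latency, filter by the latency budget, then pick the best survivor. The pass rates restate the talk's reported figures; the latencies and the selection logic are illustrative assumptions:

```python
def pick_model(results, pass_floor=0.7, latency_budget_s=3.0):
    """results: {model: (pass_rate, avg_latency_s)} from LLM-as-judge evals.
    Keep models meeting both the quality floor and the latency budget,
    then return the highest pass rate among survivors (None if empty)."""
    eligible = {m: (p, t) for m, (p, t) in results.items()
                if p >= pass_floor and t <= latency_budget_s}
    if not eligible:
        return None
    return max(eligible, key=lambda m: eligible[m][0])

# Pass rates from the talk; latency figures here are assumptions.
results = {
    "gemini-1.5-flash": (0.75, 3.0),
    "claude-thinking":  (0.60, 8.0),   # better on hard cases, too slow
    "gpt-4o-mini":      (0.55, 4.0),
}
print(pick_model(results))  # gemini-1.5-flash
```

A hard latency gate like this encodes the product decision directly: a slower model never wins the post-game-review slot no matter its pass rate, while a future chat coach could simply raise `latency_budget_s`.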

Key learnings: isolate data pipelines (Stockfish, detectors) from LLM generation for speed and reliability; build clear context extractors (start big, then prune); lean on domain-expert evals and partners; and agent loops enable bus-to-PR iteration (fixes approved from a phone, merged as PRs).