Centroid Similarity Powers Zero-Cost Classification

NadirClaw classifies prompts locally, before any LLM call, by embedding them with sentence-transformers/all-MiniLM-L6-v2 (384-dim, normalized vectors) and computing cosine similarity to two fixed centroids shipped with the package, 'simple_centroid.npy' and 'complex_centroid.npy'. The routing score derives from sim_complex - sim_simple: scores >= 0.5 route to the complex tier (Pro), scores below 0.5 to the simple tier (Flash). If confidence falls below the default 0.06 threshold, the prompt is forced to complex for safety. The cosine between the two centroids is low (typically ~0.15-0.3), which is what makes the separation workable.
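
A minimal sketch of the scoring step, taking the description above literally (raw sim_complex - sim_simple compared against 0.5); how nadirclaw internally maps that difference onto its reported score is an assumption here:

    from pathlib import Path

    import numpy as np
    from sentence_transformers import SentenceTransformer

    import nadirclaw

    # Centroids ship inside the installed package.
    PKG = Path(nadirclaw.__file__).parent
    SIMPLE_C = np.load(PKG / "simple_centroid.npy")    # (384,), unit norm
    COMPLEX_C = np.load(PKG / "complex_centroid.npy")  # (384,), unit norm

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def route(prompt: str) -> str:
        # Normalized embeddings make the dot product equal cosine similarity.
        vec = model.encode(prompt, normalize_embeddings=True)
        score = float(vec @ COMPLEX_C) - float(vec @ SIMPLE_C)
        return "complex" if score >= 0.5 else "simple"

    print(route("What is 2+2?"))  # expected: simple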

Example on 10 prompts:

  • Simple (Flash): 'What is 2+2?', 'Format this JSON', score=0.00-0.28, conf=0.85-0.99
  • Complex (Pro): 'Design distributed pipeline 50k req/s', 'Prove scheduling optimal', score=0.52-0.78, conf=0.12-0.92

Modifier detection overrides the score: agentic requests (a tools= field), vision inputs (image_url parts), and explicit reasoning cues ('step by step') all trigger Pro regardless.
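
An illustrative modifier check; the request shape and exact cue strings are assumptions based on the overrides listed above, not nadirclaw's actual detection code:

    def forced_complex(request: dict) -> bool:
        if request.get("tools"):  # agentic: tool definitions present
            return True
        for msg in request.get("messages", []):
            content = msg.get("content")
            if isinstance(content, list):  # multimodal content parts
                if any(part.get("type") == "image_url" for part in content):
                    return True  # vision input
            elif isinstance(content, str) and "step by step" in content.lower():
                return True  # explicit reasoning cue
        return False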

To replicate: pip install nadirclaw sentence-transformers, then nadirclaw classify --format json "prompt" returns output like {'tier': 'simple', 'score': 0.23, 'confidence': 0.91, 'model': 'gemini-2.5-flash'}.
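
The same check can be driven from Python by shelling out to the CLI and parsing its JSON, using only the command and output fields shown above:

    import json
    import subprocess

    out = subprocess.run(
        ["nadirclaw", "classify", "--format", "json", "What is 2+2?"],
        capture_output=True, text=True, check=True,
    )
    result = json.loads(out.stdout)
    print(result["tier"], result["model"])  # e.g. simple gemini-2.5-flash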

Visualize and Tune Routing Boundaries

Embed the prompts and plot sim_simple (x) against sim_complex (y): the simple cluster sits below the dashed y=x line (Flash), the complex cluster above it (Pro). Annotate points 1-10 for inspection; blue marks expected-simple prompts, red expected-complex.
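
A plotting sketch under those conventions; PROMPTS is a hypothetical labeled set (True = expected complex), and the model/centroid loading repeats the earlier snippet so this runs standalone:

    from pathlib import Path

    import matplotlib.pyplot as plt
    import numpy as np
    from sentence_transformers import SentenceTransformer

    import nadirclaw

    PKG = Path(nadirclaw.__file__).parent
    SIMPLE_C = np.load(PKG / "simple_centroid.npy")
    COMPLEX_C = np.load(PKG / "complex_centroid.npy")
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    PROMPTS = [("What is 2+2?", False),
               ("Design distributed pipeline 50k req/s", True)]
    for i, (text, is_complex) in enumerate(PROMPTS, start=1):
        v = model.encode(text, normalize_embeddings=True)
        x, y = float(v @ SIMPLE_C), float(v @ COMPLEX_C)
        plt.scatter(x, y, c="red" if is_complex else "blue")
        plt.annotate(str(i), (x, y))
    plt.axline((0, 0), slope=1, linestyle="--", color="gray")  # y = x boundary
    plt.xlabel("sim_simple")
    plt.ylabel("sim_complex")
    plt.show()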

Sort by score for a bar visualization, with bars scaled over 0-1 (10 characters wide here): '2+2' at score=0.02 renders |░░░░░░░░░░|, 'Prove optimal' at 0.78 renders |████████░░|. Sweeping the confidence threshold: at 0.02, 2/10 prompts are force-escalated to complex on low confidence; at 0.30, 8/10; the natural split by score>=0.5 alone is 5/10 complex. Raising the threshold escalates more uncertain prompts, trading cost for reliability.
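
A sketch of both the bars and the sweep; scored holds hypothetical (prompt, score, confidence) triples of the kind classify emits, and the 10-character bar width is arbitrary:

    def bar(score: float, width: int = 10) -> str:
        filled = round(score * width)
        return "|" + "█" * filled + "░" * (width - filled) + "|"

    scored = [
        ("What is 2+2?", 0.02, 0.97),
        ("Prove scheduling optimal", 0.78, 0.88),
    ]
    # Bars, sorted by routing score.
    for prompt, score, conf in sorted(scored, key=lambda t: t[1]):
        print(f"{prompt:<28} {score:.2f} {bar(score)}")

    # Confidence-threshold sweep: count prompts that would be escalated.
    for threshold in (0.02, 0.06, 0.30):
        forced = sum(conf < threshold for *_, conf in scored)
        print(f"conf < {threshold}: {forced}/{len(scored)} escalated to Pro")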

Load the centroids directly for inspection, as below; both are unit-norm vectors of shape (384,).
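
A standalone version with the imports filled in (shapes, norms, and the inter-centroid cosine range as stated earlier):

    from pathlib import Path

    import numpy as np

    import nadirclaw

    PKG = Path(nadirclaw.__file__).parent
    SIMPLE_C = np.load(PKG / "simple_centroid.npy")
    COMPLEX_C = np.load(PKG / "complex_centroid.npy")
    assert SIMPLE_C.shape == COMPLEX_C.shape == (384,)
    print(np.linalg.norm(SIMPLE_C), np.linalg.norm(COMPLEX_C))  # both ~1.0
    print(float(SIMPLE_C @ COMPLEX_C))  # inter-centroid cosine, ~0.15-0.3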

Proxy Deployment Yields Measurable Cost Savings

Run nadirclaw serve --verbose (listens on port 8856; set env GEMINI_API_KEY, NADIRCLAW_SIMPLE_MODEL=gemini-2.5-flash, and NADIRCLAW_COMPLEX_MODEL=gemini-2.5-pro). The proxy is OpenAI-compatible: client = OpenAI(base_url='http://localhost:8856/v1', api_key='local'), then client.chat.completions.create(model='auto', ...).
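
Pointing the official openai client at the proxy, expanding the snippet above; the prompt is just an example:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8856/v1", api_key="local")
    resp = client.chat.completions.create(
        model="auto",  # the proxy picks Flash or Pro per prompt
        messages=[{"role": "user",
                   "content": "Write a docstring for a max() helper."}],
    )
    print(resp.choices[0].message.content)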

Side by side: a simple docstring request routes to Flash (0.8s); a complex refactor routes to Pro (2.1s). A mixed 10-prompt workload split 6/10 Flash ('Capital France?', 'Type hint') and 4/10 Pro ('Design sharded KV', 'B-trees vs LSM'), with latencies of 0.4-2.3s.

Pricing (per 1M tokens): Flash in=$0.30, out=$2.50; Pro in=$1.25, out=$10.00. Example run: routed cost $0.000248 vs always-Pro $0.000842, a 70.5% saving. nadirclaw report parses the JSONL logs into aggregates.
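
Reproducing the savings arithmetic, with hypothetical token counts; prices are per 1M tokens as listed above:

    PRICES = {"flash": (0.30, 2.50), "pro": (1.25, 10.00)}  # ($ in, $ out)

    def cost(model: str, tokens_in: int, tokens_out: int) -> float:
        p_in, p_out = PRICES[model]
        return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

    # A routed run vs. sending every call to Pro.
    calls = [("flash", 40, 20), ("flash", 35, 15), ("pro", 120, 80)]
    routed = sum(cost(m, i, o) for m, i, o in calls)
    always_pro = sum(cost("pro", i, o) for _, i, o in calls)
    print(f"routed=${routed:.6f} always-Pro=${always_pro:.6f} "
          f"savings={(always_pro - routed) / always_pro:.1%}")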

Stop the server by sending SIGTERM to its process group. Trade-offs: local classification takes ~100ms warm and adds zero LLM cost before routing; Flash can miss the rare complex prompt (e.g., proofs), but the confidence threshold catches most of those by escalating; and routing stays transparent through inspectable centroids, unlike black-box routers.