The Evolution of Context Handling

As LLM context windows have expanded—from GPT-3's 1,000 tokens to Gemini 1.5 Pro's 2 million tokens—the strategy for providing external knowledge has shifted. While Retrieval Augmented Generation (RAG) remains a standard for massive datasets, developers can now choose between direct long-context injection and Cache Augmented Generation (CAG).

Long Context: Simplicity vs. Cost

Long context involves stuffing all relevant documents directly into the prompt.

  • Pros: Extremely simple to implement; no retrieval infrastructure required; eliminates the risk of missing relevant information due to poor retrieval.
  • Cons: High cost, as every query incurs the full token processing fee; increased latency; and the "lost in the middle" effect, where models struggle to retrieve information buried in the center of a long prompt.
  • Best Use Case: One-off tasks, such as analyzing a single document or answering a few questions about a specific set of data that won't be queried again.

Cache Augmented Generation (CAG)

CAG optimizes performance by leveraging the model's Key Value (KV) cache—the internal representation of how a model encodes text. Instead of recomputing this cache for every query, CAG performs a one-time pre-computation.

  • The Three-Phase Process:
    1. Knowledge Preparation: Formatting documents to fit the context window.
    2. Pre-computation: Processing the documents once to generate and persist the KV cache.
    3. Inference: Loading the pre-computed cache for each query, which can result in 10x to 40x speedups compared to full reprocessing.
  • Limitations: The knowledge base must fit within the context window, and any change to the source documents requires a full recomputation of the cache, making it best suited for stable data (e.g., company policy bots).

Prompt Caching as a Managed Service

Modern LLM providers now offer "prompt caching" as a built-in API feature, effectively acting as CAG-as-a-Service. This allows developers to send a long system prompt once and reuse the cached prefix for subsequent requests. This approach significantly lowers costs, often providing up to a 90% discount on cached token processing compared to fresh requests.