Long-Context Models


Long-context models are LLM architectures and techniques designed to process context windows far beyond the standard 4K–128K token range, scaling to millions or even hundreds of millions of tokens. The core challenge: standard self-attention has quadratic complexity — doubling the context length quadruples the computation — making naive scaling to long contexts prohibitively expensive.
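The quadratic scaling can be made concrete with a back-of-the-envelope FLOP count (a rough sketch with constants and non-attention terms omitted; `d_model = 4096` is an illustrative choice):

```python
def attention_flops(seq_len: int, d_model: int = 4096) -> int:
    """Rough FLOP count for one self-attention layer: the QK^T product and
    the softmax(.)V product are each ~2 * seq_len^2 * d_model multiply-adds,
    so total cost grows with the square of the sequence length."""
    return 2 * 2 * seq_len**2 * d_model

base = attention_flops(4_096)
doubled = attention_flops(8_192)
print(doubled / base)  # 4.0 — doubling the context quadruples the attention cost
```

At 100M tokens this quadratic term dwarfs everything else in the forward pass, which is why naive scaling is off the table.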

Why Long Context Matters

Many real-world tasks require reasoning over large amounts of information:

  • Multi-document analysis — legal review, research synthesis, financial analysis across many filings
  • Persistent agent memory — AI agents that maintain state across sessions or long task sequences
  • Codebase understanding — large codebases can exceed millions of tokens
  • Digital twins and personas — maintaining consistent character knowledge over extended interactions

Current solutions (primarily RAG — Retrieval-Augmented Generation) retrieve relevant snippets from a large corpus and inject them into a standard-length context. This works but has fundamental limitations: the retrieval pipeline is separate from the model, not jointly optimized, and relies on surface-level similarity rather than deep understanding of relevance.
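The separate-pipeline problem is easiest to see in a minimal sketch. Here retrieval uses toy bag-of-words vectors and cosine similarity (real systems use a learned embedding model, but one still trained apart from the generator); all names and documents are illustrative:

```python
# Minimal RAG sketch: retrieval is a standalone similarity search bolted onto
# a fixed-context model — no gradient ever flows from generation back into it.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

corpus = [
    "The merger closed in Q3 after regulatory approval.",
    "Quarterly revenue grew 12 percent year over year.",
    "The office cafeteria menu changed on Monday.",
]
snippets = retrieve("when did the merger close", corpus)
# Retrieved snippets are simply injected into a standard-length prompt:
prompt = "Context:\n" + "\n".join(snippets) + "\n\nQuestion: when did the merger close"
```

Surface-level token overlap drives the ranking here; nothing in the pipeline knows what the downstream model actually finds useful.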

Memory Paradigms

The Memory Sparse Attention (MSA) paper categorizes approaches to LLM memory into three paradigms:

Parameter-Based Memory

Internalize knowledge by updating model weights (LoRA, continual pretraining) or using test-time training. Offers deep semantic integration but suffers from catastrophic forgetting and limited capacity — the model can only “remember” what fits in its parameters.
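A LoRA-style update can be sketched in a few lines: knowledge is "stored" by adding a trainable rank-r delta B·A to a frozen weight matrix. Shapes and the alpha/r scaling follow the common LoRA convention; dimensions here are illustrative, and no training loop is shown:

```python
# LoRA-style low-rank weight update: memory capacity is bounded by the
# parameters of A and B, which is the "limited capacity" caveat above.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                 # B starts at zero: delta is zero at init

def forward(x):
    # Full forward pass is W @ x plus the scaled low-rank correction.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Before any training, the adapted model exactly matches the base model.
assert np.allclose(forward(x), W @ x)
```

Only A and B (here 4×64 + 64×4 values) get updated, versus 64×64 for full fine-tuning; new facts must be squeezed into that budget.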

External Storage (RAG)

Retrieve relevant documents from a separate database and inject them into the prompt. Scales to arbitrary corpus sizes and avoids forgetting, but the retrieval pipeline is not end-to-end trainable — it uses separate embeddings that may not align with the model’s internal representation space, creating a precision ceiling.

Latent State-Based Memory

Construct memory directly from the model’s internal representations (KV caches, hidden states). This operates in the model’s native representation space, offering high semantic fidelity. But standard KV-based approaches have prohibitive computational costs at scale, while linear attention variants (RWKV, DeltaNet) compress history into fixed-size states that inevitably lose information.
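The fixed-size-state trade-off can be sketched with a toy linear-attention recurrence (in the spirit of the RWKV/DeltaNet family, heavily simplified): the whole history is folded into one d×d matrix, so memory never grows with context length, but distinct histories can collapse to the same state.

```python
# Toy linear-attention state: S accumulates outer products of keys and values.
# Reads are a single matrix product — O(d^2) per token regardless of history.
import numpy as np

d = 8
S = np.zeros((d, d))           # fixed-size state, independent of context length

def step(S, k, v):
    return S + np.outer(k, v)  # fold this token's key-value pair into the state

def read(S, q):
    return q @ S               # query the compressed history

rng = np.random.default_rng(1)
for _ in range(1000):          # after 1000 tokens the state is still just d x d
    S = step(S, rng.standard_normal(d), rng.standard_normal(d))
print(S.shape)  # (8, 8)
```

Compare with a KV cache, which keeps all 1000 keys and values exactly: lossless, but with cost growing linearly in storage and quadratically in attention compute.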

Memory Sparse Attention

MSA represents a synthesis of these paradigms. Key innovations:

  • Sparse attention with learned routing. A trainable router selects the top-k most relevant documents from a memory bank at each attention layer. This is end-to-end differentiable — the retrieval is jointly optimized with generation.
  • Document-wise positional encoding. Each document gets independent positional encodings (resetting to position 0), allowing the model to train on 64K contexts but extrapolate to 100M tokens at inference.
  • KV cache compression. Documents are compressed via chunk-wise mean pooling, reducing storage to manageable levels (routing keys on GPU, content KV on CPU).
  • Memory Interleave. For multi-hop reasoning requiring information from scattered documents, the model iteratively retrieves and integrates evidence rather than relying on a single retrieval pass.
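The routing and compression ideas above can be sketched together. This is a schematic of the mechanism as described, not the paper's implementation: the router here is a plain dot-product scorer, pooling is over whole documents rather than chunks, and all dimensions are illustrative.

```python
# Sparse attention over a memory bank: each document's keys are mean-pooled
# into one cheap routing key; the router scores all routing keys against the
# query; full attention runs only over the top-k selected documents' KV.
import numpy as np

rng = np.random.default_rng(2)
d, n_docs, doc_len, k = 16, 10, 32, 2

# Per-document KV caches (in MSA's layout, content KV could live on CPU).
doc_keys = [rng.standard_normal((doc_len, d)) for _ in range(n_docs)]
doc_vals = [rng.standard_normal((doc_len, d)) for _ in range(n_docs)]

# Compressed routing keys via mean pooling — small enough to keep on GPU.
routing_keys = np.stack([K.mean(axis=0) for K in doc_keys])  # (n_docs, d)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sparse_attend(q):
    scores = routing_keys @ q          # route: score every document cheaply
    top = np.argsort(scores)[-k:]      # select the top-k documents
    K = np.concatenate([doc_keys[i] for i in top])
    V = np.concatenate([doc_vals[i] for i in top])
    w = softmax(K @ q / np.sqrt(d))    # full attention over the selected KV only
    return w @ V

out = sparse_attend(rng.standard_normal(d))
print(out.shape)  # (16,)
```

Per query, attention touches k·doc_len = 64 positions instead of all 320, and the gap widens as the bank grows; in the real system the router is trained end-to-end, so selection quality improves with generation quality rather than being fixed in advance.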

The result: less than 9% performance degradation when scaling from 16K to 100M tokens, where frontier models and RAG systems collapse.

Connection to Inference Optimization

Long-context capabilities directly relate to LLM Inference Optimization: processing 100M tokens requires efficient attention kernels, KV cache management, and memory hierarchy optimization. The techniques developed for long-context models (sparse attention, KV compression, tiered storage) are broadly applicable to inference efficiency challenges.
