Source: MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Published: March 29, 2026
Authors: Yu Chen, Runkai Chen, Sheng Yi
Source: arxiv.org

Summary

This paper from Evermind/Shanda Group presents Memory Sparse Attention (MSA), an end-to-end trainable architecture that scales LLM memory to 100 million tokens — approaching the estimated capacity of human lifelong memory (~200–300M tokens). MSA achieves this by decoupling memory retrieval from reasoning: documents are pre-encoded into compressed key-value representations, and a learned sparse attention mechanism selects the most relevant documents for each query. Performance degrades by less than 9% when scaling from 16K to 100M tokens, whereas existing long-context models and RAG systems degrade catastrophically.
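The retrieval/reasoning split described above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's exact design: the dimensions, the pooling-then-averaging routing key, and all names are assumptions. The point is the shape of the computation — cheap routing keys score documents, and only the top-k documents' full KV pairs participate in attention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8 pre-encoded "documents", each with 32 per-token keys.
d_model, n_docs, doc_len = 16, 8, 32
docs = [rng.standard_normal((doc_len, d_model)) for _ in range(n_docs)]

def routing_key(doc_keys, chunk=4):
    """Chunk-wise mean pooling, then a doc-level mean: one cheap routing
    key per document (a simplification of the paper's compressed KV)."""
    n = (len(doc_keys) // chunk) * chunk
    pooled = doc_keys[:n].reshape(-1, chunk, doc_keys.shape[-1]).mean(axis=1)
    return pooled.mean(axis=0)

def select_top_k(query, docs, k=2):
    """Sparse top-k document selection by routing-key similarity."""
    keys = np.stack([routing_key(d) for d in docs])
    scores = keys @ query
    return np.argsort(scores)[-k:][::-1]  # best-first indices

query = rng.standard_normal(d_model)
picked = select_top_k(query, docs, k=2)
# Only the selected documents' full KV pairs need to be attended over
# (and, in MSA's tiered design, loaded from CPU to GPU on demand),
# so per-query cost scales with k * doc_len, not total corpus tokens.
```

In the real system the router projectors producing these scores are trained jointly with the language model, which is what closes the optimization gap that separate RAG retrievers leave open.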

Key Claims

  • End-to-end trainable retrieval. Unlike RAG systems that use separate, non-differentiable retrieval pipelines, MSA integrates retrieval into the attention mechanism itself. The router projectors that select relevant documents are trained jointly with the language model, eliminating the optimization gap between retrieval and generation.
  • Linear complexity at scale. Through KV cache compression (chunk-wise mean pooling) and sparse top-k document selection, MSA achieves near-linear time complexity in both training and inference. This makes 100M-token contexts computationally tractable.
  • Document-wise RoPE enables extrapolation. By assigning independent positional encodings to each document (resetting position IDs to 0), the model can train on 64K-token contexts but extrapolate to 100M tokens at inference. Standard global positional encodings would fail at context lengths far beyond training.
  • 100M tokens on 2 GPUs. A tiered memory architecture stores routing keys on GPU and content KV pairs on CPU, with on-demand loading. This enables inference over 100M tokens on a single 2xA800 node.
  • Memory Interleave for multi-hop reasoning. An adaptive mechanism alternates between retrieval and generation, allowing the model to iteratively gather evidence from scattered documents. Unlike single-shot retrieval, this handles queries that require combining information from multiple sources.
  • Outperforms RAG and frontier models. MSA surpasses SOTA RAG systems (which rely on separate retrieval + reranking pipelines) and frontier LLMs (Qwen2.5-14B-1M, Qwen3) on long-context QA benchmarks. On Needle-In-A-Haystack tests, MSA maintains performance where other approaches collapse.
  • Three paradigms of LLM memory compared. The paper categorizes existing approaches: parameter-based (LoRA, continual pretraining — limited capacity, catastrophic forgetting), external storage (RAG — not end-to-end, precision ceiling), and latent state (linear attention — fixed capacity, lossy). MSA combines the benefits: end-to-end training, scalable capacity, no catastrophic forgetting.
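At the position-ID level, the document-wise RoPE idea above reduces to restarting every document's positions at 0, so no position ever exceeds what was seen during 64K-token training. A minimal sketch (the helper name is hypothetical):

```python
def document_position_ids(doc_lengths):
    """Document-wise positions: each document restarts at 0, so position
    IDs stay within the per-document range seen at training time even
    when the total context grows to millions of tokens."""
    return [list(range(n)) for n in doc_lengths]

# Three documents of lengths 3, 2, and 4:
# document_position_ids([3, 2, 4]) -> [[0, 1, 2], [0, 1], [0, 1, 2, 3]]
```

A global encoding would instead assign positions 0..8 across the concatenation, pushing inference far outside the trained positional range.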
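The Memory Interleave mechanism can likewise be sketched as an adaptive alternation between retrieval and generation. This is a hypothetical external-loop rendering for intuition only — in MSA the alternation is learned inside the model, and the `retrieve`/`generate` interfaces here are invented:

```python
def answer_with_interleave(query, retrieve, generate, max_rounds=3):
    """Alternate retrieval and generation so multi-hop questions can
    chain evidence scattered across documents (sketch, not MSA's API)."""
    evidence, draft = [], ""
    for _ in range(max_rounds):
        evidence += retrieve(query, evidence)   # gather more documents
        draft, needs_more = generate(query, evidence)
        if not needs_more:                      # model decides it is done
            return draft
    return draft
```

Single-shot retrieval would commit to one evidence set up front; the loop lets the second retrieval be conditioned on what the first round surfaced.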

Relevance and Implications

MSA represents a potential paradigm shift in how LLMs handle long-term memory. Rather than relying on external retrieval systems (RAG) or fixed-size context windows, MSA integrates memory directly into the attention mechanism while maintaining the ability to scale to arbitrary lengths. This has implications for long-context applications including digital twins with stable personas, multi-session agent reasoning, and large-corpus summarization. The architecture’s compatibility with mainstream Transformer models means it could be integrated into existing LLMs without fundamental architectural changes.
