Venue: Preprint
Year: 2026

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

Authors: Huzama Ahmad, Se-Young Yun

Keywords: long-context, efficiency, sparse-attention

Abstract

Long contexts have become standard in pretrained LLMs, yet they remain expensive to run: prefill compute grows quadratically with sequence length, and every decode step re-reads a key-value cache that grows linearly with it. Sparse attention cuts these costs by attending only to a relevant subset of past tokens, but selecting that subset is itself expensive. We present SpotAttention, a lightweight selector that attaches to a frozen pretrained transformer and learns by KL distillation to estimate its attention distribution. The selector picks the top-K keys each query attends to, and because its estimate is a calibrated distribution, a dual top-p rule reads the per-query, per-layer budget directly from it. Across Qwen3 (dense, 4B–32B) and Qwen3.5 (hybrid linear/full attention, 4B–9B), SpotAttention matches dense accuracy at contexts up to 128K tokens, eight times the training length. Decode at L=128K runs 3.9× faster than FlashAttention and 1.8× faster than Twilight, the strongest training-free baseline. Quantizing the selector's K-cache to INT4 or FP4 microscale shrinks it 3.5× at no accuracy cost.
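
As a rough illustration of how a per-query budget can be read from an estimated attention distribution, the sketch below applies a plain top-p rule: for each query, keep the smallest set of keys whose estimated probability mass reaches a threshold p. The abstract does not spell out the "dual" variant, so this is a single-threshold version only. The function names (`top_p_budget`, `select_keys`), the threshold value, and the toy probabilities are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def top_p_budget(probs: np.ndarray, p: float) -> np.ndarray:
    """Per-query budget: smallest k whose top-k estimated mass is >= p.

    probs has shape (num_queries, num_keys); each row is assumed to be a
    normalized attention estimate (hypothetical selector output).
    """
    sorted_desc = -np.sort(-probs, axis=-1)      # each row, largest first
    cum_mass = np.cumsum(sorted_desc, axis=-1)   # running mass of the top-k keys
    # Count the prefixes that still fall short of p, then add the key that crosses it.
    budget = (cum_mass < p).sum(axis=-1) + 1
    return np.minimum(budget, probs.shape[-1])   # never exceed the number of keys


def select_keys(probs: np.ndarray, p: float) -> list:
    """Indices of the keys kept for each query under the top-p rule."""
    order = np.argsort(-probs, axis=-1, kind="stable")  # keys by descending mass
    budget = top_p_budget(probs, p)
    return [order[i, : budget[i]] for i in range(probs.shape[0])]


# Toy example: one peaked query and one flat query over four keys.
toy_probs = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])
print(top_p_budget(toy_probs, p=0.8))  # -> [2 4]
```

The peaked query needs only two keys to cover 80% of its estimated mass, while the flat query needs all four, which is the sense in which the budget adapts per query. This only works as intended if the estimate is well calibrated, which is the property the abstract attributes to the distilled selector.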

BibTeX

@misc{ahmad2026spotattention,
  title        = {SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers},
  author       = {Ahmad, Huzama and Yun, Se-Young},
  howpublished = {Preprint},
  year         = {2026}
}