- Venue: Preprint
- Year: 2026
SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers
Authors: Huzama Ahmad, Se-Young Yun
- long-context
- efficiency
- sparse-attention
Abstract
Long contexts have become standard in pretrained LLMs, yet they remain expensive to run: prefill compute grows quadratically with sequence length, and every decode step re-reads a key-value cache that grows linearly with it. Sparse attention cuts these costs by attending only to a relevant subset of past tokens, but selecting that subset is itself expensive. We present SpotAttention, a lightweight selector that attaches to a frozen pretrained transformer and learns by KL distillation to estimate its attention distribution. The selector picks the top-K keys each query attends to, and because its estimate is a calibrated distribution, a dual top-p rule reads the per-query, per-layer budget directly from it. Across Qwen3 (dense, 4B–32B) and Qwen3.5 (hybrid linear/full attention, 4B–9B), SpotAttention matches dense accuracy at contexts up to 128K tokens, eight times the training length. Decode at L=128K runs 3.9× faster than FlashAttention and 1.8× faster than Twilight, the strongest training-free baseline. Quantizing the selector's K-cache to INT4 or FP4 microscale shrinks it 3.5× at no accuracy cost.
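The budget idea in the abstract can be illustrated with a plain top-p (nucleus) rule: given the selector's estimated attention distribution for one query, keep the smallest set of keys whose cumulative estimated mass reaches p. The sketch below is a minimal NumPy illustration under that assumption only. It is not the paper's dual top-p rule, whose details the abstract does not give, and the function name `top_p_budget` and the threshold values are hypothetical.

```python
import numpy as np

def top_p_budget(probs, p=0.9):
    """Return indices of the smallest key set whose estimated mass reaches p.

    Illustrative only: `probs` stands in for the selector's estimated
    attention distribution for a single query (assumed to sum to 1).
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Sort keys by descending estimated attention weight (stable for ties).
    order = np.argsort(-probs, kind="stable")
    cum = np.cumsum(probs[order])
    # Number of keys needed so that cumulative mass is at least p.
    k = int(np.searchsorted(cum, p, side="left")) + 1
    k = min(k, probs.size)  # guard against rounding when p is close to 1
    return order[:k]

# A peaked distribution needs few keys; a flat one needs many.
peaked = top_p_budget([0.97, 0.01, 0.01, 0.01], p=0.9)
flat = top_p_budget([0.25, 0.25, 0.25, 0.25], p=0.9)
print(len(peaked), len(flat))
```

The point of the example is that the budget varies with how concentrated the estimate is, so no fixed K has to be chosen in advance for each query or layer.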
BibTeX
@misc{ahmad2026spotattention,
title = {SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers},
author = {Ahmad, Huzama and Yun, Se-Young},
howpublished = {Preprint},
year = {2026}
}