Large Language Models Can Control Their Own Attention Span

Authors: Namgyu Ho*, Huzama Ahmad*, Woosung Koh*, Cicero Nogueira dos Santos, Tal Schuster, Se-Young Yun

Abstract

LLMs spend most of their attention on a small fraction of the context, yet they must read the entire KV cache at every decoding step to find it. In a 1M-token conversation where the user asks about a detail mentioned briefly in the middle, the model still scans the full history to generate each word of the reply. Existing methods for reducing this cost mostly approximate an attention mask from internal activations, but the approximation still costs O(N) per step. In this paper, we introduce Dynamic Attention Control (DAC), a prompting protocol that has the model declare where it will attend as part of its chain of thought, partitioning generation into three modes: global (full context), focus (a specific region), and local (recent context only). The attention mask is read directly from the model's reasoning, with no auxiliary scorer. This design lets the model modify its own attention function at inference time through the text it generates. Across 15 long-context tasks on off-the-shelf models, DAC reduces decoding attention cost by 15.1% on Qwen-3.6-27B and 53.1% on Gemma-4-31B, with accuracy drops of only 1.41 pp and 0.08 pp respectively, zero-shot. We extend vLLM to support in-place KV cache masking with block-aligned masks compatible with state-of-the-art kernels, including FlashAttention.
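To make the three modes concrete, the following is a minimal sketch of how a declared mode might be turned into a block-aligned keep/skip mask over the KV cache. It is an illustration under stated assumptions, not the paper's implementation: the function name `block_mask`, the block size of 16, the 64-token local window, the half-open `focus` span, and the choice to keep recent context alongside a focus region are all hypothetical.

```python
from typing import List, Optional, Tuple


def block_mask(
    mode: str,
    num_tokens: int,
    block_size: int = 16,                      # assumed KV block size
    focus: Optional[Tuple[int, int]] = None,   # assumed half-open token span
    local_window: int = 64,                    # assumed "recent context" size
) -> List[bool]:
    """Return one keep (True) / skip (False) flag per KV-cache block.

    Hypothetical helper: the mode names come from the abstract, everything
    else is an illustrative assumption.
    """
    num_blocks = (num_tokens + block_size - 1) // block_size  # ceil division

    if mode == "global":
        # Full context: read every block.
        return [True] * num_blocks

    keep = [False] * num_blocks

    # Recent context: keep every block overlapping the last `local_window` tokens.
    local_start = max(0, num_tokens - local_window)
    for b in range(local_start // block_size, num_blocks):
        keep[b] = True

    if mode == "local":
        return keep

    if mode == "focus":
        if focus is None:
            raise ValueError("focus mode needs a (start, end) token span")
        start, end = focus
        if not 0 <= start < end <= num_tokens:
            raise ValueError("focus span must lie inside the context")
        # Round the span outward to block boundaries so the mask stays block-aligned.
        for b in range(start // block_size, (end - 1) // block_size + 1):
            keep[b] = True
        return keep

    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode, span in [("global", None), ("focus", (100, 130)), ("local", None)]:
        mask = block_mask(mode, num_tokens=1000, focus=span)
        print(mode, f"{sum(mask)}/{len(mask)} blocks read")
```

With these assumed settings and a 1000-token context, the global mask reads all 63 blocks, while the focus and local masks read only a handful, which is the source of the per-step savings the abstract describes. Rounding the focus span outward to block boundaries is one simple way to keep a mask block-aligned; the actual alignment rule used with vLLM is not specified here.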

BibTeX

@misc{ho2026span,
  title        = {Large Language Models Can Control Their Own Attention Span},
  author       = {Ho, Namgyu and Ahmad, Huzama and Koh, Woosung and Santos, Cicero Nogueira dos and Schuster, Tal and Yun, Se-Young},
  howpublished = {Preprint},
  year         = {2026}
}
