SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers
Authors: Huzama Ahmad, Se-Young Yun
A plug-in selector that matches dense accuracy out to 128K tokens while decoding 3.9× faster than FlashAttention.

Ph.D. Candidate, KAIST AI.
I work on efficient language modeling — the architectures and systems that make large models cheaper to run at long context. Advised by Se-Young Yun in the OSI Lab. Recent work spans speculative decoding, sparse attention, and letting models control their own attention span.
Authors: Namgyu Ho* (equal contribution), Huzama Ahmad* (equal contribution), Woosung Koh* (equal contribution), Cicero Nogueira dos Santos, Tal Schuster, Se-Young Yun
A prompting protocol that lets a model declare where it will attend, cutting decoding attention cost by up to 53.1% at near-zero accuracy loss.