Content-Aware Sparsity
Dates:
Stealth — details after publication

Ph.D. Candidate, KAIST AI.
A language model uses only a little of what it reads, yet pays for all of it. I work on cutting that cost, and most of the problem is knowing where a model should look. My recent work plugs sparse attention into existing models, lets a model control its own attention span, and prunes context by content rather than position.
Authors: Namgyu Ho*, Huzama Ahmad*, Woosung Koh*, Se-Young Yun, Tal Schuster, Cicero Nogueira dos Santos (* equal contribution)
A prompting protocol that lets a model declare where it will attend, cutting decoding attention cost by up to 53.1% at near-zero accuracy loss.
Authors: Huzama Ahmad, Se-Young Yun
A plug-in selector that matches dense accuracy at long context while decoding 3.9× faster than FlashAttention.