Huzama Ahmad

Ph.D. Candidate, KAIST AI.

A language model uses only a little of what it reads, yet pays for all of it. I work on cutting that cost, which largely comes down to knowing where a model should look. My recent work plugs sparse attention into existing models, lets a model control its own attention span, and prunes context by content rather than position.

Selected work

Project Status: Active

Content-Aware Sparsity

Dates: Jul 2026 – Present

Stealth — details after publication
Venue: Preprint
Year: 2026

Large Language Models Can Control Their Own Attention Span

Authors: Namgyu Ho*, Huzama Ahmad*, Woosung Koh*, Se-Young Yun, Tal Schuster, Cicero Nogueira dos Santos (* equal contribution)

A prompting protocol that lets a model declare where it will attend, cutting decoding attention cost by up to 53.1% at near-zero accuracy loss.
Type: preprint
Venue: Under Review
Year: 2026

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

Authors: Huzama Ahmad, Se-Young Yun

A plug-in selector that matches dense accuracy at long context while decoding 3.9× faster than FlashAttention.
