SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers
Authors: Huzama Ahmad, Se-Young Yun
A plug-in selector that matches dense accuracy out to 128K tokens while decoding 3.9× faster than FlashAttention.
Papers, projects, and the systems behind them.
Stealth — details after publication
Authors: Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun
Budget-aware speculative decoding with tree-structured diffusion drafting — up to 6.61× faster than autoregressive decoding.
Authors: Namgyu Ho*, Huzama Ahmad*, Woosung Koh*, Cicero Nogueira dos Santos, Tal Schuster, Se-Young Yun (* equal contribution)
A prompting protocol that lets a model declare where it will attend — cutting decoding attention cost up to 53.1% at near-zero accuracy loss.
Traces deep-Transformer layer redundancy to a structural gradient bottleneck — yielding a better pruning rule and a tapered architecture that cuts latency 8.6% and lifts throughput 9.4%.
Authors: Zahra Bayramli, Ayhan Suleymanzade, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, Alice Oh
CULTDIFF: a ten-country benchmark exposing where text-to-image diffusion models miss cultural specificity, plus a metric that tracks human judgment.
Authors: Jun Seong Kim, Kyaw Ye Thu, Javad Ismayilzada, Junyeong Park, Eunsu Kim, Huzama Ahmad, Na Min An, James Thorne, Alice Oh
MIXCUBE: a cross-cultural benchmark showing that multimodal LLMs misjudge cultural entities depending on a person's ethnicity, with accuracy gaps of up to 58% in low-resource cultures.
Pivot-Assisted Consensus: multi-pivot prompting that lifts LLM accuracy on low-resource languages, with analysis of why linguistic and cultural fit matters.
A self-guided RL framework that improves chain-of-thought arithmetic reasoning using self-logicality as the reward — no human-graded supervision.
Built and operate a 9-node, 68-GPU Slurm cluster for 50+ researchers — identity, networking, storage, and strict per-job GPU isolation.