Workshop

Papers, projects, and the systems behind them.

Project Status: Active

Content-Aware Sparsity

Dates: Jul 2026 – Present

Stealth — details after publication
Venue: Preprint Year: 2026

Large Language Models Can Control Their Own Attention Span

Authors: Namgyu Ho*, Huzama Ahmad*, Woosung Koh*, Se-Young Yun, Tal Schuster, Cicero Nogueira dos Santos (* equal contribution)

A prompting protocol that lets a model declare where it will attend, cutting decoding attention cost by up to 53.1% at near-zero accuracy loss.
Type: preprint Venue: Under Review Year: 2026

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

Authors: Huzama Ahmad, Se-Young Yun

A plug-in selector that matches dense accuracy at long context while decoding 3.9× faster than FlashAttention.
Type: workshop Venue: AdaptFM @ ICML Year: 2026 Oral Presentation

BASTION: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

Authors: Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun

Budget-aware speculative decoding with tree-structured diffusion drafting, up to 6.61× faster than autoregressive decoding.
System Status: Active

GPU Cluster: OSI Lab

Dates: Jan 2026 – Present

Built and operate a 9-node, 67-GPU Slurm cluster for 50+ researchers, handling identity, networking, storage, and strict per-job GPU isolation.
Type: preprint Venue: arXiv Year: 2025

CascadeFormer: Depth-Tapered Transformers Motivated by Gradient Fan-in Asymmetry

Authors: Huzama Ahmad, Cao Viet Hai Nam, Se-Young Yun

Depth-tapered Transformers and gradient-based layer pruning, both motivated by Gradient Fan-in Asymmetry (the structural reason deep layers contribute less), cutting latency by 8.6% and raising throughput by 9.4% at equal perplexity.
Type: conference Venue: ACL Year: 2025 Oral Presentation

Diffusion Models Through a Global Lens: Are They Culturally Inclusive?

Authors: Zahra Bayramli, Ayhan Suleymanzade, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, Alice Oh

CULTDIFF: a ten-country benchmark exposing where text-to-image diffusion models miss cultural specificity, plus a metric that tracks human judgment.
Type: workshop Venue: C3NLP @ NAACL Year: 2025 Outstanding Paper Award

When Tom Eats Kimchi: Evaluating Cultural Awareness of Multimodal Large Language Models in Cultural Mixture Contexts

Authors: Jun Seong Kim, Kyaw Ye Thu, Javad Ismayilzada, Junyeong Park, Eunsu Kim, Huzama Ahmad, Na Min An, James Thorne, Alice Oh

MIXCUBE: a cross-cultural benchmark showing that multimodal LLMs misjudge cultural entities depending on a person's ethnicity, with accuracy gaps of up to 58% in low-resource cultures.
Project Status: Completed

Foundations of Efficient LLMs

Dates: Oct 2023 – Sep 2025

Two years spent measuring how much of a large model's computation is actually necessary, and building the groundwork for the long-context sparsity work that followed. Topics: GSPMD training on TPU pods, context compression, uncertainty-aware prediction, and depth-tapered Transformers, with contributions upstreamed to PyTorch/XLA and Transformers.
System Status: Active

HomeLab

Dates: Jun 2023 – Present

A personal cloud built from the silicon up, running on Proxmox and TrueNAS over ZFS with pfSense on a 2.5-gig network, where production-grade tooling gets a playground's freedom.
Project Status: Completed

Self-Guided Framework for Improving Arithmetic Reasoning in Large Language Models with Reinforcement Learning

Dates: May 2023 – Sep 2023

A self-tuning framework in which a model sharpens its own math reasoning by judging its greedy answer against its own sampled alternatives, lifting accuracy by up to 5% across four benchmarks with no human grader in the loop.
Project Status: Completed

ChatGPT in Low-Resource Languages

Dates: Mar 2023 – Dec 2023

Advised eight undergraduate teams stress-testing how well GPT-3.5 holds up once you leave English, across reasoning, sentiment, QA, and standardized exams, each in a different low-resource language.
System Status: Completed

GPU Cluster: XFACT Lab

Dates: Mar 2023 – Dec 2025

Maintained the identity, storage, and scheduling layer for the XFACT lab's 6-node GPU cluster, using FreeIPA, TrueNAS, and Slurm.