Huzama Ahmad — Updates

Publications and projects from Huzama Ahmad, Ph.D. candidate at KAIST AI working on efficient language modeling.
https://huzama.com/
en-us

Large Language Models Can Control Their Own Attention Span
https://huzama.com/publications/2026-05-attention-span/

LLMs spend most of their attention on a small fraction of the context, yet they must read the entire KV cache at every decoding step to find it. In a 1M-token conversation where the user asks about a detail mentioned briefly in the middle, the model still scans the full history to generate each word of the reply. Existing methods for reducing this cost mostly approximate an attention mask from internal activations, but the approximation still costs O(N) per step. In this paper, we introduce Dynamic Attention Control (DAC), a prompting protocol that prompts the model to declare where it will attend as part of its chain of thought, partitioning generation into three modes: global (full context), focus (a specific region), and local (recent context only). The attention mask is read directly from the model's reasoning, with no auxiliary scorer. This design lets the model modify its own attention pattern on the fly at inference time through the text it generates. Across 15 long-context tasks on off-the-shelf models, DAC reduces decoding attention cost by 15.1% on Qwen-3.6-27B and 53.1% on Gemma-4-31B, with zero-shot accuracy drops of only 1.41 pp and 0.08 pp, respectively. We extend vLLM to support in-place KV cache masking with block-aligned masks compatible with state-of-the-art kernels, including FlashAttention.
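The three modes described above can be illustrated with a small sketch. This is not the paper's implementation: the block size, the local window length, the token-range representation of a focus region, and the function name `block_mask` are all assumptions made for illustration. It only shows how a declared mode could be turned into a block-aligned keep/skip mask over the KV cache.

```python
from math import ceil


def block_mask(mode, num_tokens, block_size=16, focus=None, local_window=64):
    """Return one boolean per KV-cache block: True = attend, False = skip.

    Hypothetical helper: the parameter names and defaults are illustrative
    assumptions, not values taken from the paper.
    """
    num_blocks = ceil(num_tokens / block_size)

    if mode == "global":
        # Full context: every block stays visible.
        return [True] * num_blocks

    if mode == "local":
        # Recent context only: keep blocks overlapping the last `local_window` tokens.
        start = max(0, num_tokens - local_window)
        first = start // block_size
        return [b >= first for b in range(num_blocks)]

    if mode == "focus":
        # A specific region, given here as a half-open token range [lo, hi).
        if focus is None:
            raise ValueError("focus mode needs a (lo, hi) token range")
        lo, hi = focus
        if not 0 <= lo < hi <= num_tokens:
            raise ValueError("focus range must lie inside the context")
        first, last = lo // block_size, (hi - 1) // block_size
        return [first <= b <= last for b in range(num_blocks)]

    raise ValueError(f"unknown mode: {mode!r}")
```

Rounding the region outward to whole blocks means a mask never cuts a block in half, which is the property a block-aligned kernel would need. The fraction of `False` entries gives a rough idea of how much of the cache a decoding step could skip under these assumptions.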
Thu, 01 Jan 2026 00:00:00 GMT
publication, long-context, efficiency, kv-cache
Namgyu Ho, Huzama Ahmad, Woosung Koh, Cicero Nogueira dos Santos, Tal Schuster, Se-Young Yun

BASTION: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting
https://huzama.com/publications/2026-05-bastion/

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61× speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
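The best-first expansion in component (3) can be sketched in a few lines. This is a simplified illustration, not BASTION itself: the function name `build_draft_tree`, the input format, the product-of-marginals path confidence, and the fixed `cost_per_node` threshold (a stand-in for the online latency estimator) are all assumptions.

```python
import heapq


def build_draft_tree(marginals, cost_per_node=0.05, max_nodes=64):
    """Grow a draft tree best-first under a node budget.

    marginals[d] is a list of (token, prob) candidates for draft position d.
    Returns (path, confidence) pairs in the order nodes were added.
    Hypothetical sketch: a constant per-node cost replaces a real latency model.
    """
    tree = []
    # heapq is a min-heap, so confidences are negated to pop the best node first.
    heap = [(-p, (tok,)) for tok, p in marginals[0]] if marginals else []
    heapq.heapify(heap)

    while heap and len(tree) < max_nodes:
        neg_conf, path = heapq.heappop(heap)
        conf = -neg_conf
        if conf < cost_per_node:
            # Children never exceed their parent's confidence, so every
            # remaining candidate is also below the threshold.
            break
        tree.append((path, conf))
        depth = len(path)
        if depth < len(marginals):
            # Path confidence is the product of the marginals along the path.
            for tok, p in marginals[depth]:
                heapq.heappush(heap, (-conf * p, path + (tok,)))
    return tree
```

If path confidence is read as the probability that a node is accepted, the sum of confidences over the tree is one simple estimate of the expected accepted length, and each added node contributes its own confidence as marginal gain. That is why the loop can stop at the first candidate whose confidence falls below the assumed cost.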
Thu, 01 Jan 2026 00:00:00 GMT
publication, speculative-decoding, efficiency, long-context
Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers
https://huzama.com/publications/2026-05-spotattention/

Long contexts have become standard in pretrained LLMs, yet they remain expensive to run: prefill compute grows quadratically with sequence length, and every decode step re-reads a key-value cache that grows linearly with it. Sparse attention cuts these costs by attending only to a relevant subset of past tokens, but selecting that subset is itself expensive. We present SpotAttention, a lightweight selector that attaches to a frozen pretrained transformer and learns by KL distillation to estimate its attention distribution. The selector picks the top-K keys each query attends to, and because its estimate is a calibrated distribution, a dual top-p rule reads the per-query, per-layer budget directly from it. Across Qwen3 (dense, 4B–32B) and Qwen3.5 (hybrid linear/full attention, 4B–9B), SpotAttention matches dense accuracy at contexts up to 128K tokens, eight times the training length. Decode at L=128K runs 3.9× faster than FlashAttention and 1.8× faster than Twilight, the strongest training-free baseline. Quantizing the selector's K-cache to INT4 or FP4 microscale shrinks it 3.5× at no accuracy cost.
Thu, 01 Jan 2026 00:00:00 GMT
publication, long-context, efficiency, sparse-attention
Huzama Ahmad, Se-Young Yun

Gradient Fan-in Asymmetry: The Structural Cause of Layer Redundancy in Deep Transformers
https://huzama.com/projects/2025-09-gradient-fan-in-asymmetry/

Deep Transformers are composed of uniformly stacked residual blocks, yet their deepest layers often add little value.
Prevailing explanations attribute this to small gradients, treating a symptom rather than the cause. We identify Gradient Fan-in Asymmetry as the structural driver of redundancy. In Pre-LayerNorm residual stacks, the gradient at a layer is the sum of an identity path and all downstream functional paths, producing a gradient fan-in that decays linearly with depth (and quadratically under deep supervision), yielding rich signals for early layers and sparse signals for later ones. Across Transformers and ResNets, accumulated training gradients follow the theoretical fan-in and predict post hoc layer importance. Two causal interventions isolate structure as the bottleneck: equalizing per-layer gradient norms does not restore late-layer value, whereas increasing downstream path counts via parameter-shared repetition restores and elevates their impact. Building on this mechanism, we propose CascadeFlow Pruning, which removes layers using accumulated training gradients and outperforms standard heuristics without expensive post hoc analysis. We also introduce CascadeFormer, which tapers width with depth to match the natural information flow, achieving comparable perplexity to a uniform baseline at the same training budget while reducing latency by 8.6% and increasing throughput by 9.4%.
Mon, 01 Sep 2025 00:00:00 GMT
project, efficiency, interpretability
Huzama Ahmad, Cao Viet Hai Nam, Se-Young Yun

When Tom Eats Kimchi: Evaluating Cultural Awareness of Multimodal Large Language Models in Cultural Mixture Contexts
https://huzama.com/publications/2025-05-tom-eats-kimchi/

In a highly globalized world, it is important for multi-modal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (Korean food) in an image both when an Asian woman is eating it and when an African man is eating it.
However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicities, we introduce MIXCUBE, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbation for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures.
Wed, 01 Jan 2025 00:00:00 GMT
publication, multimodal, cultural-evaluation, llms
Jun Seong Kim, Kyaw Ye Thu, Javad Ismayilzada, Junyeong Park, Eunsu Kim, Huzama Ahmad, Na Min An, James Thorne, Alice Oh

Diffusion Models Through a Global Lens: Are They Culturally Inclusive?
https://huzama.com/publications/2025-07-diffusion-cultural-lens/

Text-to-image diffusion models have recently enabled the creation of visually compelling, detailed images from textual prompts. However, their ability to accurately represent various cultural nuances remains an open question. In our work, we introduce the CULTDIFF benchmark, evaluating whether state-of-the-art diffusion models can generate culturally specific images spanning ten countries. Through a fine-grained analysis of different similarity aspects, we show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented country regions, and we find significant disparities in cultural relevance, description fidelity, and realism compared to real-world reference images. With the collected human evaluations, we develop a neural-based image-image similarity metric, namely CULTDIFF-S, to predict human judgment on real and generated images with cultural artifacts.
Our work highlights the need for more inclusive generative AI systems and equitable dataset representation over a wide range of cultures.
Wed, 01 Jan 2025 00:00:00 GMT
publication, diffusion-models, cultural-evaluation, multimodal
Zahra Bayramli, Ayhan Suleymanzade, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, Alice Oh

PAC: Analyzing Efficacy of Pivot Techniques in LLMs for Low-Resource Languages
https://huzama.com/projects/2023-09-pac-pivot-low-resource/

This study investigates the effectiveness of Large Language Models (LLMs) in processing Low Resource Languages (LRLs) using a novel approach called Pivot-Assisted Consensus (PAC), which integrates a pivot language with a multi-source consensus mechanism. A comprehensive series of ablation experiments was conducted to evaluate the performance of this method on various linguistic tasks, including mathematical reasoning, sentiment analysis, and natural language inference. The research shows that linguistic and cultural compatibility play a crucial role in pivot language selection, leading to a significant improvement in task accuracy across all examined scenarios. It highlights the significance of cultural awareness and the utilization of multiple language resources to overcome data scarcity for LRLs, resulting in more sophisticated and accurate LLM outputs. In addition, we provide comprehensive analyses of our results to improve the understanding of LLM capabilities, which facilitates the development of more transparent and interpretable models.
Fri, 01 Sep 2023 00:00:00 GMT
project, multilingual, cultural
Uzair Ahmed, Muhammad Faizan Zahid, Huzama Ahmad, James Thorne, Alice Oh

Self-Guided Framework for Improving Arithmetic Reasoning in Large Language Models with Reinforcement Learning
https://huzama.com/projects/2023-07-steper-self-guided-rl/

Large language models have demonstrated their ability for multi-step reasoning in complex arithmetic problems when prompted with chain-of-thought instructions. This paper introduces a novel self-guided framework that uses reinforcement learning to improve the reasoning capabilities of large language models. Our framework encourages the generation of logical explanations by actively exploring and refining various reasoning paths, with self-logicality serving as a reward signal. Experimental results show the effectiveness of our approach on both encoder-decoder and autoregressive models. Quantitative evaluations on four different arithmetic reasoning datasets show that language models can achieve precise reasoning abilities through our framework. Additionally, evaluations conducted by human experts and automated systems confirm that our framework leads to improved logicality and coherence in chain-of-thought reasoning.
Sat, 01 Jul 2023 00:00:00 GMT
project, reasoning-rl, self-guided
Jiwoo Hong, Huzama Ahmad, Minsu Kim, James Thorne