<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/rss.xsl" type="text/xsl"?><rss version="2.0"><channel><title>Huzama Ahmad — Updates</title><description>Publications and projects from Huzama Ahmad — Ph.D. candidate at KAIST AI working on efficient language modeling.</description><link>https://huzama.com/</link><language>en-us</language><item><title>Large Language Models Can Control Their Own Attention Span</title><link>https://huzama.com/publications/2026-05-attention-span/</link><guid isPermaLink="true">https://huzama.com/publications/2026-05-attention-span/</guid><description>LLMs spend most of their attention on a small fraction of context, yet
they must read the entire KV cache at every decoding step to find it. In a
1M-token conversation where the user asks about a detail mentioned briefly
in the middle, the model still scans the full history to generate each
word of the reply. Existing methods for reducing this cost mostly
approximate an attention mask from internal activations, but the
approximation still costs O(N) per step. In this paper, we introduce
Dynamic Attention Control (DAC), a prompting protocol that has the
model declare where it will attend as part of its chain of thought,
partitioning generation into three modes: global (full context), focus (a
specific region), and local (recent context only). The attention mask is
read directly from the model&apos;s reasoning, with no auxiliary scorer. This
design lets the model modify its internal attention function on the fly
at inference time through its own generated text. Across 15
long-context tasks on off-the-shelf models, DAC
reduces decoding attention cost by 15.1% on Qwen-3.6-27B and 53.1% on
Gemma-4-31B, with zero-shot accuracy drops of only 1.41pp and 0.08pp, respectively. We
extend vLLM to support in-place KV cache masking with block-aligned masks
compatible with state-of-the-art kernels including FlashAttention.
</description><pubDate>Thu, 01 Jan 2026 00:00:00 GMT</pubDate><category>publication</category><category>long-context</category><category>efficiency</category><category>kv-cache</category><author>Namgyu Ho, Huzama Ahmad, Woosung Koh, Cicero Nogueira dos Santos, Tal Schuster, Se-Young Yun</author></item><item><title>BASTION: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting</title><link>https://huzama.com/publications/2026-05-bastion/</link><guid isPermaLink="true">https://huzama.com/publications/2026-05-bastion/</guid><description>Block-diffusion drafters have recently emerged as a powerful alternative
for speculative decoding by predicting multiple future-token distributions
in a single parallel step. However, since these parallel predictions are
sampled from position-wise marginals rather than fully conditioned
sequences, committing to a single greedy path often fails to capture the
target model&apos;s preferred trajectory. To address this, we propose BASTION,
a budget-aware speculative decoding framework with tree-based diffusion
drafting. Unlike existing methods that rely on static tree topologies,
BASTION dynamically constructs query-dependent trees by balancing draft
quality against hardware constraints. Our framework integrates three
synergistic components: (1) an acceptance surrogate that estimates expected
accepted length via path confidence, (2) an online latency estimator that
calibrates a hardware-aware roofline model, and (3) an adaptive best-first
expansion that grows the tree until marginal gains no longer justify
incremental verification costs. BASTION is training-free, preserves the
target model&apos;s distribution, and requires no per-setting tuning. Across
diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61×
speedup over standard autoregressive decoding, outperforming
state-of-the-art block-diffusion baselines by 39%.
</description><pubDate>Thu, 01 Jan 2026 00:00:00 GMT</pubDate><category>publication</category><category>speculative-decoding</category><category>efficiency</category><category>long-context</category><author>Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun</author></item><item><title>SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers</title><link>https://huzama.com/publications/2026-05-spotattention/</link><guid isPermaLink="true">https://huzama.com/publications/2026-05-spotattention/</guid><description>Long contexts have become standard in pretrained LLMs, yet they remain
expensive to run: prefill compute grows quadratically with sequence
length, and every decode step re-reads a key-value cache that grows
linearly with it. Sparse attention cuts these costs by attending only to a
relevant subset of past tokens, but selecting that subset is itself
expensive. We present SpotAttention, a lightweight selector that attaches
to a frozen pretrained transformer and learns by KL distillation to
estimate its attention distribution. The selector picks the top-K keys
each query attends to, and because its estimate is a calibrated
distribution, a dual top-p rule reads the per-query, per-layer budget
directly from it. Across Qwen3 (dense, 4B–32B) and Qwen3.5 (hybrid
linear/full attention, 4B–9B), SpotAttention matches dense accuracy at
contexts up to 128K tokens, eight times the training length. Decode at
L=128K runs 3.9× faster than FlashAttention and 1.8× faster than Twilight,
the strongest training-free baseline. Quantizing the selector&apos;s K-cache to
INT4 or FP4 microscale shrinks it 3.5× at no accuracy cost.
</description><pubDate>Thu, 01 Jan 2026 00:00:00 GMT</pubDate><category>publication</category><category>long-context</category><category>efficiency</category><category>sparse-attention</category><author>Huzama Ahmad, Se-Young Yun</author></item><item><title>Gradient Fan-in Asymmetry: The Structural Cause of Layer Redundancy in Deep Transformers</title><link>https://huzama.com/projects/2025-09-gradient-fan-in-asymmetry/</link><guid isPermaLink="true">https://huzama.com/projects/2025-09-gradient-fan-in-asymmetry/</guid><description>Deep Transformers are composed of uniformly stacked residual blocks, yet
their deepest layers often add little value. Prevailing explanations
attribute this to small gradients, treating a symptom rather than the
cause. We identify Gradient Fan-in Asymmetry as the structural driver of
redundancy. In Pre-LayerNorm residual stacks, the gradient at a layer is
the sum of an identity path and all downstream functional paths,
producing a gradient fan-in that decays linearly with depth (and
quadratically under deep supervision), yielding rich signals for early
layers and sparse signals for later ones. Across Transformers and ResNets, accumulated
training gradients follow the theoretical fan-in and predict post hoc
layer importance. Two causal interventions isolate structure as the
bottleneck: equalizing per-layer gradient norms does not restore
late-layer value, whereas increasing downstream path counts via
parameter-shared repetition restores and elevates their impact. Building
on this mechanism, we propose CascadeFlow Pruning, which removes layers
using accumulated training gradients and outperforms standard heuristics
without expensive post hoc analysis. We also introduce CascadeFormer,
which tapers width with depth to match the natural information flow,
achieving comparable perplexity to a uniform baseline at the same
training budget while reducing latency by 8.6% and increasing throughput
by 9.4%.
</description><pubDate>Mon, 01 Sep 2025 00:00:00 GMT</pubDate><category>project</category><category>efficiency</category><category>interpretability</category><author>Huzama Ahmad, Cao Viet Hai Nam, Se-Young Yun</author></item><item><title>When Tom Eats Kimchi: Evaluating Cultural Awareness of Multimodal Large Language Models in Cultural Mixture Contexts</title><link>https://huzama.com/publications/2025-05-tom-eats-kimchi/</link><guid isPermaLink="true">https://huzama.com/publications/2025-05-tom-eats-kimchi/</guid><description>In a highly globalized world, it is important for multi-modal large
language models (MLLMs) to recognize and respond correctly to
mixed-cultural inputs. For example, a model should correctly identify
kimchi (Korean food) in an image both when an Asian woman is eating it
and when an African man is eating it. However, current MLLMs show an
over-reliance on the visual features of the person, leading to
misclassification of the entities. To examine the robustness of MLLMs to
different ethnicities, we introduce MIXCUBE, a cross-cultural bias
benchmark, and study elements from five countries and four ethnicities.
Our findings reveal that MLLMs achieve both higher accuracy and lower
sensitivity to such perturbation for high-resource cultures, but not for
low-resource cultures. GPT-4o, the best-performing model overall, shows
up to 58% difference in accuracy between the original and perturbed
cultural settings in low-resource cultures.
</description><pubDate>Wed, 01 Jan 2025 00:00:00 GMT</pubDate><category>publication</category><category>multimodal</category><category>cultural-evaluation</category><category>llms</category><author>Jun Seong Kim, Kyaw Ye Thu, Javad Ismayilzada, Junyeong Park, Eunsu Kim, Huzama Ahmad, Na Min An, James Thorne, Alice Oh</author></item><item><title>Diffusion Models Through a Global Lens: Are They Culturally Inclusive?</title><link>https://huzama.com/publications/2025-07-diffusion-cultural-lens/</link><guid isPermaLink="true">https://huzama.com/publications/2025-07-diffusion-cultural-lens/</guid><description>Text-to-image diffusion models have recently enabled the creation of
visually compelling, detailed images from textual prompts. However, their
ability to accurately represent various cultural nuances remains an open
question. In our work, we introduce the CULTDIFF benchmark, evaluating
whether state-of-the-art diffusion models can generate culturally
specific images spanning ten countries. Through a fine-grained analysis of
different similarity aspects, we show that these models often fail to
generate cultural artifacts in architecture, clothing, and food,
especially for underrepresented country regions. The analysis reveals
significant disparities in cultural relevance, description fidelity, and
realism compared to real-world reference images. With the collected human
evaluations, we develop a neural image-image similarity metric,
CULTDIFF-S, to predict human judgment on real and generated
images with cultural artifacts. Our work highlights the need for more
inclusive generative AI systems and equitable dataset representation over
a wide range of cultures.
</description><pubDate>Wed, 01 Jan 2025 00:00:00 GMT</pubDate><category>publication</category><category>diffusion-models</category><category>cultural-evaluation</category><category>multimodal</category><author>Zahra Bayramli, Ayhan Suleymanzade, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, Alice Oh</author></item><item><title>PAC: Analyzing Efficacy of Pivot Techniques in LLMs for Low-Resource Languages</title><link>https://huzama.com/projects/2023-09-pac-pivot-low-resource/</link><guid isPermaLink="true">https://huzama.com/projects/2023-09-pac-pivot-low-resource/</guid><description>This study investigates the effectiveness of Large Language Models (LLMs)
in processing Low Resource Languages (LRLs) using a novel approach called
Pivot-Assisted Consensus (PAC), which integrates a pivot language with a
multi-source consensus mechanism. A comprehensive series of ablation
experiments was conducted to evaluate the performance of this method on
various linguistic tasks, including mathematical reasoning, sentiment
analysis, and natural language inference. The research shows that
linguistic and cultural compatibility play a crucial role in pivot
language selection, leading to a significant improvement in task accuracy
across all examined scenarios. The study highlights the significance of cultural
awareness and the utilization of multiple language resources to overcome
data scarcity for LRLs, resulting in more sophisticated and accurate LLM
outputs. In addition, we provide comprehensive analyses of our results to
enhance the comprehension of LLM capabilities, which facilitates the
development of more transparent and interpretable models.
</description><pubDate>Fri, 01 Sep 2023 00:00:00 GMT</pubDate><category>project</category><category>multilingual</category><category>cultural</category><author>Uzair Ahmed, Muhammad Faizan Zahid, Huzama Ahmad, James Thorne, Alice Oh</author></item><item><title>Self-Guided Framework for Improving Arithmetic Reasoning in Large Language Models with Reinforcement Learning</title><link>https://huzama.com/projects/2023-07-steper-self-guided-rl/</link><guid isPermaLink="true">https://huzama.com/projects/2023-07-steper-self-guided-rl/</guid><description>Large language models have demonstrated their ability for multi-step
reasoning in complex arithmetic problems when prompted with
chain-of-thought instructions. This paper introduces a novel self-guided
framework that uses reinforcement learning to improve the reasoning
capabilities of large language models. Our framework encourages the
generation of logical explanations by actively exploring and refining
various reasoning paths, with self-logicality serving as a reward signal.
Experimental results show the effectiveness of our approach on both
encoder-decoder and autoregressive models. Quantitative evaluations on
four different arithmetic reasoning datasets show that language models
can achieve precise reasoning abilities through our framework.
Additionally, evaluations conducted by human experts and automated
systems confirm that our framework leads to improved logicality and
coherence in chain-of-thought reasoning.
</description><pubDate>Sat, 01 Jul 2023 00:00:00 GMT</pubDate><category>project</category><category>reasoning-rl</category><category>self-guided</category><author>Jiwoo Hong, Huzama Ahmad, Minsu Kim, James Thorne</author></item></channel></rss>