Jiaqi Leng (冷家祺)


I am a final-year undergraduate student in Computer Science and Technology at Fudan University. During Fall 2024, I was an exchange student at The University of Texas at Austin.

Currently, I am working as a research intern at NYU Shanghai with Prof. Yucheng Lu, focusing on efficient byte-level modeling. Previously, I worked as a research intern at Ant Group, collaborating with Xiang Hu on efficient attention mechanisms for large language models.

My research interests mainly lie in:

  • Efficient deep learning and model architectures
  • Long-context modeling and length extrapolation
  • Sparse attention mechanisms

Please feel free to reach out! 👋

selected publications

  1. Preprint
    Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
    arXiv preprint, 2025
  2. NeurIPS
    Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention
    Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, and Wei Wu
    NeurIPS, 2025