Jiaqi Leng (冷家祺)

I am an incoming PhD student at NYU, currently completing my undergraduate degree in Computer Science and Technology at Fudan University.

I am currently working with Prof. Yucheng Lu on efficient language modeling, with a particular emphasis on byte-level architectures. Previously, I was a research intern at Ant Group, where I studied sparse attention mechanisms for large language models.

My research interests include:

  • Efficient deep learning and model architectures
  • Long-context modeling and length extrapolation
  • Sparse attention mechanisms

In Fall 2024, I was an exchange student at The University of Texas at Austin.

I am always happy to discuss research and explore opportunities for academic collaboration.


News

Mar 2026 I will be attending ICLR and presenting our work on length-generalizable sparse attention. See you in Brazil!
Mar 2026 I will join NYU as a PhD student in Fall 2026.

Selected Publications

  1. ICLR 2026
    Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
    In Proceedings of the 14th International Conference on Learning Representations, 2026
  2. Preprint
    Distilling Token-Trained Models into Byte-Level Models
    Zishuo Bao, Jiaqi Leng, Junxiong Wang, Bowen Peng, and Yucheng Lu
    arXiv preprint, 2026
  3. NeurIPS 2025
    Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
    Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, and Wei Wu
    In Proceedings of the 39th Conference on Neural Information Processing Systems, 2025