Publications

Publications by Jiaqi Leng, generated and organized with jekyll-scholar.

2026

ICLR 2026

Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

Jiaqi Leng*, Xiang Hu*, Junxiong Wang, Jianguo Li, Wei Wu, and Yucheng Lu

In Proceedings of the 14th International Conference on Learning Representations, 2026

We analyze chunk-based sparse attention for long-context generalization and identify its essential ingredients.

Key principles:

Non-linear Chunk Encoder with a dedicated CLS token for retrieval

Bypassing Residual Path to integrate global information

Enforced selection sparsity in pre-training to close the train–test gap

Evidence: theoretical motivation and unified ablations

Result: state-of-the-art training-free length extrapolation, generalizing 4K-trained models to 32M tokens on RULER and BABILong.

Preprint

Distilling Token-Trained Models into Byte-Level Models

Zishuo Bao, Jiaqi Leng, Junxiong Wang, Bowen Peng, and Yucheng Lu

arXiv preprint, 2026

We propose a low-cost distillation pipeline to convert token-trained LLMs into byte-level models without training from scratch.

Method:

Progressive Knowledge Distillation to align byte representations with teacher embeddings

Byte-Level Supervised Fine-Tuning for end-to-end byte generation

Evaluation: Llama, Qwen, and OLMo

Result: comparable performance with only 125B bytes of data.
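
The alignment step in the method above can be illustrated with a toy loss. This is only a sketch under stated assumptions, not the paper's pipeline: the summary does not say how byte representations are matched to teacher embeddings, so the mean pooling over each token's byte span, the squared-error objective, and the helper names `token_byte_spans` and `alignment_loss` are all hypothetical.

```python
def token_byte_spans(tokens):
    """Return (start, end) UTF-8 byte offsets for each token string.

    Assumes every token is non-empty and that the tokens concatenate
    to the original text.
    """
    spans, pos = [], 0
    for tok in tokens:
        n = len(tok.encode("utf-8"))
        spans.append((pos, pos + n))
        pos += n
    return spans


def alignment_loss(byte_states, teacher_embeddings, spans):
    """Mean squared error between span-pooled byte states and teacher embeddings.

    byte_states: one vector per byte (student side).
    teacher_embeddings: one vector per token (teacher side).
    Mean pooling and MSE are illustrative assumptions.
    """
    total, count = 0.0, 0
    for (start, end), target in zip(spans, teacher_embeddings):
        span = byte_states[start:end]
        # Average the byte vectors that belong to this token, per dimension.
        pooled = [sum(col) / len(span) for col in zip(*span)]
        total += sum((p - t) ** 2 for p, t in zip(pooled, target))
        count += len(target)
    return total / count


spans = token_byte_spans(["hi", " é"])  # "é" occupies two bytes in UTF-8
byte_states = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [0.0, 2.0]]
loss = alignment_loss(byte_states, teacher, spans)
```

In this toy example the pooled byte vectors equal the teacher embeddings, so the loss is zero; a real pipeline would compute the student states with a trainable byte-level model.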

2025

NeurIPS 2025

Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access

Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, and Wei Wu

In Proceedings of the 39th Conference on Neural Information Processing Systems, 2025

We introduce Hierarchical Sparse Attention (HSA) to give RNNs efficient long-range random access without losing linear complexity.

Method: split the input into chunks, select the top-k chunks, and hierarchically aggregate them using token-level relevance; a hardware-aligned kernel keeps this efficient.

System: Mamba + HSA (RAMba)

Result: perfect passkey retrieval up to 64M tokens from 4K pre-training, plus strong downstream gains with near-constant memory.
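
The chunk, select, and aggregate steps described in the method above can be sketched for a single query as follows. This is a minimal illustration, not the HSA kernel: scoring a chunk by the mean of its keys, the softmax weighting at both levels, and the function name `chunk_sparse_attention` are assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def chunk_sparse_attention(query, keys, values, chunk_size, top_k):
    """Toy single-query top-k chunk attention (illustrative assumptions only)."""
    # 1. Split the token sequence into fixed-size chunks.
    starts = range(0, len(keys), chunk_size)
    chunks = [(keys[s:s + chunk_size], values[s:s + chunk_size]) for s in starts]

    # 2. Score each chunk via a summary vector (assumption: mean of its keys).
    dim = len(query)
    summaries = [[sum(k[d] for k in ck) / len(ck) for d in range(dim)]
                 for ck, _ in chunks]
    chunk_scores = [dot(query, s) for s in summaries]

    # 3. Keep the top-k chunks and normalise their scores into chunk weights.
    order = sorted(range(len(chunks)), key=lambda i: chunk_scores[i], reverse=True)
    selected = order[:top_k]
    chunk_weights = softmax([chunk_scores[i] for i in selected])

    # 4. Attend over tokens inside each selected chunk, then mix the
    #    per-chunk results using the chunk weights.
    out = [0.0] * len(values[0])
    for w, i in zip(chunk_weights, selected):
        ck, cv = chunks[i]
        token_weights = softmax([dot(query, k) for k in ck])
        for tw, v in zip(token_weights, cv):
            for d in range(len(out)):
                out[d] += w * tw * v[d]
    return out, sorted(selected)


keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0]]
values = [[1.0], [1.0], [2.0], [2.0]]
output, picked = chunk_sparse_attention([1.0, 0.0], keys, values,
                                        chunk_size=2, top_k=1)
```

With `top_k=1` only the second chunk is read, so the per-query cost depends on `top_k * chunk_size` plus the number of chunk summaries rather than on every token in the sequence.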