Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
In Proceedings of the 39th Conference on Neural Information Processing Systems, 2025
We introduce Hierarchical Sparse Attention (HSA) to give RNNs efficient long-range random access while preserving linear complexity.
- Method: chunk inputs, select top-k chunks, and hierarchically aggregate using token-level relevance; a hardware-aligned kernel keeps it efficient.
- System: Mamba + HSA (RAMba).
- Result: perfect passkey retrieval at up to 64M tokens despite pre-training on only 4K-token contexts, plus strong downstream gains with near-constant memory.
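The two-level selection described in the Method bullet can be illustrated with a minimal sketch: score chunks via a cheap summary (here, the mean key), keep the top-k, then run token-level attention only inside the selected chunks and combine the chunk outputs by their renormalized relevance. This is an illustrative simplification, not the paper's actual kernel or scoring function; `hsa_sketch`, the mean-key summary, and all parameter names are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hsa_sketch(q, K, V, chunk_size=4, top_k=2):
    """Two-level sparse attention for one query vector (illustrative only).

    1. Split K/V into fixed-size chunks and summarize each chunk
       (here: mean of its keys -- a stand-in for a learned summary).
    2. Score chunks against the query and keep the top-k.
    3. Attend over tokens inside each selected chunk, then combine
       chunk outputs weighted by renormalized chunk scores.
    """
    T, d = K.shape
    n_chunks = T // chunk_size
    Kc = K[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d)
    Vc = V[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d)

    # Level 1: chunk-level relevance via mean-key summaries.
    summaries = Kc.mean(axis=1)                   # (n_chunks, d)
    chunk_scores = summaries @ q / np.sqrt(d)     # (n_chunks,)
    sel = np.argsort(chunk_scores)[-top_k:]       # indices of top-k chunks
    weights = softmax(chunk_scores[sel])          # renormalize over selection

    # Level 2: token-level attention inside each selected chunk.
    out = np.zeros(d)
    for w, c in zip(weights, sel):
        attn = softmax(Kc[c] @ q / np.sqrt(d))    # (chunk_size,)
        out += w * (attn @ Vc[c])
    return out
```

Because only `top_k` chunks are ever touched per query, the per-token cost stays bounded as the sequence grows, which is the property that lets the selection mechanism coexist with a linear-complexity backbone.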