Attention
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper • 2501.08313 • Published • 300
Lizard: An Efficient Linearization Framework for Large Language Models
Paper • 2507.09025 • Published • 19
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
Paper • 2507.23632 • Published • 6
Causal Attention with Lookahead Keys
Paper • 2509.07301 • Published • 21
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Paper • 2510.04800 • Published • 37
Less is More: Recursive Reasoning with Tiny Networks
Paper • 2510.04871 • Published • 509
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Paper • 2510.04212 • Published • 26
Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models
Paper • 2510.03561 • Published • 25
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Paper • 2510.19338 • Published • 115
Kimi Linear: An Expressive, Efficient Attention Architecture
Paper • 2510.26692 • Published • 127
DoPE: Denoising Rotary Position Embedding
Paper • 2511.09146 • Published • 97
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Paper • 2511.20102 • Published • 28
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Paper • 2512.08829 • Published • 21
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
Paper • 2601.07832 • Published • 52
Reinforced Attention Learning
Paper • 2602.04884 • Published • 28
Prism: Spectral-Aware Block-Sparse Attention
Paper • 2602.08426 • Published • 36
SLA2: Sparse-Linear Attention with Learnable Routing and QAT
Paper • 2602.12675 • Published • 53