ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Abstract
ThriftAttention reduces the cost of long-context attention by computing only the most critical query-key interactions in higher precision and the rest in FP4, achieving near-full-precision quality at close to FP4 efficiency.
Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
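The two-stage flow described in the abstract can be illustrated with a small pure-Python sketch for a single query vector. This is not the authors' implementation. The selection heuristic (`select_blocks`, which ranks key blocks by the query's dot product with each block's mean key) and the quantiser (`fake_fp4`, a coarse uniform rounding grid) are hypothetical stand-ins chosen for illustration; real block-scaled FP4 formats and the paper's actual heuristic may differ. The part that follows directly from the abstract is `merge`, which combines the high-precision and low-precision partial results through the standard online-softmax rescaling.

```python
import math
import random


def fake_fp4(x, step=0.25):
    # Hypothetical stand-in for FP4: round onto a coarse uniform grid.
    # Real block-scaled FP4 formats behave differently.
    return round(x / step) * step


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def partial_attention(q, keys, values, low_precision):
    """Unnormalised attention for one query over a non-empty subset of keys.

    Returns (m, l, acc): the max score, the sum of exp(score - m), and the
    exp-weighted sum of value vectors. Only q and the keys are quantised on
    the low-precision path; values are left untouched in this sketch.
    """
    cast = fake_fp4 if low_precision else (lambda x: x)
    qc = [cast(x) for x in q]
    scores = [dot(qc, [cast(x) for x in k]) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    acc = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, l, acc


def merge(p1, p2):
    """Online-softmax merge of two partial results into one normalised output."""
    (m1, l1, a1), (m2, l2, a2) = p1, p2
    m = max(m1, m2)
    c1, c2 = math.exp(m1 - m), math.exp(m2 - m)  # rescale to the shared max
    l = l1 * c1 + l2 * c2
    return [(x * c1 + y * c2) / l for x, y in zip(a1, a2)]


def select_blocks(q, keys, block_size, budget):
    """Hypothetical heuristic: keep the `budget` key blocks whose mean key
    has the largest dot product with the query."""
    n_blocks = len(keys) // block_size

    def block_score(b):
        blk = keys[b * block_size:(b + 1) * block_size]
        mean = [sum(col) / block_size for col in zip(*blk)]
        return dot(q, mean)

    ranked = sorted(range(n_blocks), key=block_score, reverse=True)
    return set(ranked[:budget])


def mixed_attention(q, keys, values, block_size, budget):
    """Stage 1: pick blocks. Stage 2: two precision paths, merged once.

    Assumes 0 < budget < number of blocks so that both paths are non-empty.
    """
    chosen = select_blocks(q, keys, block_size, budget)
    hi_idx = [i for i in range(len(keys)) if i // block_size in chosen]
    lo_idx = [i for i in range(len(keys)) if i // block_size not in chosen]
    hi = partial_attention(q, [keys[i] for i in hi_idx],
                           [values[i] for i in hi_idx], low_precision=False)
    lo = partial_attention(q, [keys[i] for i in lo_idx],
                           [values[i] for i in lo_idx], low_precision=True)
    return merge(hi, lo)


def full_attention(q, keys, values):
    """Full-precision reference: softmax(q . K) applied to V."""
    m, l, acc = partial_attention(q, keys, values, low_precision=False)
    return [a / l for a in acc]
```

Calling `mixed_attention(q, keys, values, block_size, budget)` with `budget` set to a small share of the blocks mirrors the paper's setting of promoting only a few blocks. Because `merge` rescales both partial sums to a shared maximum before normalising, merging two full-precision partials reproduces `full_attention` up to floating-point rounding; any remaining difference in the mixed case comes from the quantised path alone.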
Community
Mixed-precision attention offers a way to get near-FP16 output quality at sub-byte (FP4) inference latency. On long-context evaluation benchmarks, promoting just 5% of the attention computation to FP16 recovers about 89% of the performance gap between FP4 and FP16 attention.
Selective FP16 promotion for a tiny fraction of query-key blocks walks a neat tightrope between fidelity and throughput. Would love to see how robust the 5% block budget is across highly skewed long-context distributions, and whether the identity of the promoted blocks shifts a lot when the input changes. The arxivlens breakdown helped me parse the method details, and the online softmax fusion across the FP4 and FP16 paths feels almost deceptively simple in practice. If this generalizes to other hardware and even longer contexts, it could be a practical default for mixed-precision attention in real deployments.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention (2026)
- Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference (2026)
- AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference (2026)
- BFLA: Block-Filtered Long-Context Attention Mechanism (2026)
- HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention (2026)
- Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs (2026)
- MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference (2026)