arxiv:2603.15031

Attention Residuals

Published on Mar 16 · Submitted by taesiri on Mar 17 · #2 Paper of the day

Abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
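
The contrast the abstract draws can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: it assumes each layer has its own query vector and that keys are RMS-normalized copies of the earlier outputs, and the names `rms_norm`, `residual_sum`, and `attn_res` are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each row to unit root-mean-square (assumed key normalization)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def residual_sum(sources):
    """Standard residual stream: every earlier output gets fixed weight 1."""
    return sources.sum(axis=0)

def attn_res(sources, query):
    """Softmax-weighted aggregation over earlier outputs for one token.

    sources: (n_sources, d) outputs of the embedding and preceding layers.
    query:   (d,) per-layer query vector (assumed to be a learned parameter).
    """
    keys = rms_norm(sources)
    scores = keys @ query / np.sqrt(sources.shape[-1])
    scores = scores - scores.max()        # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()     # positive weights that sum to 1
    return weights @ sources, weights

rng = np.random.default_rng(0)
sources = rng.normal(size=(12, 64))       # 12 earlier outputs, width 64 (toy sizes)
query = rng.normal(size=64)
mixed, weights = attn_res(sources, query)
```

Because the weights form a convex combination, the norm of `mixed` cannot exceed the largest source norm, whereas `residual_sum` has no such bound as more sources are added.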

Community

Interesting idea 👍

block-attn residuals are the most interesting part here; replacing uniform depth-wise sums with learned, block-level attention feels like the right granularity to keep information flow stable without blasting memory. my main question is how block size and the number of blocks n trade off: did you run ablations across n ∈ {2,4,8} for a fixed depth, and is there a sweet spot where gains persist with modest memory overhead? the arxivLens breakdown helped me parse the details, especially the two-phase compute and the cache-based communication, which otherwise felt easy to underestimate. i’d be curious how the approach behaves on models with highly imbalanced layer budgets or in settings where some layers are more compute-heavy, to see if the learned attention still stabilizes training.

Paper author

@avahal We chose 8 as the default value primarily to align with mHC standards while maximizing the number of blocks. Although increasing the block count generally improves performance, we found that for large-scale LLMs, this number is strictly constrained by communication overhead and memory pressure. Ultimately, 8 represents the optimal balance between these factors.

You can see the impact of various block sizes on the loss in the figure below.

[Figure: loss as a function of block size]

found a solid walkthrough of this paper at https://arxivexplained.com/papers/attention-residuals if anyone wants more context, the way they handle residual connections in attention layers is actually pretty clever

Here are the main results from the Attention Residuals (AttnRes) technical report, organized by key findings:

1. Scaling Law Improvements

Figure 4 shows that both Full AttnRes and Block AttnRes consistently outperform the PreNorm baseline across all compute budgets:

  • Compute Efficiency: At 5.6 PFLOP/s-days, Block AttnRes achieves validation loss of 1.692 versus the baseline's 1.714, equivalent to a 1.25× compute advantage (matching the loss of a baseline trained with 25% more compute).
  • Scaling Curves: All variants follow power-law scaling $\mathcal{L} = A \times C^{-\alpha}$, but AttnRes achieves lower intercepts:
    • Baseline: $\mathcal{L} = 1.891 \times C^{-0.057}$
    • Block AttnRes: $\mathcal{L} = 1.870 \times C^{-0.058}$
    • Full AttnRes: $\mathcal{L} = 1.865 \times C^{-0.057}$

| Model Size | Baseline | Block AttnRes (N=8) | Full AttnRes |
|---|---|---|---|
| 194M act. | 1.931 | 1.909 | 1.899 |
| 436M act. | 1.766 | 1.746 | 1.737 |
| 528M act. | 1.719 | 1.693 | 1.692 |
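
The 1.25× figure can be checked directly from the fitted curves quoted above. The sketch below evaluates both power laws at 5.6 PFLOP/s-days and then inverts the baseline curve to find the compute it would need to match Block AttnRes; the helper names `loss` and `compute_for_loss` are hypothetical.

```python
def loss(a, alpha, compute):
    """Fitted power law L = A * C**(-alpha), with C in PFLOP/s-days."""
    return a * compute ** (-alpha)

def compute_for_loss(a, alpha, target):
    """Invert the power law: compute needed to reach a target loss."""
    return (a / target) ** (1.0 / alpha)

# (A, alpha) pairs as quoted in the scaling-curve bullets.
BASELINE = (1.891, 0.057)
BLOCK = (1.870, 0.058)

c = 5.6
baseline_loss = loss(*BASELINE, c)
block_loss = loss(*BLOCK, c)

# Compute the baseline would need to reach Block AttnRes's loss at c.
baseline_compute = compute_for_loss(*BASELINE, block_loss)
advantage = baseline_compute / c

print(round(baseline_loss, 3), round(block_loss, 3), round(advantage, 2))
# prints: 1.714 1.692 1.25
```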

2. Training Dynamics

Figure 5 reveals three critical improvements in training stability:

(a) Validation Loss: AttnRes maintains consistently lower loss throughout training, with the gap widening during the decay phase.

(b) Output Magnitude (PreNorm Dilution Mitigation):

  • Baseline: Shows monotonic $O(L)$ growth in hidden-state magnitudes with depth (PreNorm dilution), forcing deeper layers to produce increasingly large outputs to remain influential.
  • AttnRes: Exhibits a bounded periodic pattern—magnitudes reset at block boundaries due to selective aggregation, preventing the progressive dilution problem.

(c) Gradient Distribution:

  • Baseline shows disproportionately large gradients in early layers
  • AttnRes achieves substantially more uniform gradient norms across depth due to competitive softmax normalization among sources

3. Downstream Task Performance

Table 3: Block AttnRes matches or outperforms the baseline on all 15 evaluated benchmarks, with particularly strong gains on multi-step reasoning:

| Category | Task | Baseline | AttnRes | Δ |
|---|---|---|---|---|
| Reasoning | GPQA-Diamond | 36.9 | 44.4 | +7.5 |
| Math | MATH | 53.5 | 57.1 | +3.6 |
| Code | HumanEval | 59.1 | 62.2 | +3.1 |
| Knowledge | MMLU | 73.5 | 74.6 | +1.1 |
| Chinese | C-Eval | 79.6 | 82.5 | +2.9 |

The pattern suggests improved depth-wise information flow particularly benefits compositional tasks where later layers must selectively retrieve earlier representations.

4. Architecture & Ablation Insights

Optimal Architecture Shift (Figure 7): Under fixed compute/parameters, AttnRes shifts the optimal configuration toward deeper, narrower networks:

  • Baseline optimum: $d_{\text{model}}/L_b \approx 60$ (1.847 loss)
  • AttnRes optimum: $d_{\text{model}}/L_b \approx 45$ (1.802 loss)

This indicates AttnRes exploits additional depth more effectively than standard residuals.

Component Ablations (Table 4) on a 16-layer model:

  • Full AttnRes: 1.737 (best)
  • Block AttnRes (S=4): 1.746 (vs. 1.766 baseline)
  • DenseFormer (fixed weights): 1.767 (no improvement)
  • mHC: 1.747
  • Sliding Window (W=8): 1.764 (worse than Block)

Key findings:

  • Input-dependent query (projected from hidden state) achieves 1.731 but adds parameters
  • Removing RMSNorm degrades performance (1.743 vs 1.737)
  • Multi-head depth attention hurts performance (1.752), indicating optimal depth-wise mixing is largely uniform across channels

Learned Attention Patterns (Figure 8):

  • Diagonal dominance: Each layer attends most strongly to its immediate predecessor (locality preserved)
  • Skip connections: Persistent attention to token embedding (source 0) and occasional off-diagonal concentrations indicate learned long-range dependencies
  • Block preservation: Block AttnRes (N=8) recovers the essential structure of Full AttnRes with sharper, more decisive weight distributions

5. Efficiency Results

Memory & Communication:

  • Block AttnRes reduces per-layer memory from $O(Ld)$ to $O(Nd)$ where $N \approx 8$
  • Cross-stage caching under pipeline parallelism reduces communication from $O(C^2)$ to $O(P^2V)$ (where $C$=chunks, $P$=physical stages, $V$=virtual stages)
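
To make the $O(Ld)$ versus $O(Nd)$ comparison concrete, here is a rough back-of-the-envelope sketch. The layer count, width, and 2-byte values are hypothetical, the contiguous near-equal split in `partition` is an assumption about how blocks are formed, and constant factors (for example the embedding or a partially filled block) are ignored.

```python
def partition(num_layers, num_blocks):
    """Contiguous, near-equal split of layer indices into blocks (assumed)."""
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def source_bytes(num_sources, d_model, bytes_per_value=2):
    """Bytes per token needed to keep `num_sources` vectors of width d_model."""
    return num_sources * d_model * bytes_per_value

# Hypothetical configuration, chosen only for illustration.
L, N, d = 48, 8, 4096

blocks = partition(L, N)
full = source_bytes(L, d)      # one stored vector per layer
block = source_bytes(N, d)     # one stored vector per block
ratio = full // block          # 6 for this configuration
```

With these numbers each block holds 6 layers, so the stored sources shrink by the same factor of 6.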

Overhead:

  • Training: <4% wall-clock overhead under pipeline parallelism; negligible without PP
  • Inference: <2% latency overhead on typical workloads
  • Memory I/O: Block AttnRes requires only $5.5d$ per layer vs. $34d$ for mHC (m=4) and $3d$ for standard residuals (Table 1)

The two-phase computation strategy (Algorithm 1) enables this efficiency by:

  1. Phase 1: Batching inter-block attention across all $S$ layers in a block (amortizing reads)
  2. Phase 2: Sequential intra-block attention with online softmax merging
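
The online softmax merge in Phase 2 is the standard trick of carrying a running maximum and a running sum of exponentials, so that two partial attention results can be combined exactly. The sketch below shows only that merge. Mapping the first partial result to inter-block sources and the second to intra-block sources is an assumption about Algorithm 1, and the function names are hypothetical.

```python
import numpy as np

def partial_attention(scores, values):
    """Unnormalized softmax statistics over one set of sources.

    Returns (m, s, acc): the max score, the sum of exp(score - m),
    and the exp-weighted sum of the values.
    """
    m = scores.max()
    p = np.exp(scores - m)
    return m, p.sum(), p @ values

def merge(a, b):
    """Combine two partial results as if softmax ran over both sets at once."""
    (m_a, s_a, acc_a), (m_b, s_b, acc_b) = a, b
    m = max(m_a, m_b)
    w_a, w_b = np.exp(m_a - m), np.exp(m_b - m)   # rescale to the shared max
    return m, w_a * s_a + w_b * s_b, w_a * acc_a + w_b * acc_b

def finalize(state):
    """Divide the accumulated numerator by the softmax denominator."""
    _, s, acc = state
    return acc / s

rng = np.random.default_rng(0)
scores = rng.normal(size=10)
values = rng.normal(size=(10, 16))

inter = partial_attention(scores[:7], values[:7])   # e.g. earlier blocks
intra = partial_attention(scores[7:], values[7:])   # e.g. current block
merged = finalize(merge(inter, intra))

# Reference: one softmax over all ten sources.
w = np.exp(scores - scores.max())
reference = (w / w.sum()) @ values
```

`merged` matches `reference` up to floating-point error, which is what lets the first set of sources be processed in a batch and the second be folded in later.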

Summary

AttnRes replaces the fixed uniform accumulation of standard residuals with learned, input-dependent depth-wise attention. Block AttnRes (with ~8 blocks) captures most of the benefit while remaining practical at scale, serving as a drop-in replacement with minimal overhead that consistently improves performance across all model sizes and tasks evaluated.
