atlas-gdn

Hand-tuned Gated DeltaNet (GDN) kernels for the linear-attention path of Qwen3.6 hybrid models on NVIDIA GB10 (DGX Spark, SM121).

What's inside

Op                          Use
--------------------------  ---------------------------------------------------------
gdn_decode                  Single-token recurrent decode (FP32 Q/K/V, BF16 out)
gdn_prefill                 Multi-token prefill (BF16 throughout)
gdn_chunk2 / gdn_chunk3     MTP K=2/3 chunkwise verify (Qwen3.6 NVFP4 specialized)
gdn_wy2 / wy3 / wy4         2-pass WY-chunkwise verify (general K=2/3/4)
causal_conv1d_fwd           Depthwise causal Conv1d (SSM input projection)
causal_conv1d_update        Single-step Conv1d update (decode)
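
To make the relationship between gdn_decode and gdn_prefill concrete, here is a minimal pure-Python reference for one head, assuming the standard gated delta rule published for Gated DeltaNet: S_t = alpha_t * S_{t-1} * (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T, with output o_t = S_t q_t. The function names mirror the ops above, but the signatures, the argument names (state, alpha, beta), and the list-of-rows layout are assumptions made for illustration. This is not the kernels' actual API, and it ignores dtypes, batching, and any normalization the real kernels apply.

```python
def gdn_decode_step(state, q, k, v, alpha, beta):
    """One recurrent step of the gated delta rule for a single head.

    state: v_dim x k_dim matrix as a list of rows (hypothetical layout).
    q, k:  lists of length k_dim.  v: list of length v_dim.
    alpha: scalar decay gate.      beta: scalar write strength.
    Returns (new_state, output).
    """
    # 1. Gate: decay the whole state by alpha.
    decayed = [[alpha * s for s in row] for row in state]
    # 2. Read what the decayed state currently returns for key k.
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in decayed]
    # 3. Delta rule: rank-1 correction that moves that readout toward v.
    new_state = [
        [s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
        for row, vi, pi in zip(decayed, v, pred)
    ]
    # 4. Query the updated state.
    out = [sum(s * qj for s, qj in zip(row, q)) for row in new_state]
    return new_state, out


def gdn_prefill(state, qs, ks, vs, alphas, betas):
    """Reference prefill: apply the decode step token by token."""
    outs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        state, out = gdn_decode_step(state, q, k, v, a, b)
        outs.append(out)
    return state, outs
```

A sequential loop like this is only useful as a correctness oracle. Chunkwise and WY-style kernels compute the same recurrence for several tokens at once, so their outputs can be compared against it within floating-point tolerance.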

Hardware

These kernels target only NVIDIA GB10 (compute capability 12.1, sm_121f) and will not load on any other GPU. GB10 characteristics that shape the kernels:

  • Unified LPDDR5X memory (~273 GB/s): the kernels are bandwidth-bound, not occupancy-bound
  • No multi-CTA clusters (ClusterShape forced to 1×1×1)
  • No cvt.rn.satfinite.e2m1x2.f32 PTX instruction (E2M1 conversion is done in software)
  • Cooperative-only scheduling (no Pingpong)
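
As an illustration of what a software E2M1 conversion has to do, the sketch below rounds a float to the nearest 4-bit E2M1 code (1 sign bit, 2 exponent bits, 1 mantissa bit) with round-to-nearest-even and saturation to the largest finite magnitude, 6.0. It is a reference written for this README, not the Atlas code path. The function names are hypothetical, the nibble order in pack_e2m1x2 is an assumption, and NaN inputs are not handled.

```python
import math

# Magnitudes of the eight E2M1 codes 0b000..0b111 (exponent bias 1).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def f32_to_e2m1(x):
    """Return the 4-bit E2M1 code (sign in bit 3) nearest to x."""
    sign = 0x8 if math.copysign(1.0, x) < 0 else 0x0
    mag = min(abs(x), E2M1_MAGNITUDES[-1])  # saturate to 6.0
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_MAGNITUDES[best])
        d = abs(mag - E2M1_MAGNITUDES[code])
        # Ties go to the even code, i.e. the one whose mantissa bit is 0.
        if d < d_best or (d == d_best and code % 2 == 0):
            best = code
    return sign | best


def pack_e2m1x2(a, b):
    """Pack two converted values into one byte (a in the high nibble; assumed order)."""
    return (f32_to_e2m1(a) << 4) | f32_to_e2m1(b)
```

A table-driven loop keeps the rounding rule easy to audit. A kernel would more likely use a handful of compares or bit manipulation on the FP32 encoding, and could be checked exhaustively against a reference of this kind.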

build.toml pins cuda-capabilities = ["12.1"] so the build matrix yields a single SM121 binary; no fallback binaries are produced.
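
Because only an SM121 binary exists, callers may want to fail early with a clear message instead of hitting a load error. The following is a minimal sketch of such a guard. The helper names are hypothetical and not part of this package; the only fact taken from this README is that GB10 reports compute capability 12.1.

```python
SUPPORTED_CAPABILITY = (12, 1)  # GB10, per the Hardware section


def is_supported(capability):
    """True if a (major, minor) compute capability matches GB10."""
    return tuple(capability) == SUPPORTED_CAPABILITY


def require_gb10(capability):
    """Raise a descriptive error for any GPU other than GB10."""
    if not is_supported(capability):
        major, minor = capability
        raise RuntimeError(
            f"atlas-gdn kernels are built only for compute capability 12.1; "
            f"found {major}.{minor}"
        )
```

With PyTorch installed, the tuple can be obtained from torch.cuda.get_device_capability(), which returns (major, minor) for the current device, and passed to require_gb10 before the kernels are loaded.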

Models tested

Model                                         Layers using these kernels
--------------------------------------------  --------------------------
Qwen/Qwen3.6-27B (dense, hybrid)              48 GDN layers
Qwen/Qwen3.6-35B-A3B (sparse MoE, hybrid)     30 GDN layers

Provenance

Sources are extracted from the Atlas inference engine (https://github.com/Avarok-Cybersecurity/atlas, AGPL-3.0). The GDN NVFP4 variant ships with __launch_bounds__ annotations specific to Qwen3.6 hidden dimensions (k_dim=128, v_dim=128, 16/32 K/V heads).

License

AGPL-3.0-only.

OS: linux
Arch: aarch64