atlas-gdn

Hand-tuned Gated DeltaNet kernels for the linear-attention path of Qwen3.6 hybrid models on NVIDIA GB10 (DGX Spark, SM121).

What's inside

| Op | Use |
| --- | --- |
| `gdn_decode` | Single-token recurrent decode (FP32 Q/K/V, BF16 out) |
| `gdn_prefill` | Multi-token prefill (BF16 throughout) |
| `gdn_chunk2` / `gdn_chunk3` | MTP K=2/3 chunkwise verify (specialized for Qwen3.6 NVFP4) |
| `gdn_wy2` / `wy3` / `wy4` | 2-pass WY-chunkwise verify (general K=2/3/4) |
| `causal_conv1d_fwd` | Depthwise causal Conv1d (SSM input projection) |
| `causal_conv1d_update` | Single-step Conv1d update (decode) |
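
For orientation, here is a minimal pure-Python sketch of the math that a single-token decode step and a single-step Conv1d update typically compute. It assumes the standard gated delta rule, S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T with o_t = S_t q, for one head. The function names, argument layout, and the absence of q/k normalization or an activation are illustrative assumptions, not the signatures or exact behavior of `gdn_decode` or `causal_conv1d_update`.

```python
def gdn_decode_step(S, q, k, v, alpha, beta):
    """One gated delta rule step for a single head (reference sketch).

    S: state, a list of v_dim rows, each a list of k_dim floats.
    q, k: lists of length k_dim; v: list of length v_dim.
    alpha: scalar decay gate; beta: scalar write strength.
    Returns (new_S, o) where o has length v_dim.
    """
    new_S, o = [], []
    for row, v_i in zip(S, v):
        decayed = [alpha * s for s in row]
        # Value the decayed state currently associates with key k.
        pred = sum(s * k_j for s, k_j in zip(decayed, k))
        # Delta rule: write only the prediction error, scaled by beta.
        err = beta * (v_i - pred)
        new_row = [s + err * k_j for s, k_j in zip(decayed, k)]
        new_S.append(new_row)
        o.append(sum(s * q_j for s, q_j in zip(new_row, q)))
    return new_S, o


def causal_conv1d_update_ref(state, x, weight):
    """Single-step depthwise causal conv (reference sketch, no activation).

    state: per-channel list of the previous (width - 1) inputs, oldest first.
    x: new input per channel; weight: per-channel list of `width` taps.
    Returns (new_state, y).
    """
    new_state, y = [], []
    for hist, x_c, w_c in zip(state, x, weight):
        window = hist + [x_c]
        y.append(sum(w * h for w, h in zip(w_c, window)))
        new_state.append(window[1:])  # drop the oldest sample
    return new_state, y
```

A reference like this can serve as a correctness oracle on small shapes; it says nothing about how the kernels tile or schedule the work.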

Hardware

These kernels target only NVIDIA GB10 (compute capability 12.1, sm_121f). They will not load on any other GPU. GB10 has:

- Unified LPDDR5X memory (~273 GB/s), so these kernels are bandwidth-bound, not occupancy-bound
- No multi-CTA clusters (`ClusterShape` forced to 1×1×1)
- No `cvt.rn.satfinite.e2m1x2.f32` PTX instruction (a software E2M1 conversion path is used instead)
- Cooperative-only scheduling (no Pingpong)
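
To make the software E2M1 conversion concrete, the sketch below rounds a float to the nearest 4-bit E2M1 code (1 sign bit, 2 exponent bits, 1 mantissa bit), breaking ties toward the even code and saturating at ±6. It is a host-side illustration only. The function names, the NaN handling, and the nibble order used when packing two codes into a byte are assumptions, not taken from the Atlas sources.

```python
import math

# Magnitudes representable in E2M1, indexed by the low 3 bits of the code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def f32_to_e2m1(x):
    """Return the 4-bit E2M1 code for x (round-to-nearest-even, saturating)."""
    if math.isnan(x):
        return 0x7  # assumption: NaN maps to the largest positive value
    sign = 0x8 if math.copysign(1.0, x) < 0 else 0x0
    mag = min(abs(x), 6.0)  # saturate finite overflow and infinities
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_MAGNITUDES[best])
        d_code = abs(mag - E2M1_MAGNITUDES[code])
        # On an exact tie, prefer the code whose mantissa bit is 0 (even).
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best


def pack_e2m1x2(hi, lo):
    """Pack two converted values into one byte (nibble order is an assumption)."""
    return (f32_to_e2m1(hi) << 4) | f32_to_e2m1(lo)
```

For example, `f32_to_e2m1(0.75)` lands on the tie between 0.5 and 1.0 and picks 1.0 (code 2), the even neighbor.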

`build.toml` pins `cuda-capabilities = ["12.1"]`, so the build matrix yields a single SM121 binary; no fallback binaries are produced.
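
Because there is no fallback binary, a caller may want to fail early with a clear message instead of hitting a load error. The guard below is a minimal sketch; `check_capability` and `SUPPORTED_CAPABILITY` are hypothetical names, not part of this package.

```python
SUPPORTED_CAPABILITY = (12, 1)  # GB10 / SM121, per the Hardware section


def check_capability(capability):
    """Raise RuntimeError unless capability is exactly (12, 1)."""
    major, minor = capability
    if (major, minor) != SUPPORTED_CAPABILITY:
        raise RuntimeError(
            f"atlas-gdn kernels require compute capability 12.1, "
            f"got {major}.{minor}"
        )


# Usage with PyTorch, assuming it is installed and a GPU is visible:
#   import torch
#   check_capability(torch.cuda.get_device_capability())
```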

Models tested

| Model | Layers using these kernels |
| --- | --- |
| Qwen/Qwen3.6-27B (dense, hybrid) | 48 GDN layers |
| Qwen/Qwen3.6-35B-A3B (sparse MoE, hybrid) | 30 GDN layers |

Provenance

Sources are extracted from the Atlas inference engine (https://github.com/Avarok-Cybersecurity/atlas, AGPL-3.0). The GDN NVFP4 variant ships with `__launch_bounds__` annotations specific to Qwen3.6 hidden dimensions (`k_dim=128`, `v_dim=128`, 16/32 K/V heads).

License

AGPL-3.0-only.

OS: linux

Arch: aarch64