atlas-nvfp4-moe

Fused NVFP4 MoE dispatch and grouped GEMM kernels for the sparse Qwen3.6-35B-A3B model on NVIDIA GB10 (DGX Spark, SM121).

Ops

| Op | Use |
| --- | --- |
| `quantize_bf16_to_nvfp4` | Activation packing (BF16 → E2M1 + FP8 block scales) |
| `moe_gate_topk_fused` | Fused gate GEMM + top-k routing (NVFP4 gate weights) |
| `moe_topk_softmax` / `_sigmoid` | Stand-alone top-k variants |
| `moe_permute_tokens` | Token → expert permutation |
| `moe_silu_mul` | Gate ⊗ SiLU(up) activation |
| `moe_w4a16_ptrtable_t_k64` | NVFP4 grouped GEMM, transposed-B, K-stride 64 |
| `moe_w4a16_fused_gate_up_t_k64` | Fused gate+up NVFP4 grouped GEMM (centerpiece) |
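
As a rough illustration of how the routing ops relate to the grouped GEMM, here is a pure-Python reference sketch. It is not the CUDA implementation and says nothing about its tiling or memory layout. The function names (`topk_softmax`, `permute_tokens`, `grouped_gemm`), the renormalization of the top-k weights, and the use of a plain list of per-expert matrices in place of a device pointer table are all assumptions made for illustration.

```python
import math


def topk_softmax(logits, k):
    """Softmax over all experts for one token, keep the k largest,
    then renormalize the kept weights to sum to 1 (assumed behavior)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


def permute_tokens(topk_ids, num_experts):
    """Group (token, slot) pairs by expert with a counting sort.

    Returns (perm, offsets). perm[row] is the flattened index token*k + slot,
    and rows offsets[e]:offsets[e + 1] all belong to expert e.
    """
    counts = [0] * num_experts
    for ids in topk_ids:
        for e in ids:
            counts[e] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    cursor = offsets[:-1].copy()
    perm = [0] * offsets[-1]
    for t, ids in enumerate(topk_ids):
        for s, e in enumerate(ids):
            perm[cursor[e]] = t * len(ids) + s
            cursor[e] += 1
    return perm, offsets


def grouped_gemm(x, perm, offsets, k, expert_weights):
    """Multiply each expert's contiguous row range by that expert's own
    weight matrix, stored as [N][K] (the transposed-B convention)."""
    out = [None] * len(perm)
    # Each entry of expert_weights stands in for one pointer-table slot.
    for e, w in enumerate(expert_weights):
        for row in range(offsets[e], offsets[e + 1]):
            token = perm[row] // k
            out[row] = [sum(a * b for a, b in zip(x[token], w_row)) for w_row in w]
    return out
```

Experts that receive no tokens have an empty row range and cost nothing in the loop, which is the property that makes per-expert indirection attractive when expert popularity is uneven.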

Hardware

GB10 only (sm_121f, compute capability 12.1).

- E2M1 conversion is done in software, because SM121 lacks the `cvt.rn.satfinite.e2m1x2.f32` instruction.
- Ptrtable indirection: one device pointer per expert lets the grouped GEMM batch experts that receive different numbers of tokens without materializing one large padded B tensor.
- The `k64` suffix refers to the K=64 tile, tuned for Qwen3.6's hidden_dim=2048 and moe_intermediate=512.
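
To make the software conversion path more concrete, the following is a minimal Python sketch of saturating, round-to-nearest-even E2M1 encoding with one scale per block. It is a reference model rather than the kernel's code. The 16-element block follows the NVFP4 format, but the scale is kept as a plain Python float instead of being rounded to FP8, and the low-nibble-first packing order and all function names are assumptions.

```python
# Magnitudes representable by the 3 low bits of an E2M1 code (bit 3 is the sign).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements sharing one scale


def e2m1_encode(x):
    """Return the 4-bit E2M1 code nearest to x, saturating at +/-6."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)  # saturate instead of overflowing
    best = 0
    for code in range(1, 8):
        d_new = abs(mag - E2M1_VALUES[code])
        d_best = abs(mag - E2M1_VALUES[best])
        # On an exact tie prefer the even code (mantissa bit 0).
        if d_new < d_best or (d_new == d_best and code % 2 == 0):
            best = code
    return sign | best


def quantize_block(values):
    """Quantize one block: scale so the largest magnitude maps to 6.0,
    then pack two 4-bit codes per byte (low nibble first, assumed)."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = [e2m1_encode(v / scale) for v in values]
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale


def dequantize_block(packed, scale):
    """Inverse of quantize_block, up to rounding error."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 0x7]
            out.append((-mag if code & 0x8 else mag) * scale)
    return out
```

A block whose values already sit on the E2M1 grid after scaling round-trips exactly; anything else lands on the nearest representable value.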

Model tested

| Model | Experts | Top-k | Hidden | Inter |
| --- | --- | --- | --- | --- |
| Qwen/Qwen3.6-35B-A3B | 256 | 8 | 2048 | 512 |
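
For a sense of scale, the shapes in the table translate into the following back-of-the-envelope expert-weight footprint for one MoE layer. This is only an estimate under stated assumptions: each expert is taken to hold gate, up, and down projections of size Hidden × Inter, all three stored as NVFP4 with one FP8 scale byte per 16 weights. Per-tensor global scales, the router, and any padding are ignored.

```python
HIDDEN = 2048
INTER = 512
EXPERTS = 256
TOP_K = 8
BLOCK = 16  # assumed NVFP4 scale-block size

# gate, up and down projections (assumed layout), each HIDDEN x INTER
weights_per_expert = 3 * HIDDEN * INTER
packed_bytes = weights_per_expert // 2      # two 4-bit codes per byte
scale_bytes = weights_per_expert // BLOCK   # one FP8 scale per block
bytes_per_expert = packed_bytes + scale_bytes
bytes_per_layer = bytes_per_expert * EXPERTS
active_weights_per_token = TOP_K * weights_per_expert

print(f"per expert: {bytes_per_expert / 2**20:.4f} MiB")
print(f"per layer:  {bytes_per_layer / 2**20:.0f} MiB")
print(f"expert weights read per token per layer: {active_weights_per_token:,}")
```

Under these assumptions each expert occupies about 1.69 MiB and a full layer of 256 experts about 432 MiB, while a single token only touches the 8 selected experts.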

License

AGPL-3.0-only.

OS: linux

Arch: aarch64