atlas-nvfp4-moe

NVFP4 fused MoE dispatch and grouped GEMM kernels for the sparse Qwen3.6-35B-A3B model on NVIDIA GB10 (DGX Spark, SM121).

Ops

Op                             Use
quantize_bf16_to_nvfp4         Activation packing (BF16 → E2M1 + FP8 block scales)
moe_gate_topk_fused            Fused gate GEMM + top-k routing (NVFP4 gate weights)
moe_topk_softmax / _sigmoid    Stand-alone top-k variants
moe_permute_tokens             Token → expert permutation
moe_silu_mul                   Gate ⊗ SiLU(up) activation
moe_w4a16_ptrtable_t_k64       NVFP4 grouped GEMM, transposed-B, K-stride 64
moe_w4a16_fused_gate_up_t_k64  Fused gate+up NVFP4 grouped GEMM (centerpiece)
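
To make the routing and permutation steps concrete, here is a minimal pure-Python reference sketch of what a top-k softmax router and a token → expert permutation compute. The function names `topk_softmax` and `permute_tokens` are hypothetical stand-ins, not this package's API, and renormalizing the kept weights so they sum to 1 is an assumption about the routing convention.

```python
import math

def topk_softmax(logits, k):
    """Softmax over all experts, keep the k largest, renormalize (assumed)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # sorted() is stable, so ties resolve to the lower expert index.
    top = sorted(range(len(probs)), key=lambda e: -probs[e])[:k]
    norm = sum(probs[e] for e in top)
    return top, [probs[e] / norm for e in top]

def permute_tokens(topk_ids, num_experts):
    """Counting sort of (token, slot) pairs by expert id.

    Returns (perm, offsets). perm[j] is the flat index token * k + slot of
    the j-th row in expert-sorted order; rows offsets[e]:offsets[e + 1]
    belong to expert e.
    """
    k = len(topk_ids[0])
    counts = [0] * num_experts
    for ids in topk_ids:
        for e in ids:
            counts[e] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    cursor = offsets[:-1].copy()  # next free row for each expert
    perm = [0] * (len(topk_ids) * k)
    for t, ids in enumerate(topk_ids):
        for s, e in enumerate(ids):
            perm[cursor[e]] = t * k + s
            cursor[e] += 1
    return perm, offsets

# Three tokens, four experts, top-2 routing.
logits = [[2.0, 0.0, 1.0, -1.0], [0.0, 3.0, 0.5, 1.0], [1.0, 0.2, 2.5, 0.1]]
topk_ids = [topk_softmax(row, 2)[0] for row in logits]
perm, offsets = permute_tokens(topk_ids, num_experts=4)
```

After the permutation, each expert's rows are contiguous, which is the layout a grouped GEMM consumes: one variable-height slice of activations per expert.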

Hardware

GB10 only (sm_121f, compute capability 12.1).

  • Software E2M1 conversion path (SM121 lacks cvt.rn.satfinite.e2m1x2.f32)
  • Ptrtable indirection: one device pointer per expert lets us batch experts that receive different numbers of tokens without materializing one large stacked B tensor.
  • The k64 suffix denotes a K tile of 64, tuned for Qwen3.6's hidden_dim=2048 and moe_intermediate=512.
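
As an illustration of what a software E2M1 conversion has to do, the sketch below rounds a value to the nearest E2M1 code (ties to the even code, saturating at ±6) and packs two 4-bit codes per byte with one scale per block. It is a Python reference model, not the kernel's code. The names are hypothetical, the low-nibble-first packing order is an assumption, and the block scale is kept as a plain float here rather than being rounded to FP8.

```python
# The eight non-negative E2M1 magnitudes, indexed by their 3-bit code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def encode_e2m1(x):
    """Return the 4-bit E2M1 code for x (bit 3 is the sign)."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)  # saturate instead of overflowing
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_VALUES[best])
        d_code = abs(mag - E2M1_VALUES[code])
        # On an exact tie, prefer the code with an even mantissa bit.
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best

def quantize_block(values):
    """Quantize an even-length block; returns (packed bytes, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the block max onto 6
    codes = [encode_e2m1(v / scale) for v in values]
    # Assumed layout: even element in the low nibble, odd in the high.
    packed = bytes(codes[i] | (codes[i + 1] << 4)
                   for i in range(0, len(codes), 2))
    return packed, scale

def dequantize_block(packed, scale):
    """Inverse of quantize_block, up to rounding error."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 0x7]
            out.append((-mag if code & 0x8 else mag) * scale)
    return out

packed, scale = quantize_block([6.0, -3.0, 1.0, 0.0])
restored = dequantize_block(packed, scale)
```

A GPU implementation would typically replace the search loop with comparisons against the seven midpoints between adjacent magnitudes, but the rounding result should be the same.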

Model tested

Model                 Experts  Top-k  Hidden  Intermediate
Qwen/Qwen3.6-35B-A3B  256      8      2048    512
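
The shapes above determine the tile counts and weight footprint the kernels work with. The arithmetic below is a back-of-the-envelope sketch: the 16-element block size with one 1-byte scale per block is an assumption about the NVFP4 layout, and the down projection of shape hidden × intermediate is assumed from the usual gated-MLP structure rather than stated in this README.

```python
HIDDEN = 2048   # hidden_dim
INTER = 512     # moe_intermediate
EXPERTS = 256
TOP_K = 8
K_TILE = 64     # the k64 suffix
BLOCK = 16      # assumed NVFP4 block size (one 1-byte scale per block)

# K-loop iterations per output tile.
k_tiles_gate_up = HIDDEN // K_TILE   # gate/up GEMMs reduce over hidden
k_tiles_down = INTER // K_TILE       # down GEMM reduces over intermediate

# Fusing gate and up concatenates their output columns.
fused_n = 2 * INTER

def nvfp4_bytes(rows, cols):
    """4-bit payload plus one scale byte per BLOCK elements."""
    elems = rows * cols
    return elems // 2 + elems // BLOCK

per_expert = nvfp4_bytes(fused_n, HIDDEN) + nvfp4_bytes(HIDDEN, INTER)
per_layer = EXPERTS * per_expert   # all experts resident
per_token = TOP_K * per_expert     # weights actually read per token

print(k_tiles_gate_up, k_tiles_down, fused_n)
print(per_expert, per_layer, per_token)
```

Under these assumptions each expert occupies 1,769,472 bytes, so a token routed to 8 experts touches about 13.5 MiB of expert weights per layer out of 432 MiB resident.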

License

AGPL-3.0-only.

OS: linux
Arch: aarch64