atlas-nvfp4-moe
NVFP4 fused MoE dispatch and grouped GEMM kernels for the sparse Qwen3.6-35B-A3B model on NVIDIA GB10 (DGX Spark, SM121).
Ops
| Op | Use |
|---|---|
| `quantize_bf16_to_nvfp4` | Activation packing (BF16 → E2M1 + FP8 block scales) |
| `moe_gate_topk_fused` | Fused gate GEMM + top-k routing (NVFP4 gate weights) |
| `moe_topk_softmax` / `_sigmoid` | Stand-alone top-k variants |
| `moe_permute_tokens` | Token → expert permutation |
| `moe_silu_mul` | Gate ⊗ SiLU(up) activation |
| `moe_w4a16_ptrtable_t_k64` | NVFP4 grouped GEMM, transposed-B, K-stride 64 |
| `moe_w4a16_fused_gate_up_t_k64` | Fused gate+up NVFP4 grouped GEMM (centerpiece) |
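
To make the "BF16 → E2M1 + FP8 block scales" packing concrete, here is a minimal pure-Python reference sketch of block quantization to 4-bit E2M1 codes. It is not the kernel's implementation. The function names, the 16-element block size, the low-nibble-first packing order, and keeping the scale as a Python float (a real kernel would round it to FP8) are all assumptions made for illustration.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed: one scale per 16 elements


def encode_e2m1(x):
    """Return the 4-bit code (bit 3 = sign, bits 0-2 = magnitude) nearest to x.

    Values beyond +/-6 saturate; exact ties go to the code with an even
    mantissa bit (round-to-nearest-even).
    """
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)
    code = min(range(8), key=lambda c: (abs(E2M1_VALUES[c] - mag), c & 1))
    return sign | code


def quantize_block(block):
    """Quantize one block to (packed bytes, scale); two codes per byte."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the block maximum onto 6.0
    codes = [encode_e2m1(v / scale) for v in block]
    # Assumed packing: even element in the low nibble, odd element in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale


def dequantize_block(packed, scale):
    """Inverse of quantize_block, up to quantization error."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 0x7]
            out.append((-mag if code & 0x8 else mag) * scale)
    return out
```

A block whose values already sit on the E2M1 grid round-trips exactly; anything else is rounded to the nearest representable value after scaling.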
Hardware
GB10 only (sm_121f, compute capability 12.1).
- Software E2M1 conversion path (SM121 lacks `cvt.rn.satfinite.e2m1x2.f32`).
- Ptrtable indirection: one device pointer per expert lets us batch experts of different popularity without a giant fictional B tensor.
- The K=64 tile size (the `_k64` suffix in the op names) is tuned for Qwen3.6's hidden_dim=2048 and moe_intermediate=512.
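
The pointer-table idea in the list above can be sketched in plain Python. This is a hedged reference model, not the CUDA kernel: `grouped_gemm_ptrtable`, `expert_offsets`, and `b_ptrs` are hypothetical names, weights are plain floats rather than NVFP4, and reading `_k64` as a loop over the K dimension in chunks of 64 is an assumption.

```python
K_TILE = 64  # mirrors the _k64 suffix; assumed to be the K-dimension tile


def grouped_gemm_ptrtable(a_rows, expert_offsets, b_ptrs):
    """Reference grouped GEMM over per-expert weight "pointers".

    a_rows:         activations already permuted so rows are grouped by expert.
    expert_offsets: num_experts + 1 entries; expert e owns rows
                    [expert_offsets[e], expert_offsets[e + 1]).
    b_ptrs:         one entry per expert, each a transposed weight matrix
                    stored as N rows of length K (the "transposed-B" layout).
    """
    out = []
    for e, b_t in enumerate(b_ptrs):
        # An expert with no routed tokens has an empty range and costs nothing.
        for r in range(expert_offsets[e], expert_offsets[e + 1]):
            a = a_rows[r]
            row = []
            for w in b_t:  # one output column, contiguous along K
                acc = 0.0
                for k0 in range(0, len(a), K_TILE):
                    a_tile = a[k0:k0 + K_TILE]
                    w_tile = w[k0:k0 + K_TILE]
                    acc += sum(x * y for x, y in zip(a_tile, w_tile))
                row.append(acc)
            out.append(row)
    return out
```

Because each expert is reached through its own table entry, experts with very different token counts share one launch without padding every expert to the same row count. Both 2048 and 512 are multiples of 64 (32 and 8 tiles respectively), so a K tile of 64 divides either dimension evenly.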
Model tested
| Model | Experts | Top-k | Hidden | Inter |
|---|---|---|---|---|
| Qwen/Qwen3.6-35B-A3B | 256 | 8 | 2048 | 512 |
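
With 256 experts and top-k 8, each token activates 8/256 = 1/32 of the experts. The sketch below shows, under stated assumptions, what softmax top-k routing followed by a token → expert permutation computes for this configuration. The function names are hypothetical, renormalizing the selected weights to sum to 1 is an assumption, and the counting-sort layout is one common way to group tokens rather than a description of the kernels here.

```python
import math

NUM_EXPERTS, TOP_K = 256, 8  # values from the table above


def topk_softmax(logits, k=TOP_K):
    """Softmax over all experts, keep the k largest, renormalize (assumed)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def permute_tokens(routes, num_experts=NUM_EXPERTS):
    """Group (token, slot) pairs by expert with a counting sort.

    routes[t] is the list of (expert, weight) pairs chosen for token t.
    Returns (order, offsets), where order[offsets[e]:offsets[e + 1]] holds
    the (token, slot) pairs routed to expert e.
    """
    counts = [0] * num_experts
    for route in routes:
        for expert, _ in route:
            counts[expert] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    order = [None] * offsets[-1]
    for token, route in enumerate(routes):
        for slot, (expert, _) in enumerate(route):
            order[cursor[expert]] = (token, slot)
            cursor[expert] += 1
    return order, offsets
```

The `offsets` array produced here has the same shape a grouped GEMM needs to find each expert's contiguous row range.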
License
AGPL-3.0-only.
- OS: linux
- Arch: aarch64