DeepSeek-V3.2-CPU-NUMA4-AMXINT8
deepseek-ai/DeepSeek-V3.2 quantized to the AMXINT8 format for CPU inference with sglang + ktransformers, packed specifically for machines with 4 NUMA nodes.
To run, please ensure that your CPU supports the AMX instruction set (Intel Xeon, Sapphire Rapids or newer) and make note of your NUMA node count. Install kt-kernel and sglang-kt following the official documentation.
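Before downloading the NUMA4 pack, it can help to confirm the host actually matches it. The sketch below is not part of the official docs; it assumes a Linux host that exposes /proc/cpuinfo and /sys/devices/system/node, where AMX support appears as the amx_tile / amx_int8 / amx_bf16 CPU flags.

```python
# Sketch (assumption: Linux host): check AMX support, count physical cores,
# and count NUMA nodes before picking this NUMA4 weight pack.
import os
import re

def has_amx_int8(cpuinfo_text: str) -> bool:
    # AMX shows up as the amx_tile / amx_int8 / amx_bf16 flags in /proc/cpuinfo
    return "amx_int8" in cpuinfo_text

def physical_core_count(cpuinfo_text: str) -> int:
    # Unique (physical id, core id) pairs = physical cores; the total across
    # all NUMA nodes is a candidate value for --kt-cpuinfer (see notes below)
    cores, phys = set(), None
    for line in cpuinfo_text.splitlines():
        if line.startswith("physical id"):
            phys = line.split(":")[1].strip()
        elif line.startswith("core id"):
            cores.add((phys, line.split(":")[1].strip()))
    return len(cores)

def numa_node_count() -> int:
    # Each NUMA node appears as /sys/devices/system/node/nodeN
    return sum(1 for d in os.listdir("/sys/devices/system/node")
               if re.fullmatch(r"node\d+", d))

# Usage on the target machine:
#   text = open("/proc/cpuinfo").read()
#   print(has_amx_int8(text), physical_core_count(text), numa_node_count())
```

The same information is available from `lscpu` and `numactl --hardware`; the functions above just make the checks scriptable.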
Then, download the official weights of deepseek-ai/DeepSeek-V3.2 in FP8, as well as this CPU-optimized quantized model, and prepare your launch command:
```shell
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
python -m sglang.launch_server \
  --model /path/to/DeepSeek-V3.2 \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/DeepSeek-V3.2-CPU-NUMA4-AMXINT8 \
  --kt-cpuinfer 128 \
  --kt-threadpool-count 4 \
  --kt-num-gpu-experts 16 \
  --kt-max-deferred-experts-per-token 0 \
  --kt-expert-placement-strategy uniform \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --served-model-name deepseek-ai/DeepSeek-V3.2 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 4096 \
  --context-length 131072 \
  --max-total-tokens 131072 \
  --max-running-requests 1 \
  --attention-backend flashinfer \
  --fp8-gemm-backend cutlass \
  --kv-cache-dtype bf16 \
  --reasoning-parser deepseek-v3 \
  --tool-call-parser deepseekv32
```
Notes:
- DSA (DeepSeek Sparse Attention) is not currently supported on non-enterprise GPU architectures, so attention will fall back to standard MLA with the specified `--attention-backend`.
- `--kt-cpuinfer` should be set to the total number of physical CPU cores across all NUMA nodes.
- `--tensor-parallel-size` should be set to the number of GPUs.
- The optimal choices for `--attention-backend` and `--fp8-gemm-backend` depend on the CUDA architecture of your GPUs; please check the sglang documentation.
- `--kt-num-gpu-experts`, `--mem-fraction-static`, `--chunked-prefill-size`, `--context-length`, `--max-total-tokens`, and `--max-running-requests` should be adjusted depending on the constraints of your hardware.
- Please review the official kt-kernel documentation for details.
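Once the server is up, it can be exercised through its OpenAI-compatible endpoint. The sketch below assumes sglang's default port 30000 on localhost; adjust the URL if you pass `--port` or serve remotely.

```python
# Sketch: query the launched server via the OpenAI-compatible
# /v1/chat/completions endpoint (URL is an assumption: sglang's default port).
import json
import urllib.request

URL = "http://localhost:30000/v1/chat/completions"

def build_request(prompt: str, url: str = URL) -> urllib.request.Request:
    payload = {
        "model": "deepseek-ai/DeepSeek-V3.2",  # matches --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("Hello")  # requires the server launched with the command above
```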
Model tree for CPU-Hybrid-MoE/DeepSeek-V3.2-CPU-NUMA4-AMXINT8:
- Base model: deepseek-ai/DeepSeek-V3.2-Exp-Base
- Finetuned: deepseek-ai/DeepSeek-V3.2