Qwen3.5-122B-A10B APEX GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of Qwen3.5-122B-A10B.

Brought to you by the LocalAI team | APEX Project | Technical Report

Benchmark Results

All measurements were taken on 8x RTX PRO 6000 Blackwell GPUs (768 GB VRAM total). Perplexity was measured on wikitext-2-raw at context 512; accuracy benchmarks were run via llama.cpp (400 tasks each).
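For context on the table's metrics: perplexity is the exponential of the mean negative log-likelihood over the eval tokens, and "KL mean" is the per-token KL divergence averaged over positions, presumably against the full-precision model's output distribution. A minimal sketch with made-up numbers (the KL direction, reference relative to quantized, is an assumption here):

```python
import numpy as np

# Hypothetical per-token data: log-probs of the true next token under the
# quantized model, plus full next-token distributions for reference vs. quant.
logprobs = np.array([-1.2, -0.7, -2.1])   # quantized model, true tokens
p_ref    = np.array([[0.7, 0.2, 0.1]])    # reference (e.g. BF16) distribution
p_quant  = np.array([[0.6, 0.3, 0.1]])    # quantized model distribution

# Perplexity: exp of the mean negative log-likelihood over the eval set.
ppl = np.exp(-logprobs.mean())

# Mean KL divergence of the reference distribution from the quantized one,
# averaged over token positions.
kl_mean = (p_ref * np.log(p_ref / p_quant)).sum(axis=-1).mean()

print(f"ppl={ppl:.3f}  kl_mean={kl_mean:.4f}")
```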

| Configuration | Size (GB) | Perplexity | KL mean | HellaSwag | Winogrande | MMLU | ARC | tg128 (t/s) |
|---|---|---|---|---|---|---|---|---|
| Q8_0 (Unsloth) | 121 | 4.819 | 0.004 | 85.5% | 77.3% | 44.19 | 57.19 | 85.5 |
| Q5_K_S (Unsloth) | ~81 | 4.826 | 0.007 | 85.3% | 76.0% | 43.80 | 57.86 | 90.4 |
| UD-Q4_K_XL (Unsloth) | ~72 | 4.829 | 0.010 | 84.8% | 76.3% | 44.25 | 55.85 | 91.8 |
| APEX I-Balanced | 83.4 | 4.831 | 0.008 | 85.5% | 77.8% | 43.86 | 57.86 | 96.7 |
| APEX I-Quality | 72.3 | 4.838 | 0.012 | 85.3% | 77.3% | 43.86 | 56.86 | 99.7 |
| APEX Quality | 72.3 | 4.848 | 0.013 | 85.5% | 76.3% | 44.44 | 55.52 | 99.8 |
| APEX Balanced | 83.4 | 4.840 | 0.008 | 85.0% | 76.3% | 43.93 | 56.86 | 96.7 |
| APEX I-Compact | 55.1 | 4.978 | 0.041 | 84.5% | 77.5% | 44.06 | 57.86 | 106.3 |
| APEX Compact | 55.1 | 5.046 | 0.049 | 84.5% | 77.8% | 43.54 | 56.19 | 106.2 |
| APEX I-Mini | 44.9 | 5.306 | 0.102 | 84.0% | 75.3% | 42.83 | 56.52 | 110.0 |

Highlights

  • APEX I-Balanced matches or beats Q8_0 on HellaSwag (85.5%), Winogrande (77.8% vs 77.3%), and ARC (57.86 vs 57.19) while being 31% smaller and 13% faster.
  • APEX I-Quality (72.3 GB) beats UD-Q4_K_XL at the same size on HellaSwag (85.3% vs 84.8%), Winogrande (77.3% vs 76.3%), and ARC (56.86 vs 55.85).
  • APEX I-Compact (55.1 GB) scores 84.5% on HellaSwag and 57.86 on ARC at 55% less size than Q8_0, and is the fastest profile short of the Mini tier at 106.3 t/s.
  • APEX I-Mini (44.9 GB) is the smallest profile, 63% smaller than Q8_0 and the fastest at 110 t/s, while still scoring 84.0% on HellaSwag.
  • I-variants match or improve on their standard counterparts in perplexity, KL, and ARC at every size tier.
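The size and speed deltas quoted above follow directly from the benchmark table; a quick check of the I-Balanced vs Q8_0 claim:

```python
# Figures for APEX I-Balanced vs. Q8_0, taken from the table above.
q8_size, apex_size = 121, 83.4   # GB
q8_tps,  apex_tps  = 85.5, 96.7  # tg128 tokens/s

print(f"{1 - apex_size / q8_size:.1%} smaller")  # -> 31.1% smaller
print(f"{apex_tps / q8_tps - 1:.1%} faster")     # -> 13.1% faster
```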

Available Files

| File | Profile | Size | Best For |
|---|---|---|---|
| Qwen3.5-122B-A10B-APEX-I-Balanced.gguf | I-Balanced | 83.4 GB | Best overall -- matches Q8_0 quality at 31% less size |
| Qwen3.5-122B-A10B-APEX-I-Quality.gguf | I-Quality | 72.3 GB | Best quality at the ~72 GB tier |
| Qwen3.5-122B-A10B-APEX-Quality.gguf | Quality | 72.3 GB | Highest MMLU (44.44) |
| Qwen3.5-122B-A10B-APEX-Balanced.gguf | Balanced | 83.4 GB | General purpose, low KL |
| Qwen3.5-122B-A10B-APEX-I-Compact.gguf | I-Compact | 55.1 GB | Consumer multi-GPU, best quality/size ratio |
| Qwen3.5-122B-A10B-APEX-Compact.gguf | Compact | 55.1 GB | Consumer multi-GPU setups |
| Qwen3.5-122B-A10B-APEX-I-Mini.gguf | I-Mini | 44.9 GB | Smallest viable, fastest inference |
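To fetch a single profile without cloning the whole repository, the standard huggingface_hub client should work (filename taken from the table above):

```python
from huggingface_hub import hf_hub_download

# Download one APEX profile from the repo; returns the local file path.
path = hf_hub_download(
    repo_id="mudler/Qwen3.5-122B-A10B-APEX-GGUF",
    filename="Qwen3.5-122B-A10B-APEX-I-Balanced.gguf",
)
print(path)
```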

What is APEX?

APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient -- edge layers get higher precision, middle layers get more aggressive compression. I-variants additionally use an importance matrix (imatrix) computed from a diverse calibration mix (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).
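As a mental model, the assignment can be pictured as a lookup over (tensor role, layer position). The sketch below is illustrative only: the quant-type table and edge width are placeholders, not the actual APEX profiles.

```python
# Illustrative role- plus position-based precision assignment.
# Quant type names follow llama.cpp conventions; the mapping itself is
# a placeholder, not the real APEX profile table.
def assign_quant(role: str, layer: int, n_layers: int = 48, edge: int = 5) -> str:
    in_edge = layer < edge or layer >= n_layers - edge  # symmetric edge band
    if role == "attention":
        return "Q6_K" if in_edge else "Q5_K"
    if role == "shared_expert":    # always active, so kept at higher precision
        return "Q8_0"
    if role == "routed_expert":    # bulk of the parameters, compressed hardest
        return "Q5_K" if in_edge else "Q3_K"
    return "Q8_0"                  # embeddings, norms, output head, etc.

for layer in (0, 4, 24, 47):
    print(layer, assign_quant("routed_expert", layer))
```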

See the APEX project for full details, technical report, and scripts.

Architecture

  • Model: Qwen3.5-122B-A10B (Qwen3.5-MoE)
  • Layers: 48
  • Experts: 256 routed + 1 shared (8 active per token)
  • Total Parameters: 122B
  • Active Parameters: ~10B per token
  • APEX Config: 5+5 symmetric edge gradient across 48 layers
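The active-parameter figure follows from the routing: each token runs the shared expert plus its top 8 of 256 routed experts, so only a small slice of the 122B weights is touched per step. A toy top-k router for one token (the post-top-k softmax normalization is an assumption, not necessarily what Qwen3.5-MoE does):

```python
import numpy as np

N_EXPERTS, TOP_K = 256, 8

rng = np.random.default_rng(0)
router_logits = rng.normal(size=N_EXPERTS)  # one token's gating logits

# Pick the top-8 experts, then softmax only over those.
top = np.argpartition(router_logits, -TOP_K)[-TOP_K:]
weights = np.exp(router_logits[top] - router_logits[top].max())
weights /= weights.sum()

print("active experts:", sorted(top.tolist()))
print("gate weights sum:", weights.sum())  # 1.0
```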

Run with LocalAI

```bash
local-ai run mudler/Qwen3.5-122B-A10B-APEX-GGUF@Qwen3.5-122B-A10B-APEX-I-Balanced.gguf
```
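LocalAI then serves an OpenAI-compatible API, so a plain chat-completions request works; the host, port, and model name below are assumptions to adjust for your setup:

```python
import json
import urllib.request

# Assumes LocalAI is serving on its default address; adjust host, port,
# and model name to match your deployment.
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "model": "Qwen3.5-122B-A10B-APEX-I-Balanced.gguf",
        "messages": [{"role": "user", "content": "Hello!"}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```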

Credits

APEX is brought to you by the LocalAI team. Developed through human-driven, AI-assisted research. Built on llama.cpp.
