# DeepSeek-V4-Flash — JANGTQ (MLX, 2-bit MXTQ TurboQuant)
Premium 2-bit TurboQuant (MXTQ codec) quantization with per-importance bit allocation. 79 GB bundle. Decodes at 25.9 tok/s on a Mac Studio M3 Ultra.
## Model Details
| Property | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V4-Flash |
| Parameters | 671 B total, 37 B active per token (6 of 256 routed experts + shared) |
| Architecture | DeepseekV4 — MLA + multi-head causal residual + Compressor/Indexer long-ctx |
| Codec | TurboQuant MXTQ (Lloyd-Max codebook + Hadamard rotation) |
| Quantization plan | Per-importance: hash-routed L0-L2 at 4-bit MXTQ, smooth-routed L3-L42 at 2-bit MXTQ, non-routed at 8-bit affine gs=32 |
| Runtime | jang_tools.load_jangtq + mlx_lm.generate |
| Bundle size | 79 GB |
| Decode | 25.91 tok/s sustained on Mac Studio M3 Ultra (200-token greedy) |
| MMLU 200q (Non-Think, LC=1, max=300) | 83.50% |
| MMLU 200q (mixed-policy: Non-Think + Think on wrongs) | 91.50% |
| MMLU 200q (legacy LC=0, broken extractor — pre-2026-05-01) | 69.50% (superseded) |
## Recipe
| Tensor class | Bits | Codec |
|---|---|---|
| Routed experts (hash-routed L0-L2) | 4-bit | MXTQ codebook |
| Routed experts (smooth-routed L3-L42) | 2-bit | MXTQ codebook |
| Attention (wq_a / wq_b / wkv / wo_a / wo_b) | 8-bit | affine gs=32 |
| Shared experts | 8-bit | affine gs=32 |
| Compressor + Indexer (long-ctx) | 8-bit | affine gs=32 |
| embed_tokens, lm_head | 8-bit | affine gs=32 |
| Norms / router gate / mHC | fp16 | passthrough |
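The recipe boils down to a per-tensor decision on bits and codec. Below is a minimal sketch of that decision as code, assuming common tensor-naming conventions; the helper name and name checks are illustrative, not the jang_tools schema.

```python
def quant_spec(name: str, layer: int) -> dict:
    """Map a tensor to a (bits, codec) spec following the per-importance recipe above."""
    if "norm" in name or "gate" in name or "mhc" in name:
        return {"bits": 16, "codec": "passthrough"}           # norms / router gate / mHC stay fp16
    if "experts" in name and "shared" not in name:
        if layer <= 2:                                         # hash-routed L0-L2 keep more precision
            return {"bits": 4, "codec": "mxtq"}
        return {"bits": 2, "codec": "mxtq"}                    # smooth-routed L3-L42 carry the 2-bit savings
    # attention, shared experts, Compressor/Indexer, embed_tokens, lm_head
    return {"bits": 8, "codec": "affine", "group_size": 32}

print(quant_spec("layers.17.mlp.experts.42.w1", 17))   # {'bits': 2, 'codec': 'mxtq'}
print(quant_spec("layers.0.self_attn.wq_a", 0))        # {'bits': 8, 'codec': 'affine', 'group_size': 32}
```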
## Use
```python
import os

os.environ["JANG_WIRED_LIMIT_GB"] = "160"  # Mac Studio M3 Ultra

# Long context (default ON since 2026-05-01 — worth +7 pp MMLU vs the SWA-only fallback):
# os.environ["DSV4_LONG_CTX"] = "1"
# Pool-cache 4-bit quant (default ON when LC=1, saves ~4 GB at 1M ctx, cosine ≥ 0.9967):
# os.environ["DSV4_POOL_QUANT"] = "1"

import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model
from jang_tools.dsv4.runtime import generate, GenerateOptions

model, tok = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ")

# Three modes: chat / think / think_max.
# `chat` is fast, no reasoning. `think` runs a `<think>...</think>` chain-of-thought.
result = generate(
    model, tok, "OsaurusAI/DeepSeek-V4-Flash-JANGTQ",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    opts=GenerateOptions(mode="think", max_tokens=512),
)
print("REASONING:", result.reasoning_content[:200])
print("ANSWER:   ", result.content)
```
## Runtime examples
End-to-end Python example scripts live in `jang-tools/examples/dsv4_flash/`:
| File | Purpose |
|---|---|
| 00_verify.py | Bundle metadata + smoke decode |
| 01_text_only.py | All 3 modes (chat / think / think_max) |
| 02_thinking.py | Reasoning vs content split + leak audit |
| 03_tool_calling.py | DSML tool-call parsing (\|DSML\| markers) |
| 04_long_context.py | Needle-in-haystack over HSA + CSA path |
| 05_streaming_generation.py | Token-streaming reasoning splitter (server pattern) |
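05_streaming_generation.py is described as a token-streaming reasoning splitter; below is a minimal sketch of that server pattern, assuming the model emits literal `<think>...</think>` tags in the stream. The function is illustrative, not the script itself.

```python
def split_stream(chunks):
    """Route streamed text into reasoning (inside <think>...</think>) vs content."""
    reasoning, content = [], []
    in_think, buf = False, ""
    for chunk in chunks:
        buf += chunk                      # buffer so tags split across chunks still match
        while True:
            if not in_think and "<think>" in buf:
                head, buf = buf.split("<think>", 1)
                content.append(head)
                in_think = True
            elif in_think and "</think>" in buf:
                head, buf = buf.split("</think>", 1)
                reasoning.append(head)
                in_think = False
            else:
                break
    (reasoning if in_think else content).append(buf)   # flush the tail
    return "".join(reasoning), "".join(content)

# split_stream(["<th", "ink>plan...", "</think>", "4"]) -> ("plan...", "4")
```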
## Bundle comparison (DeepSeek-V4-Flash family)
MMLU 200q stratified across 40 subjects × 5q. Re-measured 2026-05-01 with the fixed extractor, `DSV4_LONG_CTX=1`, and `max_tokens=300`. Older entries marked _(legacy)_ ran with the broken `next(c for c in out)` extractor, which hit the 'C' in "CORRECT" before the actual answer letter — they understate true accuracy by 5-10 pp.
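For concreteness, here is an illustrative reconstruction of that failure mode (not the exact harness code): the legacy expression returns the first character of the output, while any fix that looks for a standalone A-D letter recovers the real answer.

```python
import re

out = "CORRECT answer is B."           # example output shape containing "CORRECT"

legacy = next(c for c in out)           # -> "C": just the first character
fixed = re.search(r"\b[ABCD]\b", out)   # -> "B": first standalone answer letter
print(legacy, fixed.group(0) if fixed else None)
```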
| Bundle | Size | MMLU 200q (Non-Think) | MMLU 200q (mixed-policy) | Tok/s |
|---|---|---|---|---|
| DeepSeek-V4-Flash-JANGTQ (this) | 79 GB | 83.50% | 91.50% | 25.91 |
| DeepSeek-V4-Flash-JANGTQ2 | 79.6 GB | 70.00% (legacy) | not measured | 22.34 |
| DeepSeek-V4-Flash-JANG_2L | 107 GB | 71.50% (legacy) | not measured | 23.77 |
| mlx-community/DeepSeek-V4-Flash-2bit-DQ | 90 GB | 50.00% (legacy) | not measured | 36.03 |
Mixed-policy = Non-Think first (max=300), then Think-High (max=4096, T=0.6, top_p=0.95) re-asks of the Non-Think wrongs. This matches a real production policy (escalate to Think when confidence is low) and is not a leaderboard hack.
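A minimal sketch of that escalation policy, assuming hypothetical `answer(question, **opts)` and `grade(question, letter)` helpers that wrap the generation call and the MMLU answer key (neither is part of jang_tools):

```python
def mixed_policy_eval(questions, answer, grade):
    """Non-Think first, then Think-High re-asks of the Non-Think wrongs only."""
    results = [answer(q, mode="chat", max_tokens=300) for q in questions]    # pass 1: Non-Think
    for i, q in enumerate(questions):
        if not grade(q, results[i]):                                         # pass 2: escalate the misses
            results[i] = answer(q, mode="think", max_tokens=4096,
                                temperature=0.6, top_p=0.95)
    return sum(grade(q, a) for q, a in zip(questions, results)) / len(questions)
```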
## Runtime knob A/B (M4 Max, 128-token decode)
| Variant | Tok/s | vs baseline |
|---|---|---|
| `DSV4_LONG_CTX=0` (legacy SWA-only) | 19.89 | baseline |
| `DSV4_LONG_CTX=1` + bf16 pool | 19.03 | -4.3 % |
| `DSV4_LONG_CTX=1` + 4-bit pool quant (default) | 19.79 | -0.5 % |
The pool-quant cache (default ON when LC=1) closes the HSA + CSA speed gap to within noise — the trained tri-mode attention is effectively free: +7 pp MMLU for a 0.5 % decode-speed cost.
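To reproduce a row of the table, set the knobs before loading the bundle. The rough timing loop below is an illustration, not the exact harness; whether `DSV4_POOL_QUANT="0"` forces the bf16 pool is an assumption.

```python
import os, time

os.environ["DSV4_LONG_CTX"] = "1"        # "0" for the legacy SWA-only baseline row
# os.environ["DSV4_POOL_QUANT"] = "0"    # assumption: "0" disables pool quant (bf16 pool row)

from jang_tools.load_jangtq import load_jangtq_model
from jang_tools.dsv4.runtime import generate, GenerateOptions

model, tok = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ")
t0 = time.perf_counter()
generate(model, tok, "OsaurusAI/DeepSeek-V4-Flash-JANGTQ",
         messages=[{"role": "user", "content": "Count from 1 to 50."}],
         opts=GenerateOptions(mode="chat", max_tokens=128))
print(f"~{128 / (time.perf_counter() - t0):.1f} tok/s (rough, includes prefill)")
```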
## HumanEval+ pass@1
Coming soon — a comprehensive pass@1 run is in flight.
## Credits
Created by Jinho Jang — eric@jangq.ai
Built on top of DeepSeek-V4-Flash (deepseek-ai).
Distributed via Osaurus AI.