# DeepSeek-V4-Flash — JANGTQ (MLX, 2-bit MXTQ TurboQuant)
Premium 2-bit TurboQuant (MXTQ codec) quantization with per-importance bit allocation. 79 GB bundle. Decodes at 25.9 tok/s on a Mac Studio M3 Ultra.
## Model Details
| Property | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V4-Flash |
| Parameters | 671 B total, 37 B active per token (6 of 256 routed experts + shared) |
| Architecture | DeepseekV4 — MLA + multi-head causal residual + Compressor/Indexer long-ctx |
| Codec | TurboQuant MXTQ (Lloyd-Max codebook + Hadamard rotation) |
| Quantization plan | Per-importance: hash-routed L0-L2 at 4-bit MXTQ, smooth-routed L3-L42 at 2-bit MXTQ, non-routed at 8-bit affine gs=32 |
| Runtime | jang_tools.load_jangtq + mlx_lm.generate |
| Bundle size | 79 GB |
| Decode | 25.91 tok/s sustained on Mac Studio M3 Ultra (200-token greedy) |
| MMLU 200q (Non-Think, LC=1, max=300) | 83.50% |
| MMLU 200q (mixed-policy: Non-Think + Think on wrongs) | 91.50% |
| MMLU 200q (legacy LC=0, broken extractor — pre-2026-05-01) | 69.50% (superseded) |
## Recipe
| Tensor class | Bits | Codec |
|---|---|---|
| Routed experts (hash-routed L0-L2) | 4-bit | MXTQ codebook |
| Routed experts (smooth-routed L3-L42) | 2-bit | MXTQ codebook |
| Attention (wq_a / wq_b / wkv / wo_a / wo_b) | 8-bit | affine gs=32 |
| Shared experts | 8-bit | affine gs=32 |
| Compressor + Indexer (long-ctx) | 8-bit | affine gs=32 |
| embed_tokens, lm_head | 8-bit | affine gs=32 |
| Norms / router gate / mHC | fp16 | passthrough |
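The recipe boils down to a per-tensor decision on bits and codec. Below is a minimal sketch of that decision as code, assuming common tensor-naming conventions; the helper name and name checks are illustrative, not the jang_tools schema.

```python
def quant_spec(name: str, layer: int) -> dict:
    """Map a tensor to a (bits, codec) spec following the per-importance recipe above."""
    if "norm" in name or "gate" in name or "mhc" in name:
        return {"bits": 16, "codec": "passthrough"}           # norms / router gate / mHC stay fp16
    if "experts" in name and "shared" not in name:
        if layer <= 2:                                         # hash-routed L0-L2 keep more precision
            return {"bits": 4, "codec": "mxtq"}
        return {"bits": 2, "codec": "mxtq"}                    # smooth-routed L3-L42 carry the 2-bit savings
    # attention, shared experts, Compressor/Indexer, embed_tokens, lm_head
    return {"bits": 8, "codec": "affine", "group_size": 32}

print(quant_spec("layers.17.mlp.experts.42.w1", 17))   # {'bits': 2, 'codec': 'mxtq'}
print(quant_spec("layers.0.self_attn.wq_a", 0))        # {'bits': 8, 'codec': 'affine', 'group_size': 32}
```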
## Use
```python
import os

os.environ["JANG_WIRED_LIMIT_GB"] = "160"  # Mac Studio M3 Ultra

# Long context (default ON since 2026-05-01 — worth +7 pp MMLU vs the SWA-only fallback):
# os.environ["DSV4_LONG_CTX"] = "1"
# Pool-cache 4-bit quant (default ON when LC=1, saves ~4 GB at 1M ctx, cosine ≥ 0.9967):
# os.environ["DSV4_POOL_QUANT"] = "1"

import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model
from jang_tools.dsv4.runtime import generate, GenerateOptions

model, tok = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ")

# Three modes: chat / think / think_max.
# `chat` is fast, no reasoning. `think` runs a `<think>...</think>` chain-of-thought.
result = generate(
    model, tok, "OsaurusAI/DeepSeek-V4-Flash-JANGTQ",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    opts=GenerateOptions(mode="think", max_tokens=512),
)
print("REASONING:", result.reasoning_content[:200])
print("ANSWER:   ", result.content)
```
## Runtime examples
End-to-end Python example scripts live in `jang-tools/examples/dsv4_flash/`:
| File | Purpose |
|---|---|
| 00_verify.py | Bundle metadata + smoke decode |
| 01_text_only.py | All 3 modes (chat / think / think_max) |
| 02_thinking.py | Reasoning vs content split + leak audit |
| 03_tool_calling.py | DSML tool-call parsing (\|DSML\| markers) |
| 04_long_context.py | Needle-in-haystack over HSA + CSA path |
| 05_streaming_generation.py | Token-streaming reasoning splitter (server pattern) |
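05_streaming_generation.py is described as a token-streaming reasoning splitter; below is a minimal sketch of that server pattern, assuming the model emits literal `<think>...</think>` tags in the stream. The function is illustrative, not the script itself.

```python
def split_stream(chunks):
    """Route streamed text into reasoning (inside <think>...</think>) vs content."""
    reasoning, content = [], []
    in_think, buf = False, ""
    for chunk in chunks:
        buf += chunk                      # buffer so tags split across chunks still match
        while True:
            if not in_think and "<think>" in buf:
                head, buf = buf.split("<think>", 1)
                content.append(head)
                in_think = True
            elif in_think and "</think>" in buf:
                head, buf = buf.split("</think>", 1)
                reasoning.append(head)
                in_think = False
            else:
                break
    (reasoning if in_think else content).append(buf)   # flush the tail
    return "".join(reasoning), "".join(content)

# split_stream(["<th", "ink>plan...", "</think>", "4"]) -> ("plan...", "4")
```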
## Bundle comparison (DeepSeek-V4-Flash family)
MMLU 200q stratified across 40 subjects × 5q. Re-measured 2026-05-01 with the fixed extractor, `DSV4_LONG_CTX=1`, and `max_tokens=300`. Older entries marked _(legacy)_ ran with the broken `next(c for c in out)` extractor, which hit the 'C' in "CORRECT" before the actual answer letter — they understate true accuracy by 5-10 pp.
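For concreteness, here is an illustrative reconstruction of that failure mode (not the exact harness code): the legacy expression returns the first character of the output, while any fix that looks for a standalone A-D letter recovers the real answer.

```python
import re

out = "CORRECT answer is B."           # example output shape containing "CORRECT"

legacy = next(c for c in out)           # -> "C": just the first character
fixed = re.search(r"\b[ABCD]\b", out)   # -> "B": first standalone answer letter
print(legacy, fixed.group(0) if fixed else None)
```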
| Bundle | Size | MMLU 200q (Non-Think) | MMLU 200q (mixed-policy) | Tok/s |
|---|---|---|---|---|
| DeepSeek-V4-Flash-JANGTQ (this) | 79 GB | 83.50% | 91.50% | 25.91 |
| DeepSeek-V4-Flash-JANGTQ2 | 79.6 GB | 70.00% (legacy) | not measured | 22.34 |
| DeepSeek-V4-Flash-JANG_2L | 107 GB | 71.50% (legacy) | not measured | 23.77 |
| mlx-community/DeepSeek-V4-Flash-2bit-DQ | 90 GB | 50.00% (legacy) | not measured | 36.03 |
Mixed-policy = Non-Think first (max=300), then Think-High (max=4096, T=0.6, top_p=0.95) re-asks of the Non-Think wrongs. This matches a real production policy (escalate to Think when confidence is low) and is not a leaderboard hack.
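A minimal sketch of that escalation policy, assuming hypothetical `answer(question, **opts)` and `grade(question, letter)` helpers that wrap the generation call and the MMLU answer key (neither is part of jang_tools):

```python
def mixed_policy_eval(questions, answer, grade):
    """Non-Think first, then Think-High re-asks of the Non-Think wrongs only."""
    results = [answer(q, mode="chat", max_tokens=300) for q in questions]    # pass 1: Non-Think
    for i, q in enumerate(questions):
        if not grade(q, results[i]):                                         # pass 2: escalate the misses
            results[i] = answer(q, mode="think", max_tokens=4096,
                                temperature=0.6, top_p=0.95)
    return sum(grade(q, a) for q, a in zip(questions, results)) / len(questions)
```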
## Runtime knob A/B (M4 Max, 128-token decode)
| Variant | Tok/s | vs baseline |
|---|---|---|
| `DSV4_LONG_CTX=0` (legacy SWA-only) | 19.89 | baseline |
| `DSV4_LONG_CTX=1` + bf16 pool | 19.03 | -4.3 % |
| `DSV4_LONG_CTX=1` + 4-bit pool quant (default) | 19.79 | -0.5 % |
The pool-quant cache (default ON when LC=1) closes the HSA + CSA speed gap to within noise — the trained tri-mode attention is effectively free: +7 pp MMLU for a 0.5 % decode-speed cost.
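To reproduce a row of the table, set the knobs before loading the bundle. The rough timing loop below is an illustration, not the exact harness; whether `DSV4_POOL_QUANT="0"` forces the bf16 pool is an assumption.

```python
import os, time

os.environ["DSV4_LONG_CTX"] = "1"        # "0" for the legacy SWA-only baseline row
# os.environ["DSV4_POOL_QUANT"] = "0"    # assumption: "0" disables pool quant (bf16 pool row)

from jang_tools.load_jangtq import load_jangtq_model
from jang_tools.dsv4.runtime import generate, GenerateOptions

model, tok = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ")
t0 = time.perf_counter()
generate(model, tok, "OsaurusAI/DeepSeek-V4-Flash-JANGTQ",
         messages=[{"role": "user", "content": "Count from 1 to 50."}],
         opts=GenerateOptions(mode="chat", max_tokens=128))
print(f"~{128 / (time.perf_counter() - t0):.1f} tok/s (rough, includes prefill)")
```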
## HumanEval+ pass@1
Coming soon — a comprehensive pass@1 run is in flight.
## Credits
Created by Jinho Jang — eric@jangq.ai
Built on top of DeepSeek-V4-Flash (deepseek-ai).
Distributed via Osaurus AI.