Osaurus AI

DeepSeek-V4-Flash — JANGTQ (MLX, 2-bit MXTQ TurboQuant)

Premium 2-bit quantization using the TurboQuant MXTQ codec, with per-importance bit allocation. 79 GB bundle. Runs on a Mac Studio M3 Ultra at 25.9 tok/s.

Website: Osaurus AI


Model Details

| Property | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V4-Flash |
| Parameters | 671 B total, 37 B active per token (256 × 6 of 256 routed experts + shared) |
| Architecture | DeepseekV4 — MLA + multi-head causal residual + Compressor/Indexer long-ctx |
| Codec | TurboQuant MXTQ (Lloyd-Max codebook + Hadamard rotation) |
| Quantization plan | Per-importance: hash-routed L0-L2 at 4-bit MXTQ, smooth-routed L3-L42 at 2-bit MXTQ, non-routed at 8-bit affine gs=32 |
| Runtime | jang_tools.load_jangtq + mlx_lm.generate |
| Bundle size | 79 GB |
| Decode | 25.91 tok/s sustained on Mac Studio M3 Ultra (200-token greedy) |
| MMLU 200q (Non-Think, LC=1, max=300) | 83.50% |
| MMLU 200q (mixed-policy: Non-Think + Think on wrongs) | 91.50% |
| MMLU 200q (legacy LC=0, broken extractor — pre-2026-05-01) | 69.50% (superseded) |

Recipe

| Tensor class | Bits | Codec |
|---|---|---|
| Routed experts (hash-routed L0-L2) | 4-bit | MXTQ codebook |
| Routed experts (smooth-routed L3-L42) | 2-bit | MXTQ codebook |
| Attention (wq_a/wq_b/wkv/wo_a/wo_b) | 8-bit | affine gs=32 |
| Shared experts | 8-bit | affine gs=32 |
| Compressor + Indexer (long-ctx) | 8-bit | affine gs=32 |
| embed_tokens, lm_head | 8-bit | affine gs=32 |
| Norms / router gate / mHC | fp16 | passthrough |
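
The MXTQ codec itself ships inside jang_tools and is not reproduced here. As a rough illustration of the Lloyd-Max-codebook-plus-Hadamard-rotation idea named above, the sketch below quantizes one weight tile to 2-bit indices; the function names, group handling, and random-sign step are assumptions, not the actual implementation.

```python
# Illustrative 2-bit "codebook + Hadamard rotation" quantizer for one weight tile.
# NOT the actual MXTQ / jang_tools implementation.
import numpy as np
from scipy.linalg import hadamard

def rotate(block, signs):
    # Random-sign Hadamard rotation spreads outliers evenly before coding.
    n = block.shape[-1]                          # last dim must be a power of two
    return (block * signs) @ (hadamard(n) / np.sqrt(n))

def lloyd_max_codebook(x, bits=2, iters=25):
    # 1-D Lloyd-Max (k-means) fit of 2**bits reconstruction levels.
    levels = 2 ** bits
    centers = np.quantile(x, np.linspace(0.05, 0.95, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                centers[k] = x[idx == k].mean()
    return centers

def encode_tile(w, bits=2, seed=0):
    # Rotate, scale, fit a codebook, then keep 2-bit indices + tiny metadata.
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=w.shape[-1])
    r = rotate(w, signs)
    scale = np.abs(r).max() + 1e-12
    centers = lloyd_max_codebook((r / scale).ravel(), bits)
    idx = np.abs(r[..., None] / scale - centers).argmin(axis=-1)
    return idx.astype(np.uint8), centers, scale, signs

def decode_tile(idx, centers, scale, signs):
    # Undo the rotation: H/sqrt(n) is orthogonal and symmetric, signs are +/-1.
    n = idx.shape[-1]
    return (centers[idx] * scale) @ (hadamard(n) / np.sqrt(n)) / signs

w = np.random.default_rng(1).standard_normal((128, 128))
w_hat = decode_tile(*encode_tile(w))
print(np.corrcoef(w.ravel(), w_hat.ravel())[0, 1])   # rough 2-bit fidelity check
```

A production codec would additionally pack the indices and per-group metadata; the sketch skips that bookkeeping and only shows why the rotation makes a 4-level codebook usable.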

Use

import os
os.environ["JANG_WIRED_LIMIT_GB"] = "160"  # Mac Studio M3 Ultra
# Long context (default ON since 2026-05-01 — worth +7pp MMLU vs SWA-only fallback):
# os.environ["DSV4_LONG_CTX"] = "1"
# Pool cache 4-bit quant (default ON when LC=1, saves ~4 GB at 1M ctx, cosine ≥0.9967):
# os.environ["DSV4_POOL_QUANT"] = "1"

import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model
from jang_tools.dsv4.runtime import generate, GenerateOptions

model, tok = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ")

# Three modes: chat / think / think_max.
# `chat` is fast, no reasoning. `think` runs `<think>...</think>` chain-of-thought.
result = generate(
    model, tok, "OsaurusAI/DeepSeek-V4-Flash-JANGTQ",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    opts=GenerateOptions(mode="think", max_tokens=512),
)
print("REASONING:", result.reasoning_content[:200])
print("ANSWER:   ", result.content)

Runtime examples

End-to-end Python example scripts are in jang-tools/examples/dsv4_flash/:

| File | Purpose |
|---|---|
| 00_verify.py | Bundle metadata + smoke decode |
| 01_text_only.py | All 3 modes (chat / think / think_max) |
| 02_thinking.py | Reasoning vs content split + leak audit |
| 03_tool_calling.py | DSML tool-call parsing (\|DSML\| markers) |
| 04_long_context.py | Needle-in-haystack over HSA + CSA path |
| 05_streaming_generation.py | Token-streaming reasoning splitter (server pattern) |
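
The server pattern in 05_streaming_generation.py boils down to routing streamed text to a reasoning channel until the think block closes, then to a content channel. A hedged sketch of that splitter, assuming the model wraps its chain-of-thought in `<think>...</think>` (the chunk source and tag strings are assumptions, not the actual script):

```python
# Hypothetical streaming reasoning/content splitter (server pattern).
# Not the actual 05_streaming_generation.py implementation.
def split_stream(chunks, open_tag="<think>", close_tag="</think>"):
    """Yield ("reasoning", text) or ("content", text) events from streamed text chunks."""
    buf, started, in_think = "", False, False
    hold = max(len(open_tag), len(close_tag)) - 1   # hold back a possible partial tag
    for piece in chunks:
        buf += piece
        while True:
            if not started and open_tag in buf:
                pre, buf = buf.split(open_tag, 1)
                if pre:
                    yield ("content", pre)          # anything emitted before <think>
                started = in_think = True
            if in_think and close_tag in buf:
                reasoning, buf = buf.split(close_tag, 1)
                if reasoning:
                    yield ("reasoning", reasoning)
                in_think = False
                continue
            break
        channel = "reasoning" if in_think else "content"
        if len(buf) > hold:                          # flush everything that is safe to emit
            yield (channel, buf[:-hold])
            buf = buf[-hold:]
    if buf:
        yield (("reasoning" if in_think else "content"), buf)

# Usage with any chunk iterator (e.g. tokens decoded one at a time):
for channel, text in split_stream(["<think>2+2 ", "is 4</think>", "The answer is 4."]):
    print(channel, repr(text))
```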

Bundle comparison (DeepSeek-V4-Flash family)

MMLU 200q stratified across 40 subjects × 5 questions each. Re-measured 2026-05-01 with the fixed extractor, DSV4_LONG_CTX=1, and max_tokens=300. Older entries marked _(legacy)_ ran with the broken `next(c for c in out)` extractor, which grabbed the 'C' in "CORRECT" before the actual answer letter — they understate true accuracy by 5-10 pp.
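
For context, a hedged reconstruction of that failure mode; the evaluation harness is not part of this bundle, so everything beyond the quoted broken expression is illustrative:

```python
import re

OUT = "The CORRECT answer is (B)."

# Legacy extractor (reconstructed): first A-D character anywhere in the output.
# On the string above it returns the 'C' in "CORRECT", not the answer letter.
legacy = next(c for c in OUT if c in "ABCD")

# Fixed extractor (illustrative): prefer an explicit "answer is X" pattern,
# falling back to a standalone letter.
m = re.search(r"answer\s*(?:is|:)?\s*\(?([ABCD])\)?", OUT, re.IGNORECASE) or \
    re.search(r"\b([ABCD])\b", OUT)
fixed = m.group(1).upper() if m else None

print(legacy, fixed)  # C B
```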

| Bundle | Size | MMLU 200q (Non-Think) | MMLU 200q (mixed-policy) | Tok/s |
|---|---|---|---|---|
| DeepSeek-V4-Flash-JANGTQ (this) | 79 GB | 83.50% | 91.50% | 25.91 |
| DeepSeek-V4-Flash-JANGTQ2 | 79.6 GB | 70.00% _(legacy)_ | not measured | 22.34 |
| DeepSeek-V4-Flash-JANG_2L | 107 GB | 71.50% _(legacy)_ | not measured | 23.77 |
| mlx-community/DeepSeek-V4-Flash-2bit-DQ | 90 GB | 50.00% _(legacy)_ | not measured | 36.03 |

Mixed-policy = Non-Think first (max=300), then Think-High (max=4096, T=0.6, top_p=0.95) re-asks of only the questions the Non-Think pass got wrong. This matches a real production policy (escalate to Think when confidence is low) and is not a leaderboard hack.
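
A minimal sketch of that two-pass policy, reusing the generate/GenerateOptions API from the Use section. The temperature/top_p keyword names, the Non-Think→chat / Think-High→think mode mapping, and the extract_letter helper (e.g. the fixed extractor sketched above) are assumptions, not the actual harness:

```python
# Hypothetical two-pass MMLU policy: Non-Think first, escalate wrongs to Think-High.
from jang_tools.dsv4.runtime import generate, GenerateOptions

def run_mixed_policy(model, tok, repo, questions, extract_letter):
    # questions: list of dicts with "messages" (chat prompt) and "gold" (answer letter)
    correct = 0
    for q in questions:
        # Pass 1: fast Non-Think answer (mapped to chat mode here by assumption).
        r = generate(model, tok, repo, messages=q["messages"],
                     opts=GenerateOptions(mode="chat", max_tokens=300))
        pred = extract_letter(r.content)
        if pred != q["gold"]:
            # Pass 2: re-ask only the wrongs with Think-High sampling settings.
            r = generate(model, tok, repo, messages=q["messages"],
                         opts=GenerateOptions(mode="think", max_tokens=4096,
                                              temperature=0.6, top_p=0.95))
            pred = extract_letter(r.content)
        correct += (pred == q["gold"])
    return correct / len(questions)
```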

Runtime knob A/B (M4 Max, 128-tok decode)

| Variant | Tok/s | vs baseline |
|---|---|---|
| DSV4_LONG_CTX=0 (legacy SWA-only) | 19.89 | baseline |
| DSV4_LONG_CTX=1 + bf16 pool | 19.03 | -4.3 % |
| DSV4_LONG_CTX=1 + 4-bit pool quant (default) | 19.79 | -0.5 % |

The pool-quant cache (default ON when LC=1) closes the HSA + CSA speed gap to within noise — the trained tri-mode attention is effectively free: +7 pp MMLU for a 0.5 % speed cost.
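
The cosine figure cited for the pool cache (≥0.9967) is the kind of number a group-wise 4-bit affine round-trip check produces. A self-contained sketch of that check, with an illustrative group size and a plain min/max affine scheme rather than the jang_tools cache code:

```python
# Group-wise 4-bit affine round-trip + cosine check on a stand-in pooled cache.
import numpy as np

def affine_q4_roundtrip(x, group_size=32):
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-12           # 4 bits => 16 levels per group
    q = np.clip(np.round((g - lo) / scale), 0, 15)
    return (q * scale + lo).reshape(x.shape)

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pool = np.random.default_rng(0).standard_normal((1024, 512), dtype=np.float32)
print(cosine(pool, affine_q4_roundtrip(pool)))  # typically > 0.99 for Gaussian-like data
```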

HumanEval+ pass@1

Coming soon — a comprehensive pass@1 evaluation is in progress.

Credits

Created by Jinho Jang — eric@jangq.ai

Built on top of DeepSeek-V4-Flash (deepseek-ai).

Distributed via Osaurus AI.
