DeepSeek-V4 Mini (300M) — randomly-initialized architecture replica

This repository contains a randomly-initialized, small-scale faithful replica of the DeepSeek-V4 architecture (paper: DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence).

It is not a trained model. It exists to serve as:

  • a faithful reference implementation of every V4 component, at a size that fits on a single consumer GPU,
  • a starting point for hyperparameter-search and ablation experiments,
  • a target for weight transfer / slicing from the full-scale V4-Pro / V4-Flash checkpoints in subsequent work.

Architecture components implemented

Component | Status | Notes
---|---|---
mHC (Manifold-Constrained Hyper-Connections) | ✅ | Sinkhorn-Knopp projection of B onto the Birkhoff polytope (20 iters); sigmoid-bounded A, C; dynamic + static parameterization. n_hc=4. See the Sinkhorn sketch after this table.
Hybrid attention (CSA / HCA / sliding-window) | ✅ | Per-layer dispatch via compress_ratios. Default 12-layer pattern: [0,0, 4,32, 4,32, 4,32, 4,32, 4,0].
Lightning Indexer | ✅ | Low-rank queries from shared cQ; ReLU(q·K) head sum; top-k block selection.
Shared-KV MQA + Grouped Output Projection | ✅ | wo_a per-group intermediate, wo_b final.
Partial RoPE + output -i rotation trick | ✅ | Last qk_rope_head_dim dims rotated; output rotated with negated sin to carry relative position.
Attention sink | ✅ | Per-head learnable logit added to the softmax denominator. See the sink sketch after this table.
DeepseekMoE with sqrt(softplus) routing | ✅ | Aux-loss-free top-k via gate bias; norm_topk_prob; routed_scaling_factor.
Hash-routed MoE (first num_hash_layers layers) | ✅ | Token-id → expert table (tid2eid).
Clamped SwiGLU | ✅ | Linear branch ∈ [-limit, limit], gate ≤ limit; limit=10. See the SwiGLU sketch after this table.
MTP module | ✅ | One V3-style next-token-prediction step with its own attention + MoE + head mHC.
YaRN RoPE scaling to 1M | ✅ (config) | factor=16, original=65536.
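
The mHC row refers to a Sinkhorn-Knopp projection of the residual-mixing matrix B onto the Birkhoff polytope (doubly-stochastic matrices). A minimal sketch of that step, assuming B is produced from unconstrained logits and normalized for 20 alternating row/column passes; the exact parameterization in the modeling code may differ:

import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    # Approximate projection onto the Birkhoff polytope by alternating
    # row and column normalization of a strictly positive matrix.
    B = torch.exp(logits)
    for _ in range(n_iters):
        B = B / (B.sum(dim=-1, keepdim=True) + eps)  # rows sum to ~1
        B = B / (B.sum(dim=-2, keepdim=True) + eps)  # columns sum to ~1
    return B

# With n_hc=4, the mixing matrix is 4x4:
B = sinkhorn_knopp(torch.randn(4, 4))
print(B.sum(dim=-1), B.sum(dim=-2))  # both close to a vector of ones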
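
The attention-sink row is a per-head learnable logit that participates in the softmax normalization but contributes no value vector. A small sketch of that mechanism; shapes are illustrative and the in-repo implementation may fold this into its attention path differently:

import torch

def softmax_with_sink(scores: torch.Tensor, sink_logit: torch.Tensor) -> torch.Tensor:
    # scores: (batch, heads, q_len, k_len); sink_logit: (heads,), learnable.
    # The sink adds exp(sink_logit) to the denominator but has no value row,
    # so each head can park probability mass on "nothing".
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return probs[..., :-1]  # drop the sink column before multiplying by V

scores = torch.randn(2, 8, 5, 5)
sink_logit = torch.zeros(8, requires_grad=True)  # one learnable logit per head
attn = softmax_with_sink(scores, sink_logit)
print(attn.sum(dim=-1))  # rows now sum to slightly less than 1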
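
The clamped-SwiGLU row bounds the two MLP branches before they are multiplied: the linear ("up") branch is clamped to [-limit, limit] and the gate branch is capped at limit, with limit=10. A sketch under the assumption that the clamp is applied before the SiLU; check the modeling code for the exact ordering:

import torch
import torch.nn.functional as F

def clamped_swiglu(x, w_gate, w_up, w_down, limit: float = 10.0):
    # Gate branch capped at +limit, linear branch clamped to [-limit, limit],
    # which keeps expert activations bounded regardless of input scale.
    gate = torch.clamp(x @ w_gate, max=limit)
    up = torch.clamp(x @ w_up, min=-limit, max=limit)
    return (F.silu(gate) * up) @ w_down

h, m = 512, 512  # hidden_size and moe_intermediate_size from the config below
x = torch.randn(4, h)
out = clamped_swiglu(x, torch.randn(h, m), torch.randn(h, m), torch.randn(m, h))
print(out.shape)  # torch.Size([4, 512])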

Quick start (Colab-validated)

The modeling code lives alongside the weights in this same repo (under code/deepseek_v4/). Download the snapshot, import the bundled package to register the model type with the HF auto classes, then load the model via the standard AutoModelForCausalLM API. No trust_remote_code is required.

The tokenizer ships with a baked-in ChatML-style chat_template, so apply_chat_template works out of the box.

# 1) install + auth
# !pip install -q "transformers>=4.50" huggingface_hub safetensors
from huggingface_hub import login, snapshot_download
login()  # private repo

# 2) download repo (weights + tokenizer + modeling package)
local = snapshot_download(repo_id="kshitijthakkar/deepseek-v4-mini-300M-init")

# 3) register the deepseek_v4 model_type with HF auto classes
import sys, os
sys.path.insert(0, os.path.join(local, "code"))
import deepseek_v4  # noqa: F401 — side-effect: AutoConfig / AutoModelForCausalLM registration

# 4) load via the standard Auto API
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained(local)
# Optional override (the same template is already baked into tokenizer_config.json):
# tok.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n'}}{% endfor %}"
model = AutoModelForCausalLM.from_pretrained(local, torch_dtype=torch.float32)
model.eval()

# 5) forward pass on a chat-templated prompt
messages = [{"role": "user", "content": "Hello, who are you?"}]
ids = tok.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
)
with torch.no_grad():
    out = model(input_ids=ids["input_ids"])
print("logits:", out.logits.shape)

# 6) greedy generate (output WILL be gibberish — model is randomly initialized)
gen_ids = ids["input_ids"].clone()
with torch.no_grad():
    for _ in range(20):
        nxt = model(input_ids=gen_ids).logits[:, -1].argmax(-1, keepdim=True)
        gen_ids = torch.cat([gen_ids, nxt], dim=1)
        if nxt.item() == tok.eos_token_id:
            break
print(tok.decode(gen_ids[0]))

Note: Output will be gibberish — this is a randomly-initialized model. The validation here is "does it load + forward pass cleanly", not "does it produce coherent text". Coherent text requires training, which is the next step in this collection.

Why a custom import? model_type="deepseek_v4" is not (yet) part of upstream transformers. The import deepseek_v4 line registers the config and model classes with AutoConfig / AutoModelForCausalLM so that from_pretrained resolves to the correct class. No trust_remote_code, no auto_map in config.json.
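
For reference, the registration the package performs on import has roughly this shape (module and class names below are placeholders; see code/deepseek_v4/__init__.py for the actual ones):

# Rough shape of the import-time registration; names are illustrative only.
from transformers import AutoConfig, AutoModelForCausalLM

from .configuration_deepseek_v4 import DeepseekV4Config   # hypothetical module/class name
from .modeling_deepseek_v4 import DeepseekV4ForCausalLM   # hypothetical module/class name

AutoConfig.register("deepseek_v4", DeepseekV4Config)
AutoModelForCausalLM.register(DeepseekV4Config, DeepseekV4ForCausalLM)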

Configuration

Parameter | Value
---|---
hidden_size | 512
num_hidden_layers | 12
num_attention_heads | 8
num_key_value_heads | 1 (MQA)
head_dim | 64
q_lora_rank / o_lora_rank | 256 / 256
qk_rope_head_dim | 32
n_routed_experts | 16
n_shared_experts | 1
num_experts_per_tok | 2
num_hash_layers | 2
moe_intermediate_size | 512
sliding_window | 32
max_position_embeddings | 1,048,576 (with YaRN factor=16, original=65536)
vocab_size | 129280 (real V4-Flash tokenizer)
num_nextn_predict_layers | 1
hc_mult (n_hc) | 4
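
The long-context numbers are internally consistent: the YaRN factor times the original context length equals the advertised max_position_embeddings. A quick check, with the rope_scaling dict written the way HF configs usually express YaRN (the exact key names in config.json may differ):

# Sanity check of the long-context arithmetic in the table above.
rope_scaling = {"type": "yarn", "factor": 16, "original_max_position_embeddings": 65536}
assert rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"] == 1_048_576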

Parameter count

  • Total: ~317 M
  • Activated per token: ~170 M (top-2 of 16 routed experts)

Tokenizer & chat template

The tokenizer is copied verbatim from deepseek-ai/DeepSeek-V4-Flash, including its chat template. Vocab size: 129280.

Parameter naming

Tensor names match the official V4 safetensors index exactly (flat layout, no model. prefix), so weight transfer / slicing from real V4-Pro / V4-Flash weights into a same-named subset of this scaffold should be straightforward.
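
Because the names line up, a transfer script can stay simple. A minimal sketch, assuming model is the mini scaffold loaded in the quick start, the donor shard filename is illustrative, every donor tensor is at least as large as its mini counterpart along each axis, and a naive corner slice is acceptable (a real script would pick heads/experts/channels more deliberately):

from safetensors import safe_open

target = model.state_dict()  # the mini scaffold loaded in the quick start
transferred = {}

with safe_open("v4-flash-shard.safetensors", framework="pt") as f:  # illustrative donor filename
    for name in f.keys():
        if name not in target:
            continue  # skip donor tensors the mini scaffold does not have
        donor, small = f.get_tensor(name), target[name]
        idx = tuple(slice(0, s) for s in small.shape)  # naive corner slice down to the mini shape
        transferred[name] = donor[idx].to(small.dtype)

missing, unexpected = model.load_state_dict(transferred, strict=False)
print(f"transferred {len(transferred)} tensors; {len(missing)} left at random init")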

Limitations

  • Random initialization: the model generates noise tokens and is not useful for inference beyond architectural sanity checks.
  • Not all V4 efficiency tricks are implemented: FP4 expert dtype, FP8 activation quantization, batch-invariant kernels, anticipatory routing, and on-disk KV storage are out of scope for this small replica.
  • The CSA path materializes full per-query selected-KV tensors; long-context optimization is left to future versions.

Citation

@misc{deepseek_v4_2026,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  year   = {2026},
  url    = {https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash}
}