DeepSeek-V4 Mini (300M) — randomly-initialized architecture replica

This repository contains a randomly-initialized, small-scale faithful replica of the DeepSeek-V4 architecture (paper: DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence).

It is not a trained model. It exists to serve as:

  • a faithful reference implementation of every V4 component, at a size that fits on a single consumer GPU,
  • a starting point for hyperparameter-search and ablation experiments,
  • a target for weight transfer / slicing from the full-scale V4-Pro / V4-Flash checkpoints in subsequent work.

Architecture components implemented

Component | Status | Notes
---|---|---
mHC (Manifold-Constrained Hyper-Connections) | ✅ | Sinkhorn-Knopp projection of B onto the Birkhoff polytope (20 iters); sigmoid-bounded A, C; dynamic + static parameterization. n_hc=4. See the Sinkhorn sketch after this table.
Hybrid attention (CSA / HCA / sliding-window) | ✅ | Per-layer dispatch via compress_ratios. Default 12-layer pattern: [0,0, 4,32, 4,32, 4,32, 4,32, 4,0].
Lightning Indexer | ✅ | Low-rank queries from shared cQ; ReLU(q·K) head sum; top-k block selection.
Shared-KV MQA + Grouped Output Projection | ✅ | wo_a per-group intermediate, wo_b final.
Partial RoPE + output -i rotation trick | ✅ | Last qk_rope_head_dim dims rotated; output rotated with negated sin to carry relative position.
Attention sink | ✅ | Per-head learnable logit added to the softmax denominator. See the sink sketch after this table.
DeepseekMoE with sqrt(softplus) routing | ✅ | Aux-loss-free top-k via gate bias; norm_topk_prob; routed_scaling_factor.
Hash-routed MoE (first num_hash_layers layers) | ✅ | Token-id → expert table (tid2eid).
Clamped SwiGLU | ✅ | Linear branch ∈ [-limit, limit], gate ≤ limit; limit=10. See the SwiGLU sketch after this table.
MTP module | ✅ | One V3-style next-token-prediction step with its own attention + MoE + head mHC.
YaRN RoPE scaling to 1M | ✅ (config) | factor=16, original=65536.
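
The mHC row refers to a Sinkhorn-Knopp projection of the residual-mixing matrix B onto the Birkhoff polytope (doubly-stochastic matrices). A minimal sketch of that step, assuming B is produced from unconstrained logits and normalized for 20 alternating row/column passes; the exact parameterization in the modeling code may differ:

import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    # Approximate projection onto the Birkhoff polytope by alternating
    # row and column normalization of a strictly positive matrix.
    B = torch.exp(logits)
    for _ in range(n_iters):
        B = B / (B.sum(dim=-1, keepdim=True) + eps)  # rows sum to ~1
        B = B / (B.sum(dim=-2, keepdim=True) + eps)  # columns sum to ~1
    return B

# With n_hc=4, the mixing matrix is 4x4:
B = sinkhorn_knopp(torch.randn(4, 4))
print(B.sum(dim=-1), B.sum(dim=-2))  # both close to a vector of ones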
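
The attention-sink row is a per-head learnable logit that participates in the softmax normalization but contributes no value vector. A small sketch of that mechanism; shapes are illustrative and the in-repo implementation may fold this into its attention path differently:

import torch

def softmax_with_sink(scores: torch.Tensor, sink_logit: torch.Tensor) -> torch.Tensor:
    # scores: (batch, heads, q_len, k_len); sink_logit: (heads,), learnable.
    # The sink adds exp(sink_logit) to the denominator but has no value row,
    # so each head can park probability mass on "nothing".
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return probs[..., :-1]  # drop the sink column before multiplying by V

scores = torch.randn(2, 8, 5, 5)
sink_logit = torch.zeros(8, requires_grad=True)  # one learnable logit per head
attn = softmax_with_sink(scores, sink_logit)
print(attn.sum(dim=-1))  # rows now sum to slightly less than 1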
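
The clamped-SwiGLU row bounds the two MLP branches before they are multiplied: the linear ("up") branch is clamped to [-limit, limit] and the gate branch is capped at limit, with limit=10. A sketch under the assumption that the clamp is applied before the SiLU; check the modeling code for the exact ordering:

import torch
import torch.nn.functional as F

def clamped_swiglu(x, w_gate, w_up, w_down, limit: float = 10.0):
    # Gate branch capped at +limit, linear branch clamped to [-limit, limit],
    # which keeps expert activations bounded regardless of input scale.
    gate = torch.clamp(x @ w_gate, max=limit)
    up = torch.clamp(x @ w_up, min=-limit, max=limit)
    return (F.silu(gate) * up) @ w_down

h, m = 512, 512  # hidden_size and moe_intermediate_size from the config below
x = torch.randn(4, h)
out = clamped_swiglu(x, torch.randn(h, m), torch.randn(h, m), torch.randn(m, h))
print(out.shape)  # torch.Size([4, 512])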

Quick start (Colab-validated)

The modeling code lives alongside the weights in this same repo (under code/deepseek_v4/). Download the snapshot, import the bundled package to register the model type with the HF auto classes, then load the model via the standard AutoModelForCausalLM API. No trust_remote_code is required.

The tokenizer ships with a baked-in ChatML-style chat_template, so apply_chat_template works out of the box.

# 1) install + auth
# !pip install -q "transformers>=4.50" huggingface_hub safetensors
from huggingface_hub import login, snapshot_download
login()  # private repo

# 2) download repo (weights + tokenizer + modeling package)
local = snapshot_download(repo_id="kshitijthakkar/deepseek-v4-mini-300M-init")

# 3) register the deepseek_v4 model_type with HF auto classes
import sys, os
sys.path.insert(0, os.path.join(local, "code"))
import deepseek_v4  # noqa: F401 — side-effect: AutoConfig / AutoModelForCausalLM registration

# 4) load via the standard Auto API
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained(local)
# Optional override (the same template is already baked into tokenizer_config.json):
# tok.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n'}}{% endfor %}"
model = AutoModelForCausalLM.from_pretrained(local, torch_dtype=torch.float32)
model.eval()

# 5) forward pass on a chat-templated prompt
messages = [{"role": "user", "content": "Hello, who are you?"}]
ids = tok.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
)
with torch.no_grad():
    out = model(input_ids=ids["input_ids"])
print("logits:", out.logits.shape)

# 6) greedy generate (output WILL be gibberish — model is randomly initialized)
gen_ids = ids["input_ids"].clone()
with torch.no_grad():
    for _ in range(20):
        nxt = model(input_ids=gen_ids).logits[:, -1].argmax(-1, keepdim=True)
        gen_ids = torch.cat([gen_ids, nxt], dim=1)
        if nxt.item() == tok.eos_token_id:
            break
print(tok.decode(gen_ids[0]))

Note: Output will be gibberish — this is a randomly-initialized model. The validation here is "does it load + forward pass cleanly", not "does it produce coherent text". Coherent text requires training, which is the next step in this collection.

Why a custom import? model_type="deepseek_v4" is not (yet) part of upstream transformers. The import deepseek_v4 line registers the config and model classes with AutoConfig / AutoModelForCausalLM so that from_pretrained resolves to the correct class. No trust_remote_code, no auto_map in config.json.
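
For reference, the registration the package performs on import has roughly this shape (module and class names below are placeholders; see code/deepseek_v4/__init__.py for the actual ones):

# Rough shape of the import-time registration; names are illustrative only.
from transformers import AutoConfig, AutoModelForCausalLM

from .configuration_deepseek_v4 import DeepseekV4Config   # hypothetical module/class name
from .modeling_deepseek_v4 import DeepseekV4ForCausalLM   # hypothetical module/class name

AutoConfig.register("deepseek_v4", DeepseekV4Config)
AutoModelForCausalLM.register(DeepseekV4Config, DeepseekV4ForCausalLM)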

Configuration

Parameter | Value
---|---
hidden_size | 512
num_hidden_layers | 12
num_attention_heads | 8
num_key_value_heads | 1 (MQA)
head_dim | 64
q_lora_rank / o_lora_rank | 256 / 256
qk_rope_head_dim | 32
n_routed_experts | 16
n_shared_experts | 1
num_experts_per_tok | 2
num_hash_layers | 2
moe_intermediate_size | 512
sliding_window | 32
max_position_embeddings | 1,048,576 (with YaRN factor=16, original=65536)
vocab_size | 129280 (real V4-Flash tokenizer)
num_nextn_predict_layers | 1
hc_mult (n_hc) | 4
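
The long-context numbers are internally consistent: the YaRN factor times the original context length equals the advertised max_position_embeddings. A quick check, with the rope_scaling dict written the way HF configs usually express YaRN (the exact key names in config.json may differ):

# Sanity check of the long-context arithmetic in the table above.
rope_scaling = {"type": "yarn", "factor": 16, "original_max_position_embeddings": 65536}
assert rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"] == 1_048_576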

Parameter count

  • Total: ~317 M
  • Activated per token: ~170 M (top-2 of 16 routed experts)

Tokenizer & chat template

The tokenizer is copied verbatim from deepseek-ai/DeepSeek-V4-Flash, including its chat template. Vocab size: 129280.

Parameter naming

Tensor names match the official V4 safetensors index exactly (flat layout, no model. prefix), so weight transfer / slicing from real V4-Pro / V4-Flash weights into a same-named subset of this scaffold should be straightforward.
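
Because the names line up, a transfer script can stay simple. A minimal sketch, assuming model is the mini scaffold loaded in the quick start, the donor shard filename is illustrative, every donor tensor is at least as large as its mini counterpart along each axis, and a naive corner slice is acceptable (a real script would pick heads/experts/channels more deliberately):

from safetensors import safe_open

target = model.state_dict()  # the mini scaffold loaded in the quick start
transferred = {}

with safe_open("v4-flash-shard.safetensors", framework="pt") as f:  # illustrative donor filename
    for name in f.keys():
        if name not in target:
            continue  # skip donor tensors the mini scaffold does not have
        donor, small = f.get_tensor(name), target[name]
        idx = tuple(slice(0, s) for s in small.shape)  # naive corner slice down to the mini shape
        transferred[name] = donor[idx].to(small.dtype)

missing, unexpected = model.load_state_dict(transferred, strict=False)
print(f"transferred {len(transferred)} tensors; {len(missing)} left at random init")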

Limitations

  • Random initialization: the model generates noise tokens and is not useful for inference beyond architectural sanity checks.
  • Not all V4 efficiency tricks are implemented: FP4 expert dtype, FP8 activation quantization, batch-invariant kernels, anticipatory routing, and on-disk KV storage are out of scope for this small replica.
  • The CSA path materializes full per-query selected-KV tensors; long-context optimization is left to future versions.

Citation

@misc{deepseek_v4_2026,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  year   = {2026},
  url    = {https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash}
}