# DeepSeek-V4 Mini (300M) — randomly-initialized architecture replica
This repository contains a randomly-initialized, small-scale faithful replica of the DeepSeek-V4 architecture (paper: DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence).
It is not a trained model. It exists to serve as:
- a faithful reference implementation of every V4 component, at a size that fits on a single consumer GPU,
- a starting point for hyperparameter-search and ablation experiments,
- a target for weight transfer / slicing from the full-scale V4-Pro / V4-Flash checkpoints in subsequent work.
## Architecture components implemented
| Component | Status | Notes |
|---|---|---|
| mHC (Manifold-Constrained Hyper-Connections) | ✅ | Sinkhorn-Knopp projection of B onto the Birkhoff polytope (20 iters); sigmoid-bounded A, C; dynamic+static parameterization. n_hc=4. See the sketch below the table. |
| Hybrid attention (CSA / HCA / sliding-window) | ✅ | Per-layer dispatch via compress_ratios. Default 12-layer pattern: [0,0, 4,32, 4,32, 4,32, 4,32, 4,0]. |
| Lightning Indexer | ✅ | Low-rank queries from shared cQ; ReLU(q·K) head sum; top-k block selection. |
| Shared-KV MQA + Grouped Output Projection | ✅ | wo_a per-group intermediate, wo_b final. |
| Partial RoPE + output -i rotation trick | ✅ | Last qk_rope_head_dim dims rotated; output rotated with negated sin to carry relative position. |
| Attention sink | ✅ | Per-head learnable logit added to softmax denominator. |
| DeepseekMoE with sqrt(softplus) routing | ✅ | Aux-loss-free top-k via gate bias; norm_topk_prob; routed_scaling_factor. |
| Hash-routed MoE (first num_hash_layers layers) | ✅ | Token-id → expert table (tid2eid). |
| Clamped SwiGLU | ✅ | linear ∈ [-limit, limit], gate ≤ limit; limit=10. |
| MTP module | ✅ | One V3-style next-token-prediction step with its own attention+MoE+head mHC. |
| YaRN RoPE scaling to 1M | ✅ (config) | factor=16, original=65536. |
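
The mHC row above relies on a Sinkhorn-Knopp projection of the mixing matrix B onto the Birkhoff polytope (the set of doubly-stochastic matrices). Below is a minimal sketch of that projection, assuming the standard alternating row/column normalization and a 4×4 B for n_hc=4; the `exp` pre-activation and the helper name are illustrative, not taken from this repo's code.

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20, eps: float = 1e-6) -> torch.Tensor:
    """Alternately normalize rows and columns so the matrix approaches
    the Birkhoff polytope (every row and column sums to 1)."""
    m = torch.exp(logits)  # strictly positive entries
    for _ in range(n_iters):
        m = m / (m.sum(dim=-1, keepdim=True) + eps)  # row-normalize
        m = m / (m.sum(dim=-2, keepdim=True) + eps)  # column-normalize
    return m

B = sinkhorn_knopp(torch.randn(4, 4))  # n_hc = 4, 20 iterations as in the table
print(B.sum(dim=-1), B.sum(dim=-2))    # both ≈ tensor([1., 1., 1., 1.])
```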
## Quick start (Colab-validated)
The modeling code lives alongside the weights in this same repo (under `code/deepseek_v4/`). Download the snapshot, register it with the HF auto classes, then load via the standard `AutoModelForCausalLM` API. No `trust_remote_code` is required.

The tokenizer ships with a baked-in ChatML-style `chat_template`, so `apply_chat_template` works out of the box.
```python
# 1) install + auth
# !pip install -q "transformers>=4.50" huggingface_hub safetensors
from huggingface_hub import login, snapshot_download

login()  # private repo

# 2) download repo (weights + tokenizer + modeling package)
local = snapshot_download(repo_id="kshitijthakkar/deepseek-v4-mini-300M-init")

# 3) register the deepseek_v4 model_type with HF auto classes
import sys, os
sys.path.insert(0, os.path.join(local, "code"))
import deepseek_v4  # noqa: F401 — side-effect: AutoConfig / AutoModelForCausalLM registration

# 4) load via the standard Auto API
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained(local)
# Optional override (the same template is already baked into tokenizer_config.json):
# tok.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n'}}{% endfor %}"
model = AutoModelForCausalLM.from_pretrained(local, torch_dtype=torch.float32)
model.eval()

# 5) forward pass on a chat-templated prompt
messages = [{"role": "user", "content": "Hello, who are you?"}]
ids = tok.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
)
with torch.no_grad():
    out = model(input_ids=ids["input_ids"])
print("logits:", out.logits.shape)

# 6) greedy generate (output WILL be gibberish — model is randomly initialized)
gen_ids = ids["input_ids"].clone()
with torch.no_grad():
    for _ in range(20):
        nxt = model(input_ids=gen_ids).logits[:, -1].argmax(-1, keepdim=True)
        gen_ids = torch.cat([gen_ids, nxt], dim=1)
        if nxt.item() == tok.eos_token_id:
            break
print(tok.decode(gen_ids[0]))
```
> **Note:** Output will be gibberish — this is a randomly-initialized model. The validation here is "does it load and forward-pass cleanly", not "does it produce coherent text". Coherent text requires training, which is the next step in this collection.
## Why a custom import?
model_type="deepseek_v4"is not (yet) part of upstreamtransformers. Theimport deepseek_v4line registers the config and model classes withAutoConfig/AutoModelForCausalLMso thatfrom_pretrainedresolves to the correct class. Notrust_remote_code, noauto_mapinconfig.json.
## Configuration
| Parameter | Value |
|---|---|
| hidden_size | 512 |
| num_hidden_layers | 12 |
| num_attention_heads | 8 |
| num_key_value_heads | 1 (MQA) |
| head_dim | 64 |
| q_lora_rank / o_lora_rank | 256 / 256 |
| qk_rope_head_dim | 32 |
| n_routed_experts | 16 |
| n_shared_experts | 1 |
| num_experts_per_tok | 2 |
| num_hash_layers | 2 |
| moe_intermediate_size | 512 |
| sliding_window | 32 |
| max_position_embeddings | 1,048,576 (with YaRN factor=16, original=65536) |
| vocab_size | 129280 (real V4-Flash tokenizer) |
| num_nextn_predict_layers | 1 |
| hc_mult (n_hc) | 4 |
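
A quick way to sanity-check these values against the shipped `config.json` (reusing `local` from the quick start, after the `import deepseek_v4` registration):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(local)  # resolves via the deepseek_v4 registration
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
# expect: 512 12 8
```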
## Parameter count
- Total: ~317 M
- Activated per token: ~170 M (top-2 of 16 routed experts)
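
The total is a one-liner to verify (reusing `model` from the quick start); the activated figure is an arithmetic consequence of the routing, sketched in the comment:

```python
# `model` from the quick start
total = sum(p.numel() for p in model.parameters())
print(f"total params: {total / 1e6:.1f}M")  # expect ≈ 317M

# Activated-per-token ≈ total minus the 14/16 routed experts that stay idle
# in each MoE layer (top-2 of 16 routed; the shared expert is always on).
```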
## Tokenizer & chat template
The tokenizer is copied verbatim from [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash), including its chat template. Vocab size: 129280.
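
To inspect the rendered template without tokenizing (reusing `tok` from the quick start), assuming the baked-in template matches the ChatML-style one shown there:

```python
messages = [{"role": "user", "content": "Hello, who are you?"}]
print(tok.apply_chat_template(messages, tokenize=False))
# Given the ChatML-style template, this renders as:
# <|im_start|>user
# Hello, who are you?<|im_end|>
```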
## Parameter naming
Tensor names match the official V4 safetensors index exactly (flat layout, no `model.` prefix), so weight transfer / slicing from real V4-Pro / V4-Flash weights into a same-named subset of this scaffold should be straightforward.
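
A minimal sketch of such a transfer, assuming a donor shard loaded with `safetensors`; the shard path is a placeholder, and the slicing rule (take the leading sub-block along every dimension) is an illustrative assumption, not a principled mapping:

```python
from safetensors.torch import load_file

donor = load_file("path/to/v4-flash-shard.safetensors")  # placeholder path
scaffold = model.state_dict()  # `model` from the quick start

for name, small in scaffold.items():
    if name not in donor:
        continue
    big = donor[name]
    # naive rule: leading sub-block per dimension — a real transfer would
    # select heads/experts deliberately rather than truncating.
    idx = tuple(slice(0, s) for s in small.shape)
    scaffold[name] = big[idx].to(small.dtype).clone()

model.load_state_dict(scaffold)
```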
## Limitations
- Random initialization — generates noise tokens; not useful for inference beyond architectural sanity checks.
- Not all V4 efficiency tricks are implemented — FP4 expert dtype, FP8 activation quant, batch-invariant kernels, anticipatory routing, and on-disk KV storage are out of scope for this small replica.
- The CSA path materializes full per-query selected-KV tensors; long-context optimization is left to future versions.
## Citation
```bibtex
@misc{deepseek_v4_2026,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  year   = {2026},
  url    = {https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash}
}
```