YummyYum committed on
Commit e70c37c · verified · 1 Parent(s): 856e5b4

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +0 -3
README.md CHANGED
@@ -1,6 +1,3 @@
- ---
- license: apache-2.0
- ---
  # Introduction
  DeepSeek-V4-Flash is one of two models in the V4 series released by DeepSeek. It uses a Mixture of Experts (MoE) architecture with 284B total parameters, only 13B of which are activated, and supports a context length of up to 1 million tokens. Architecturally, the model introduces a hybrid attention mechanism, manifold-constrained hyperconnections, and the Muon optimizer. Pre-training data exceeds 32T tokens, and post-training follows a two-stage paradigm — first independently cultivating domain experts via SFT and GRPO reinforcement learning, then unifying multi-domain capabilities into a single model through online policy distillation. In maximum reasoning mode, a larger thinking budget allows its reasoning performance to approach that of the Pro version; however, due to its smaller parameter scale, it falls slightly short of Pro on pure knowledge tasks and the most complex agent workflows.
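
The sparse activation described in the README (only ~13B of 284B parameters active per token) comes from MoE gating, where each token is routed to a small subset of experts. The sketch below is a toy top-k routing function to illustrate the idea; the expert count, logits, and gating formula are illustrative assumptions, not DeepSeek-V4-Flash's actual mechanism.

```python
# Toy sketch of top-k expert routing in a Mixture of Experts layer.
# Illustrative only: expert count and gate values are made up, not the
# model's real configuration.
import math

def top_k_routing(gate_logits, k):
    """Pick the k experts with the highest gate scores and softmax over them.

    Returns a dict mapping expert index -> routing weight; all other
    experts get zero weight (i.e., their parameters stay inactive).
    """
    topk = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in topk]
    z = sum(exps)
    return {i: e / z for i, e in zip(topk, exps)}

# For this token, only 2 of the 4 experts receive nonzero weight,
# so only their parameters contribute to the forward pass.
weights = top_k_routing([0.1, 2.0, -1.0, 1.5], k=2)
```

Because only the selected experts run, compute per token scales with the active parameter count (13B) rather than the total (284B).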
 
 
 
 
 