YummyYum committed on
Commit e70c37c · verified · 1 Parent(s): 856e5b4

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +0 -3
README.md CHANGED
@@ -1,6 +1,3 @@
- ---
- license: apache-2.0
- ---
  # Introduction
  DeepSeek-V4-Flash is one of two models in the V4 series released by DeepSeek. It uses a Mixture of Experts (MoE) architecture with 284B total parameters, only 13B of which are activated, and supports a context length of up to 1 million tokens. Architecturally, the model introduces a hybrid attention mechanism, manifold-constrained hyperconnections, and the Muon optimizer. Pre-training data exceeds 32T tokens, and post-training follows a two-stage paradigm — first independently cultivating domain experts via SFT and GRPO reinforcement learning, then unifying multi-domain capabilities into a single model through online policy distillation. In maximum reasoning mode, a larger thinking budget allows its reasoning performance to approach that of the Pro version; however, due to its smaller parameter scale, it falls slightly short of Pro on pure knowledge tasks and the most complex agent workflows.
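
The sparse activation described in the README (only ~13B of 284B parameters active per token) comes from MoE gating, where each token is routed to a small subset of experts. The sketch below is a toy top-k routing function to illustrate the idea; the expert count, logits, and gating formula are illustrative assumptions, not DeepSeek-V4-Flash's actual mechanism.

```python
# Toy sketch of top-k expert routing in a Mixture of Experts layer.
# Illustrative only: expert count and gate values are made up, not the
# model's real configuration.
import math

def top_k_routing(gate_logits, k):
    """Pick the k experts with the highest gate scores and softmax over them.

    Returns a dict mapping expert index -> routing weight; all other
    experts get zero weight (i.e., their parameters stay inactive).
    """
    topk = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in topk]
    z = sum(exps)
    return {i: e / z for i, e in zip(topk, exps)}

# For this token, only 2 of the 4 experts receive nonzero weight,
# so only their parameters contribute to the forward pass.
weights = top_k_routing([0.1, 2.0, -1.0, 1.5], k=2)
```

Because only the selected experts run, compute per token scales with the active parameter count (13B) rather than the total (284B).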
 
 
 
 
 