# Introduction
DeepSeek-V4-Flash is one of two models in the V4 series released by DeepSeek. It uses a Mixture of Experts (MoE) architecture with 284B total parameters, of which only 13B are activated per token, and supports a context length of up to 1 million tokens. Architecturally, the model introduces a hybrid attention mechanism, manifold-constrained hyperconnections, and the Muon optimizer. Pre-training covers more than 32T tokens, and post-training follows a two-stage paradigm: domain experts are first cultivated independently via SFT and GRPO reinforcement learning, and their multi-domain capabilities are then unified into a single model through online policy distillation. In maximum reasoning mode, a larger thinking budget lets its reasoning performance approach that of the Pro version; due to its smaller parameter scale, however, it falls slightly short of Pro on pure knowledge tasks and the most complex agent workflows.
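Below is a minimal inference sketch using Hugging Face `transformers`. The repo id `deepseek-ai/DeepSeek-V4-Flash`, the availability of a chat template, and the need for `trust_remote_code` are assumptions for illustration, not confirmed details of the release; check the published model page for the actual loading instructions.

```python
# Minimal usage sketch, not an official quick start.
# The repo id below is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4-Flash"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # per-token compute is ~13B (activated experts),
    device_map="auto",           # but all 284B parameters must fit across devices
    trust_remote_code=True,
)

# Assumes the tokenizer ships a chat template.
messages = [{"role": "user", "content": "Summarize what a Mixture of Experts layer does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Note that the MoE design changes the compute budget but not the memory budget: only about 13B of the 284B parameters run per token, so per-token latency is close to that of a 13B dense model, yet the full weights must still be resident (or sharded) in memory.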