DeepSeek-V4 is a series of Mixture-of-Experts (MoE) language models comprising two variants:
DeepSeek-V4-Pro: 1.6T total parameters, 49B activated per token
DeepSeek-V4-Flash: 284B total parameters, 13B activated per token
Both support a context length of 1 million tokens and represent a major architectural evolution from DeepSeek-V3, focused primarily on breaking the efficiency barrier for ultra-long-context processing.
Architecture
DeepSeek-V4-Pro Specific Configuration
61 Transformer layers, hidden dimension d = 7168
First 2 layers: HCA (Heavily Compressed Attention)
Remaining layers: CSA and HCA interleaved
MoE in all layers: 1 shared expert + 384 routed experts per layer, intermediate hidden dimension 3072 per expert, 6 experts activated per token
First 3 MoE layers use hash routing (routing based on the input token ID; a sketch follows this configuration list)
Query heads n_h = 128, head dimension c = 512, query compression dimension d_c = 1536
Output projection groups g = 16, intermediate output dimension d_g = 1024
Sliding window size n_win = 128
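Hash routing can be sketched as below; the specific hash function and expert-selection rule are assumptions, since the source only says routing is based on the input token ID:

```python
# Hypothetical sketch of ID-based hash routing for the first 3 MoE layers:
# each token ID deterministically maps to k routed experts, independent of
# the hidden state. The LCG hash here is illustrative, not from the source.
def hash_route(token_id: int, n_experts: int = 384, k: int = 6) -> list[int]:
    experts, seed = [], token_id
    while len(experts) < k:
        seed = (seed * 1103515245 + 12345) % (2**31)  # simple LCG; assumed
        e = seed % n_experts
        if e not in experts:  # pick k distinct experts
            experts.append(e)
    return experts
```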
HCA Parameters (V4-Pro)
Compression rate m′ = 128
Same query heads (128), head dimension (512), query compression dimension (1536) as CSA
Same grouped output projection (g=16, d_g=1024)
Key Innovation 1: Hybrid CSA/HCA Attention
This is the most critical architectural innovation, designed to make million-token contexts computationally feasible.
Compressed Sparse Attention (CSA)
CSA operates in two stages, KV compression followed by sparse selection, together with a shared-KV MQA design and a grouped output projection (the two stages are sketched after this list):
KV Compression: Computes two series of KV entries (C^a, C^b) and compression weights (Z^a, Z^b) from the hidden states. Every block of m tokens is compressed into one entry using learned softmax weights and positional biases. The compression uses overlapping windows (the C^b indices for block i overlap with the C^a indices for block i-1), so the effective compression ratio is 1/m.
Lightning Indexer for Sparse Selection: After compression, a lightweight indexer selects top-k compressed KV entries per query token. Indexer queries are produced in a low-rank manner (down-projection → up-projection). Index scores are computed as a weighted sum of ReLU-activated dot products across multiple indexer heads.
Shared Key-Value MQA: Each compressed KV entry serves as both key and value (shared key-value). Attention is Multi-Query Attention (MQA) style — all query heads share the single KV head. Queries share the same compressed latent vector used for the indexer.
Grouped Output Projection: Because c·n_h is very large, outputs are split into g groups; each group is projected to a d_g-dimensional intermediate, and the concatenated intermediates are then projected to the final d-dimensional output.
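A minimal PyTorch sketch of the two CSA stages under assumed shapes; the real projections, positional biases, causal masking, overlapping second series (C^b, Z^b), and low-rank indexer query path are omitted:

```python
import torch
import torch.nn.functional as F

def compress_kv(h, m, w_proj):
    # h: [T, d] hidden states. Every block of m tokens is reduced to one KV
    # entry with learned softmax weights (positional biases omitted).
    T, d = h.shape
    blocks = h[: T - T % m].view(-1, m, d)       # [T//m, m, d]
    z = F.softmax(blocks @ w_proj, dim=1)        # [T//m, m, 1] block weights
    return (z * blocks).sum(dim=1)               # [T//m, d] compressed entries

def indexer_scores(q_heads, k_idx, head_w):
    # Lightning indexer: weighted sum of ReLU-activated dot products
    # across indexer heads.
    # q_heads: [H, T, d_i], k_idx: [N, d_i], head_w: [H]
    s = torch.relu(torch.einsum("htd,nd->htn", q_heads, k_idx))
    return torch.einsum("h,htn->tn", head_w, s)  # [T, N] index scores

def select_topk(scores, k):
    # Each query token attends only to its top-k compressed KV entries.
    return scores.topk(k, dim=-1).indices
```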
Heavily Compressed Attention (HCA)
Similar to CSA but with much heavier compression (m′ = 128 vs. m = 4) and no sparse selection: all compressed KV entries are attended to densely. Does not use overlapping compression. Uses the same shared-KV MQA and grouped output projection strategies.
Additional Attention Details
Partial RoPE: Applied only to the last 64 dimensions of queries, KV entries, and attention outputs. Since KV entries serve as both keys and values, each entry arrives rotated by RoPE at its absolute position j; applying RoPE with position −i to the output of query i therefore leaves only the relative offsets j − i, canceling absolute position embeddings while restoring relative positional information.
Sliding Window Attention Branch: Both CSA and HCA include an uncompressed sliding window (n_win = 128 tokens) for fine-grained local dependencies, since queries cannot access tokens within their own unfinished compression block.
Attention Sink: Learnable per-head sink logits added to the softmax denominator, allowing attention scores to sum to less than 1 (sketched after this list).
QK Normalization: RMSNorm applied to each query head and the single KV head before core attention.
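A minimal sketch of the attention-sink softmax; the per-head sink logit is the learnable parameter, and dropping the sink column leaves real attention weights that sum to less than 1:

```python
import torch

def sink_softmax(logits, sink_logit):
    # logits: [n_heads, T, N] attention logits; sink_logit: [n_heads] learnable.
    # Appending the sink before the softmax adds exp(sink) to the denominator.
    sink = sink_logit.view(-1, 1, 1).expand(-1, logits.size(1), 1)
    probs = torch.softmax(torch.cat([logits, sink], dim=-1), dim=-1)
    return probs[..., :-1]  # drop the sink column; rows now sum to < 1
```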
Key Innovation 2: Manifold-Constrained Hyper-Connections (mHC)
mHC widens the residual stream into n_hc parallel streams, combined per layer by an input mapping A_l, a residual-mixing matrix B_l ∈ ℝ^(n_hc × n_hc), and an output mapping C_l ∈ ℝ^(n_hc × 1) that injects the layer output back into the streams.
Key constraint: B_l is constrained to the Birkhoff polytope (doubly stochastic matrices), ensuring spectral norm ≤ 1 and a non-expansive residual transformation for numerical stability. This is achieved via the Sinkhorn-Knopp algorithm (20 iterations) applied to exp(B̃_l), as sketched below.
Parameters are dynamically generated: each is decomposed into input-dependent components (via small projection matrices from the flattened, RMSNormed residual state) and static biases, with small learnable gating factors. A_l uses Sigmoid for non-negativity; C_l uses 2·Sigmoid.
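A minimal sketch of the Sinkhorn-Knopp projection described above; B_tilde stands for the pre-constraint mixing matrix:

```python
import torch

def sinkhorn_knopp(B_tilde, n_iters=20):
    # Project exp(B_tilde) toward the Birkhoff polytope by alternately
    # normalizing rows and columns (20 iterations, as stated in the text).
    B = torch.exp(B_tilde)
    for _ in range(n_iters):
        B = B / B.sum(dim=-1, keepdim=True)  # row-normalize
        B = B / B.sum(dim=-2, keepdim=True)  # column-normalize
    return B  # approximately doubly stochastic, spectral norm <= 1
```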
Key Innovation 3: Muon Optimizer
Muon optimizer for most parameters, AdamW for embeddings, prediction head, mHC static biases/gating factors, and RMSNorm weights.
Muon specifics (Newton-Schulz update sketched after this list):
Momentum = 0.95, weight decay = 0.1
Update RMS rescaled to 0.18 so that existing (AdamW-style) learning-rate settings can be reused
Hybrid Newton-Schulz iterations (10 total): 8 steps with coefficients (3.4445, -4.7750, 2.0315) for rapid convergence, then 2 steps with (2, -1.5, 0.5) for precise stabilization
Nesterov momentum trick applied
No QK-Clip needed due to attention architecture’s built-in QK normalization
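A minimal sketch of the hybrid Newton-Schulz step, assuming the standard Muon quintic iteration; handling of tall/non-square matrices and the Nesterov momentum step are omitted, and muon_update reads the 0.18 figure as normalizing the update RMS:

```python
import torch

HEAVY = (3.4445, -4.7750, 2.0315)  # 8 rapid-convergence steps
FINE  = (2.0, -1.5, 0.5)           # 2 precise-stabilization steps

def hybrid_newton_schulz(G, eps=1e-7):
    # Approximately orthogonalize the momentum matrix G with 10 iterations
    # of the odd polynomial X -> a*X + (b*A + c*A@A) @ X, where A = X X^T.
    X = G / (G.norm() + eps)  # bring the spectrum into the convergence region
    for a, b, c in [HEAVY] * 8 + [FINE] * 2:
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_update(momentum_buf, target_rms=0.18):
    U = hybrid_newton_schulz(momentum_buf)
    # Rescale the update so its RMS is 0.18, enabling learning-rate reuse.
    return U * (target_rms / (U.pow(2).mean().sqrt() + 1e-12))
```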
Efficiency Analysis
At 1M tokens, DeepSeek-V4-Pro achieves:
27% of single-token inference FLOPs compared to DeepSeek-V3.2
10% of KV cache size compared to DeepSeek-V3.2
~2% of KV cache compared to BF16 GQA8 with head dim 128
Mixed precision KV storage: BF16 for RoPE dimensions, FP8 for the rest
FP4 precision for lightning indexer attention computation
Smaller attention top-k than DeepSeek-V3.2
FP4 for routed expert parameters (currently same peak FLOPs as FP8 on existing hardware, but theoretically 1/3 more efficient on future hardware)
Training
Pre-Training Data
33T tokens for V4-Pro (32T for V4-Flash)
Corpus: math, code, web pages, long documents, multilingual data
Emphasis on long-document curation (scientific papers, technical reports)
Filtering to remove auto-generated/templated content
Agentic data incorporated during mid-training
Sample-level attention masking (different from V3)
Document packing from different sources to minimize truncation
Training Schedule (V4-Pro)
Peak LR: 2.0 × 10⁻⁴, end LR: 2.0 × 10⁻⁵
Linear warmup for the first 2000 steps, cosine decay near the end (schedule sketched after this list)
Max batch size: 94.4M tokens
Sequence length progression: 4K → 16K → 64K → 1M
Dense attention warmup (longer than V4-Flash) before introducing sparse attention
Sparse attention introduced at 64K sequence length with two-stage method: first warm up the lightning indexer, then full sparse training
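A minimal sketch of the learning-rate schedule as described; the plateau between warmup and decay, and the decay_frac controlling how much of training the final cosine phase covers, are assumptions (the source only says the decay happens near the end):

```python
import math

def lr_at(step, total_steps, peak=2.0e-4, end=2.0e-5,
          warmup=2000, decay_frac=0.1):
    # Linear warmup for the first 2000 steps.
    if step < warmup:
        return peak * step / warmup
    # Assumed constant plateau at the peak LR until the final stretch.
    decay_start = int(total_steps * (1 - decay_frac))
    if step < decay_start:
        return peak
    # Cosine decay from peak down to the end LR over the final stretch.
    t = (step - decay_start) / (total_steps - decay_start)
    return end + 0.5 * (peak - end) * (1 + math.cos(math.pi * t))
```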
Training Stability Techniques
Anticipatory Routing: Decouples backbone and routing-network updates. At step t, features are computed with the current parameters θ_t, but routing indices come from θ_{t-Δt}: the data for step t is fetched at step t-Δt, and its routing indices are pre-computed and cached. Adds ~20% wall-clock overhead, so it is applied dynamically, triggered only when loss spikes are detected, and training then reverts to the standard mode. A sketch follows.
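A minimal sketch of the anticipatory-routing loop; fetch, route, and step_fn are hypothetical helpers standing in for the data pipeline, the routing network, and the parameter update:

```python
from collections import deque

def anticipatory_training(num_steps, delta, fetch, route, step_fn):
    # Sketch: the batch for step t is fetched at step t - delta, and its
    # routing indices are computed and cached with the parameters of that
    # earlier step. The first delta steps are omitted here for brevity.
    pending = deque()
    for t in range(num_steps):
        future = fetch(t + delta)                # prefetch data for step t+delta
        pending.append((future, route(future)))  # cache indices under theta_t
        if t >= delta:
            batch, stale_indices = pending.popleft()
            step_fn(batch, stale_indices)        # features still use current theta
```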
SwiGLU Clamping: Linear component clamped to [-10, 10], gate component upper-bounded at 10. Eliminates outliers without performance degradation.
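A minimal sketch of the clamping, assuming a standard SwiGLU with SiLU gating; the clamp bounds follow the text:

```python
import torch
import torch.nn.functional as F

def clamped_swiglu(x, w_gate, w_lin):
    # Clamp the gate branch from above at 10 and the linear branch to
    # [-10, 10] before combining, suppressing activation outliers.
    gate = torch.clamp(x @ w_gate, max=10.0)
    lin = torch.clamp(x @ w_lin, min=-10.0, max=10.0)
    return F.silu(gate) * lin
```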
Post-Training
On-Policy Distillation (OPD)
Unified model consolidation that replaces the mixed RL pipeline from V3.2.
Reasoning Modes
Three modes: Non-think, Think High, Think Max — differentiated by length penalties and context windows during RL. Think Max includes a special system prompt instruction for maximum reasoning effort. Response format uses <think></think> tags.
Generative Reward Model (GRM)
For hard-to-verify tasks, the actor network itself functions as the GRM: RL is applied directly to optimize both evaluative (judging) and generative capabilities jointly. Uses rubric-guided RL data; minimal human annotation is needed.
Tool-Call Schema
A new XML-based format using the |DSML| special token, which mitigates escaping failures and tool-call errors.
Interleaved Thinking
Tool-calling scenarios: All reasoning content is preserved across the entire conversation (including user message boundaries), leveraging the 1M context
General conversation: Previous reasoning discarded at new user messages (unchanged from V3.2)
Conclusion
DeepSeek-V4 is currently the strongest open-source model: the largest in parameter count, the lowest in price, and the closest to the frontier in reasoning ability.
It is not perfect, however: no multimodality, somewhat weaker knowledge coverage, and relatively heavy filtering of politically sensitive topics.
If your scenario requires:
Cost-effective coding/reasoning capability → V4 is worth a try
The latest knowledge or multimodality → you still need GPT-5.4 / Claude Opus 4.7 / Gemini 3.1 Pro