DeepSeek-V3 Explained: A 671B MoE Trained for 2.788M GPU Hours

Quick answer

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts (MoE) language model that activates only 37B parameters per token, pre-trained on 14.8 trillion tokens and released with open weights. The headline is the bill: its full training took 2.788 million H800 GPU hours — at the report’s assumed $2/GPU-hour, roughly $5.6M — yet it matches leading closed-source models on many benchmarks and beats other open models. The interesting claim is not “best scores”; it is “frontier-class scores for an order of magnitude less compute than people assumed it took.”

How DeepSeek-V3 spends compute so efficiently

DeepSeek-V3 is sparse by design. A router sends each token to a small subset of experts, so only 37B of the model’s 671B parameters fire per token. You get a 671B model’s knowledge capacity at roughly a 37B model’s per-token FLOPs. The architecture, Multi-head Latent Attention (MLA) plus the DeepSeekMoE expert layout, was already validated in DeepSeek-V2, so V3 is less a new design than a scaled, hardened, and cost-engineered execution of an existing one.
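To make the "only a subset fires" idea concrete, here is a minimal sketch of top-k expert routing. It is a generic softmax router written for illustration, not DeepSeek-V3's exact gating function; the toy experts, the logits, and every function name are assumptions.

```python
import math

# Share of parameters active per token, from the figures quoted above.
ACTIVE_FRACTION = 37 / 671  # about 0.055


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_token(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in gates.items())
    return output, gates


# Toy setup (hypothetical): 8 scalar "experts", 2 active per token.
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
output, gates = moe_forward(1.0, experts, logits, k=2)
```

Six of the eight toy experts are never called for this token, which is the whole trick: compute scales with k, while capacity scales with the total number of experts.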

Multi-head latent attention

MLA attacks the part of inference that usually dominates memory: the key-value cache. Standard multi-head attention stores a full set of keys and values for every token, and that cache balloons with context length and batch size. MLA instead compresses keys and values into a low-rank latent vector and caches that, reconstructing the per-head representations on the fly. The result is a much smaller KV cache, which is what makes long-context inference on a 671B model financially survivable. This is the architectural choice that turns “huge model” into “huge model you can actually serve.”
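The compress-then-reconstruct idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the dimensions are toy values rather than DeepSeek-V3's, the weights are random, all names are hypothetical, and the sketch leaves out the positional-encoding handling that a real MLA layer needs.

```python
import random

# Toy, hypothetical sizes chosen only to keep the example small.
D_MODEL, N_HEADS, D_HEAD, D_LATENT = 64, 4, 16, 8


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
W_down = rand_matrix(D_LATENT, D_MODEL, rng)           # hidden state -> latent
W_up_k = rand_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # latent -> all heads' keys
W_up_v = rand_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # latent -> all heads' values

latent_cache = []  # one small latent vector per token, instead of full K and V


def append_token(hidden):
    """Compress the token's hidden state and cache only the latent."""
    latent_cache.append(matvec(W_down, hidden))


def keys_and_values(position):
    """Rebuild per-head keys and values from the cached latent on the fly."""
    latent = latent_cache[position]
    return matvec(W_up_k, latent), matvec(W_up_v, latent)


def cache_floats_per_token():
    """Floats cached per token per layer: standard attention vs. latent cache."""
    standard = 2 * N_HEADS * D_HEAD  # full keys plus full values
    return standard, D_LATENT


for _ in range(3):
    append_token([rng.gauss(0.0, 1.0) for _ in range(D_MODEL)])
keys, values = keys_and_values(0)
```

With these toy sizes the cache holds 8 floats per token instead of 128. The trade is extra matrix multiplies at read time in exchange for a cache that grows far more slowly with context length and batch size.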

Auxiliary-loss-free load balancing

MoE models have a chronic problem: if the router favors a few experts, the rest sit idle and capacity is wasted. The usual fix adds an auxiliary balancing loss, but that loss competes with the main objective and can degrade quality when it is weighted heavily enough to enforce balance. DeepSeek-V3’s contribution here is an auxiliary-loss-free strategy: it nudges balance by adjusting a per-expert bias term used in routing, rather than by adding a competing loss term. The point is subtle but real: you keep experts busy without taxing the thing you actually care about, model quality.
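A rough sketch of the bias mechanism follows. The bias is added to the routing scores only when choosing experts, the gate weights still come from the raw scores, and after each step the bias drops for overloaded experts and rises for underloaded ones. The skewed toy router, the step size, and all names are assumptions made for illustration.

```python
import random


def select_experts(scores, bias, k):
    """Choose the top-k experts by score + bias; bias affects selection only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:k]


def gate_weights(scores, chosen):
    """Gate weights use the raw scores, so the bias never enters the output."""
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    for i, count in enumerate(load):
        if count > mean_load:
            bias[i] -= step_size
        elif count < mean_load:
            bias[i] += step_size
    return bias


def simulate(steps=200, n_experts=8, k=2, tokens_per_step=256, step_size=0.01, seed=0):
    """Toy run: the router unfairly favors experts 0 and 1 (hypothetical skew)."""
    rng = random.Random(seed)
    skew = [0.3 if i < 2 else 0.0 for i in range(n_experts)]
    bias = [0.0] * n_experts
    imbalance = []  # max expert load divided by mean load, per step
    for _ in range(steps):
        load = [0] * n_experts
        for _ in range(tokens_per_step):
            scores = [rng.random() + skew[i] for i in range(n_experts)]
            for i in select_experts(scores, bias, k):
                load[i] += 1
        imbalance.append(max(load) / (sum(load) / n_experts))
        update_bias(bias, load, step_size)
    return imbalance, bias


imbalance, bias = simulate()
weights = gate_weights([0.9, 0.1, 0.6], chosen=[0, 2])
```

In this toy run the favored experts start out heavily overloaded, and the imbalance shrinks as their bias drifts negative. No gradient flows through the bias, which is why the approach does not compete with the training loss.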

Multi-token prediction and FP8 training

Two more levers carry the efficiency story. Multi-token prediction (MTP) trains the model to predict several future tokens per position instead of one, which densifies the learning signal and can later be reused for speculative decoding to speed up inference. FP8 mixed-precision training runs much of the forward and backward pass in 8-bit floating point rather than 16-bit, roughly halving the memory and bandwidth cost of the heaviest operations. Doing FP8 at this scale without the training diverging is the hard part, and the report is most useful as a recipe for the engineering that keeps it stable.
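To see what "densifies the learning signal" means, here is a small sketch of how multi-token targets can be built from a sequence. It shows only the data side of the idea. The function name and the `n_future` parameter are assumptions, and the sketch says nothing about how DeepSeek-V3 wires its prediction modules.

```python
def mtp_targets(tokens, n_future):
    """For each usable position, pair the current token with its next n_future tokens."""
    examples = []
    for i in range(len(tokens) - n_future):
        examples.append((tokens[i], tokens[i + 1 : i + 1 + n_future]))
    return examples


sentence = ["The", "cat", "sat", "on", "the", "mat"]

next_token_only = mtp_targets(sentence, n_future=1)
two_tokens_ahead = mtp_targets(sentence, n_future=2)

# Total number of supervised targets in each setup.
signals_single = sum(len(targets) for _, targets in next_token_only)
signals_multi = sum(len(targets) for _, targets in two_tokens_ahead)
```

Each position now carries two targets instead of one, so the same sequence yields more supervised predictions per training step. The extra predictions are also natural draft tokens for speculative decoding at inference time.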

Training under a tight compute budget

The number that traveled was 2.788M H800 GPU hours for the full run on 14.8T tokens — and, just as notably, that the run was stable: the authors report no irrecoverable loss spikes and no rollbacks across the entire pre-training. That stability is itself a result. Anyone who has run a large training job knows that a single unrecoverable divergence can burn weeks of cluster time, so “we never had to roll back” is a stronger efficiency claim than the raw GPU-hour figure. Note the honest scope: the 2.788M hours cover pre-training, context extension, and post-training compute — not the research, ablations, and failed runs that preceded the final recipe.
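The arithmetic behind the headline figure is easy to check. The sketch below uses only the GPU-hour total and the $2 rental rate quoted above. The cluster size passed to `wall_clock_days` is an assumed value for illustration, and that estimate assumes every GPU is busy for the entire run.

```python
GPU_HOURS = 2_788_000         # full training run, H800 GPU hours
PRICE_PER_GPU_HOUR = 2.0      # assumed rental rate in USD

total_cost_usd = GPU_HOURS * PRICE_PER_GPU_HOUR  # 5,576,000


def wall_clock_days(gpu_hours, n_gpus):
    """Idealized duration if n_gpus run in parallel with no idle time."""
    return gpu_hours / n_gpus / 24


# Hypothetical cluster size; adjust to compare other deployments.
days_on_2048_gpus = wall_clock_days(GPU_HOURS, 2048)
```

On that assumed cluster the whole pipeline fits in roughly two months of wall-clock time, which is why a single rollback measured in weeks would have been such a large share of the budget.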

Key results

Scale vs. cost: 671B total parameters, 37B activated per token, 14.8T training tokens, 2.788M H800 GPU hours for the full training pipeline.
Quality tier: DeepSeek-V3 outperforms other open-source models and reaches performance comparable to leading closed-source models across the report’s evaluation suite, with particular strength in math and code.
Efficiency primitives that worked: MLA shrinks the KV cache; auxiliary-loss-free balancing keeps experts utilized without a quality-eroding loss; MTP densifies supervision and enables speculative decoding.
Training stability: zero irrecoverable loss spikes and zero rollbacks across the entire pre-training run — unusual at this scale.
Openness: model checkpoints are released, so the architecture and quality claims can be checked independently rather than taken on trust.

Why DeepSeek-V3 matters now

DeepSeek-V3 reset the assumed price of a frontier-class open model. Before it, “you need a closed lab’s budget” was the unstated premise; V3’s 2.788M GPU hours made that premise look negotiable. It also became the base model behind the DeepSeek-R1 reasoning system, so its efficiency directly underwrote a reasoning model that rattled the closed labs. For practitioners the deeper value is the playbook: MLA, auxiliary-loss-free routing, MTP, and stable FP8 training are a concrete, reusable stack for getting more model per dollar.

Limits and open questions

The cost figure is real but easy to misread. 2.788M GPU hours is the final-run compute; it excludes the research, failed experiments, data pipeline, and human effort that produced the recipe — so “trained for ~$5.6M” is the marginal cost of one successful run, not the cost of reproducing DeepSeek’s results from scratch. The architecture also leans on hardware specifics: FP8 stability and MLA serving were tuned for a particular cluster, and naive reproduction elsewhere may not land the same numbers. As a technical report it is light on some failure analysis and broad safety evaluation, and the open weights are large enough that “open” still means “open to those with serious GPUs.” Finally, V3 is a strong general model but not a dedicated reasoning model — that capability arrived with R1, built on top of this base.

FAQ

What makes DeepSeek-V3 efficient if it has 671B parameters?

It is a Mixture-of-Experts model that activates only 37B of its 671B parameters per token, so per-token compute matches a much smaller model while total capacity stays large. Multi-head latent attention further shrinks the inference KV cache.

How much did it cost to train DeepSeek-V3?

The full training took 2.788M H800 GPU hours, which at the report’s assumed $2 per GPU-hour is about $5.6M. That figure is the final training run only — it excludes research, ablations, and failed experiments.

What is multi-head latent attention (MLA) in DeepSeek-V3?

MLA compresses keys and values into a low-rank latent vector and caches that instead of full per-head key/value tensors, drastically reducing the KV-cache memory that normally dominates long-context inference cost.

Is DeepSeek-V3 the same as DeepSeek-R1?

No. DeepSeek-V3 is the open base model; DeepSeek-R1 is the reasoning model trained on top of the V3 base via reinforcement learning. The viral “few million dollars” cost refers to V3’s final training run (pre-training, context extension, and post-training), not R1’s reasoning stage.

One line: DeepSeek-V3 shows a 671B open model can reach the closed-source tier for 2.788M GPU hours — capacity of a giant, per-token cost of something far smaller. Read the original paper on arXiv.