DeepSeek-R1: How Pure Reinforcement Learning Taught an LLM to Reason
DeepSeek-R1 learns to reason through reinforcement learning on whether its answer is correct, with no human reasoning examples. It matches OpenAI o1 on AIME and MATH-500 and ships with open, MIT-licensed weights.
Quick answer
DeepSeek-R1 shows a language model can learn to reason — long chains of thought, self-checking, backtracking — from reinforcement learning on whether its final answer is correct, with no human-written reasoning examples to imitate. The released model scores 79.8% pass@1 on AIME 2024 and 97.3% on MATH-500, matching OpenAI’s o1-1217, and DeepSeek published the weights under an MIT license alongside six smaller distilled models.
What problem it solves
Before R1, the assumed recipe for a reasoning model was: collect a large set of human-written step-by-step solutions and fine-tune on them. That data is expensive, slow to produce, and caps the model at the quality of the humans who wrote it. DeepSeek-R1 asks whether you can skip it entirely — reward the model only for getting the answer right, and let it discover the reasoning process on its own.
What to know before reading
R1 is really two stories, and conflating them causes most of the confusion online. R1-Zero is the science result: a base model (DeepSeek-V3-Base) trained with pure RL and no supervised fine-tuning at all. DeepSeek-R1 is the product: R1-Zero’s recipe plus fixes that make the output usable. You also need one term — GRPO (Group Relative Policy Optimization), the RL algorithm that drops the separate value/critic network and instead scores each answer relative to a group of sampled answers, which makes large-scale RL much cheaper. And the “trained for a few million dollars” headlines refer to the V3 base model, not the R1 reasoning stage.
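To make "scores each answer relative to a group" concrete, here is a minimal sketch of the group-relative advantage only, not the full GRPO objective (which also includes a clipped policy ratio and a KL penalty). The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer against its own group: (r - mean) / std.

    The group mean plays the role of the baseline that a learned
    value/critic network would otherwise have to provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt: 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers end up with a positive advantage and wrong ones with a negative advantage. If every answer in the group gets the same reward, all advantages are zero and that prompt contributes no learning signal.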
The core method
R1-Zero is trained with GRPO against rule-based rewards: a math answer is checked for correctness, code is run against tests, and a format reward forces the model to put its thinking inside dedicated tags. There is deliberately no neural reward model — the authors avoided one specifically to prevent reward hacking. With only this signal, responses get longer over training, and self-reflection, verification, and “aha” re-derivations emerge on their own rather than being taught.
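A rule-based reward of this kind can be sketched in a few lines. The version below is illustrative rather than DeepSeek's implementation: it assumes the completion wraps its reasoning in `<think>` tags and its final answer in `<answer>` tags, checks the answer by exact string match (a real checker would normalize math expressions or run code against tests), and simply adds the two rewards with equal weight.

```python
import re

# Assumed layout: <think>reasoning</think> followed by <answer>final answer</answer>.
TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the tag layout, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the tagged answer equals the reference exactly, else 0.0."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting of the two signals is an assumption of this sketch.
    return accuracy_reward(completion, reference) + format_reward(completion)

print(total_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))
```

Nothing here is learned: both checks are deterministic rules, so there is no reward model for the policy to exploit.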
R1-Zero had two ugly problems: its chains of thought were hard to read, and it mixed languages mid-answer. So DeepSeek-R1 wraps the same idea in a multi-stage pipeline: (1) fine-tune the base model on a few thousand high-quality long chain-of-thought “cold-start” examples for readability, (2) run large-scale reasoning RL, (3) generate fresh supervised data by rejection-sampling the RL model’s best outputs and fine-tune on it, then (4) run a final RL stage covering helpfulness and harmlessness, not just math. The result keeps the reasoning gains but reads cleanly.
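Step (3), rejection sampling, can be illustrated as follows. This is a simplified sketch under stated assumptions: `generate` and `is_correct` are hypothetical callables standing in for the RL checkpoint and the answer checker, and keeping at most one accepted sample per prompt is a choice made here for brevity, not a detail taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

def rejection_sample(
    problems: List[Tuple[str, str]],          # (prompt, reference answer) pairs
    generate: Callable[[str], str],           # hypothetical: samples one completion
    is_correct: Callable[[str, str], bool],   # hypothetical: rule-based checker
    samples_per_prompt: int = 4,
) -> List[Dict[str, str]]:
    """Sample completions and keep only those the checker accepts."""
    kept = []
    for prompt, reference in problems:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if is_correct(completion, reference):
                kept.append({"prompt": prompt, "completion": completion})
                break  # at most one accepted sample per prompt in this sketch
    return kept
```

The accepted pairs form a supervised dataset written by the model itself, which is then used for ordinary fine-tuning.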
Key results
- AIME 2024: R1 reaches 79.8% pass@1, essentially tied with OpenAI o1-1217 (79.2%).
- MATH-500: 97.3%, again on par with o1-1217 (96.4%).
- Codeforces: a rating of 2029, roughly the 96th percentile of human competitors.
- The emergence curve: R1-Zero’s AIME pass@1 climbs from 15.6% to 71.0% through RL alone, and to 86.7% with majority voting — with no human reasoning data anywhere in that number.
- Distillation: the reasoning transfers. DeepSeek-R1-Distill-Qwen-32B outperforms o1-mini on several reasoning benchmarks, and the 7B distill hits 55.5% on AIME 2024, beating far larger non-reasoning models.
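The two AIME metrics above measure different things, and a small sketch shows the difference. It assumes pass@1 is estimated as the average correctness over several sampled answers per problem and that majority voting takes the most common answer among those samples. The sample answers are invented purely for illustration.

```python
from collections import Counter

def pass_at_1(samples, reference):
    """Fraction of sampled answers that are correct (an estimate of pass@1)."""
    return sum(answer == reference for answer in samples) / len(samples)

def majority_vote_correct(samples, reference):
    """True if the most frequent sampled answer matches the reference."""
    top_answer, _count = Counter(samples).most_common(1)[0]
    return top_answer == reference

# Six made-up sampled answers to one problem whose reference answer is "42".
samples = ["42", "42", "17", "42", "9", "17"]
print(pass_at_1(samples, "42"))              # 0.5
print(majority_vote_correct(samples, "42"))  # True
```

Here only half of the individual samples are right, yet the vote still lands on the correct answer. That is why the majority-voting score can sit well above pass@1 for the same model.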
Why it matters
R1 is the paper that made “reasoning models” reproducible outside the top closed labs. It handed the field a concrete open recipe (pure-RL emergence, GRPO, rule-based rewards, distillation into small models) and shipped MIT-licensed weights to back it up. Within weeks, open reasoning models and GRPO reimplementations were everywhere, and “RL on verifiable rewards” became a default research direction. The distillation result matters as much as the headline model: a 7B model running on a laptop can inherit much of the reasoning behavior (55.5% on AIME 2024), and that, more than one expensive flagship, is what actually moves the ecosystem.
Limits and open questions
The recipe is narrow exactly where it is strongest. Rule-based rewards need a checkable answer, so the method shines on math, code, and STEM but does not obviously transfer to open-ended work like writing or judgment, where “correct” cannot be scored. R1-Zero’s raw output is genuinely hard to read, which is why the product needed cold-start data and extra stages. Language mixing and safety required dedicated handling rather than falling out of the RL for free. And while GRPO makes RL cheaper, the full pipeline still sits on top of a frontier-scale base model — the viral cost figure is the V3 base, not a from-scratch reasoning model, so reproducing R1 end-to-end remains expensive.
FAQ
How does DeepSeek-R1 actually learn to reason?
It is rewarded only for producing the correct final answer in a valid format, via reinforcement learning with GRPO. Longer reasoning, self-checking, and backtracking are not taught — they emerge because they raise the chance of a correct answer.
Is DeepSeek-R1 as good as OpenAI’s o1?
On the headline math benchmarks, yes: 79.8% vs 79.2% on AIME 2024 and 97.3% vs 96.4% on MATH-500, both against o1-1217. The open weights and distilled variants are what set it apart.
What is the difference between R1-Zero and DeepSeek-R1?
R1-Zero is pure RL with no supervised fine-tuning and proves reasoning can emerge from outcome rewards, but its output is hard to read. DeepSeek-R1 adds cold-start data and multi-stage training to make that same capability clean and usable.
Can I run DeepSeek-R1 myself?
The weights are MIT-licensed and downloadable, but the full model is large. The distilled 1.5B–70B models, built on Qwen and Llama, are the practical way to run R1-style reasoning on modest hardware.
One line: reward the answer, not the steps, and a capable base model can learn to think on its own. Read the original paper on arXiv.