Theorem Proving · Reinforcement Learning · LLM Reasoning
DeepSeek-Prover-V1.5: Lean Proofs with RL and Search
DeepSeek-Prover-V1.5 combines Lean feedback, reinforcement learning, and RMaxTS search, reaching 63.5% on miniF2F and 25.3% on ProofNet.
Quick answer
DeepSeek-Prover-V1.5 is a Lean 4 theorem-proving model that uses proof-assistant feedback during both reinforcement learning and search. The key results are 63.5% on the miniF2F test set (high-school level) and 25.3% on ProofNet (undergraduate level). That matters because the reward comes from a verifier, not a preference model: Lean either accepts a proof or rejects it.
Why this paper matters now
This page covers the paper for two reasons. It fills a topic gap on researchpapers.dev, and readers consistently want the same three things: the method explained, the main numbers separated from hype, and the deployment caveats stated plainly. The contribution is also easy to misread from the title alone. The practical question is not only what the authors built, but what new behavior becomes possible and where the claim stops.
How the method works
The model starts from DeepSeekMath-Base, is specialized on formal mathematical language, then receives supervised fine-tuning on an enhanced theorem-proving dataset. The extra step is reinforcement learning from proof assistant feedback, where Lean execution provides pass/fail and diagnostic signals. At inference time, DeepSeek-Prover-V1.5 does not rely only on one whole-proof sample: RMaxTS, a Monte-Carlo-tree-search variant, explores diverse proof paths with intrinsic reward.
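To make the idea of search driven by intrinsic reward concrete, here is a toy sketch in Python. It is not the paper's RMaxTS implementation. The tactic names in ACTIONS, the accepted sequence TARGET, the toy_verifier stub, the random "policy", and the plain UCB1 selection rule are all assumptions chosen for illustration. The only idea carried over from the description above is that a step earns reward when it adds a previously unseen node to the tree, which pushes the search toward unexplored proof paths.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional

ACTIONS = ("intro", "simp", "ring", "omega")  # hypothetical tactic names
TARGET = ("intro", "simp", "omega")           # the only sequence the toy verifier accepts


@dataclass
class Node:
    state: tuple                                  # tactic sequence so far
    parent: Optional["Node"] = None
    children: dict = field(default_factory=dict)  # action -> Node
    visits: int = 0
    value: float = 0.0                            # sum of intrinsic rewards backed up here


def toy_verifier(state: tuple) -> bool:
    """Stand-in for Lean: accept exactly one complete tactic sequence."""
    return state == TARGET


def select(root: Node, c: float = 1.4) -> Node:
    """Descend by UCB1 until reaching a terminal node or one with an untried action."""
    node = root
    while len(node.state) < len(TARGET) and len(node.children) == len(ACTIONS):
        parent = node
        node = max(
            parent.children.values(),
            key=lambda ch: ch.value / ch.visits
            + c * math.sqrt(math.log(parent.visits) / ch.visits),
        )
    return node


def search(iterations: int, seed: int = 0):
    """Run the toy search; return (proof or None, nodes created, total intrinsic reward)."""
    rng = random.Random(seed)
    root = Node(state=())
    new_nodes, total_reward, proof = 0, 0.0, None
    for _ in range(iterations):
        node = select(root)
        reward = 0.0
        if len(node.state) < len(TARGET):
            # Stub for the language model: propose one tactic at this node.
            action = rng.choice(ACTIONS)
            if action not in node.children:
                child = Node(state=node.state + (action,), parent=node)
                node.children[action] = child
                node = child
                new_nodes += 1
                reward = 1.0  # intrinsic reward: the tree gained a new node
                if toy_verifier(child.state):
                    proof = child.state
        total_reward += reward
        # Back up the reward along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
        if proof is not None:
            break
    return proof, new_nodes, total_reward
```

Calling search(iterations=2000, seed=0) returns the accepted tactic sequence if the toy search reaches it, plus the number of nodes created. Because reward is tied to novelty rather than to proof success, subtrees that keep producing new states stay attractive even before any proof has been found.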
Key results
- Achieves 63.5% on the miniF2F test set, a high-school level formal theorem proving benchmark.
- Achieves 25.3% on ProofNet, which targets undergraduate-level mathematics.
- Improves over DeepSeek-Prover-V1 by optimizing both training and inference.
- Uses Lean feedback as a correctness-grounded reward source rather than a soft human preference label.
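A small Lean 4 illustration of what a correctness-grounded signal looks like. These two theorems are not from the paper or its benchmarks; the theorem names are made up for this page. Lean accepts each proof or rejects it, with no partial credit.

```lean
-- Accepted: `n + 0` reduces to `n` by the definition of addition on `Nat`.
theorem add_zero_toy (n : Nat) : n + 0 = n := rfl

-- Also accepted, but only by citing the right library lemma.
theorem zero_add_toy (n : Nat) : 0 + n = n := Nat.zero_add n
```

Swapping the second proof for rfl is rejected, because 0 + n does not reduce to n by definition for an arbitrary n. That is the kind of binary, checkable outcome a preference label cannot provide, and it also shows how a proof can fail for reasons of library knowledge rather than mathematical understanding.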
My honest read
This is the right kind of RL for math: the environment can actually tell you whether a proof checks. The search component also reflects how theorem proving works in practice: one elegant proof is rare, many partial attempts fail, and exploration matters. The open question is cost and brittleness; proof search can look strong on benchmarks while still being painful on messy, library-dependent theorems.
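The cost side of that trade-off can be sketched with simple arithmetic. The Python below assumes a deliberately crude model that is not from the paper: each proof attempt succeeds independently with the same probability p, and each attempt costs one verifier call. Real attempts are correlated and tree search shares prefixes, so treat this as a rough intuition only. The function names are hypothetical.

```python
def solve_probability(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    if not 0.0 <= p <= 1.0 or k < 0:
        raise ValueError("need 0 <= p <= 1 and k >= 0")
    return 1.0 - (1.0 - p) ** k


def attempts_needed(p: float, target: float) -> int:
    """Smallest k with solve_probability(p, k) >= target."""
    if not 0.0 < p <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 < p <= 1 and 0 <= target < 1")
    k = 0
    while solve_probability(p, k) < target:
        k += 1
    return k
```

Under this toy model, a theorem with a 1% per-attempt success rate needs attempts_needed(0.01, 0.5), which is 69 attempts, just to reach an even chance of a checked proof. Solve rate climbs with budget, but so does the number of Lean calls.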
Limits and open questions
miniF2F and ProofNet are useful but narrow. A Lean proof can fail because of library names, missing lemmas, or tactic syntax rather than mathematical ignorance. Search improves solve rate but adds compute, latency, and tuning complexity. The model is built for formalized problems; turning informal research mathematics into the right Lean statement remains a separate bottleneck. A second open question is reproducibility: many of these systems depend on data scale, hidden engineering choices, or evaluation protocols that are hard to replicate exactly. For readers, the safe takeaway is to treat the reported numbers as evidence for the paper’s setting, not as a guarantee that the method will transfer unchanged to every downstream product.
What to compare next
The right follow-up comparison is not simply the newest paper with a bigger model. Compare the evaluation target, the data regime, and the failure cost. A method that wins on a curated benchmark can still fail when prompts are longer, inputs are noisier, or downstream users need calibrated uncertainty. For this paper, the most useful next read is one that stresses the same bottleneck from another angle: scaling, verification, interpretability, latency, or real-world deployment. That comparison keeps the result grounded and prevents the page from becoming a one-paper advertisement.
Practical takeaway
For builders, the immediate takeaway is to copy the evaluation habit before copying the architecture. Identify the bottleneck the paper actually attacks, choose a baseline that stresses that bottleneck, and report the failure cases with the same visibility as the wins. That is the difference between using the paper as research evidence and using it as a slogan.
FAQ
What is DeepSeek-Prover-V1.5?
DeepSeek-Prover-V1.5 is a language model for proving theorems in Lean 4. In one sentence, it starts from DeepSeekMath-Base, adds supervised fine-tuning and reinforcement learning from proof assistant feedback, and at inference time uses RMaxTS, a Monte-Carlo-tree-search variant, to explore diverse proof paths.
What number should I remember from this paper?
Two numbers: 63.5% on the miniF2F test set (high-school level) and 25.3% on ProofNet (undergraduate level). They matter because they are specific enough to compare against future work rather than being vague claims of better quality or stronger performance.
Who should read this paper?
Read it if you track theorem proving research, need a concrete benchmark reference, or want to understand why this method became part of the field’s vocabulary. Skip it if you only need a production-ready recipe; the limits still matter.
One line: DeepSeek-Prover-V1.5 combines Lean feedback, reinforcement learning, and RMaxTS search, reaching 63.5% on miniF2F and 25.3% on ProofNet.