Anti-Self-Distillation (AntiSD): A PMI Reward That Speeds Up Reasoning RL

Quick answer

AntiSD (Anti-Self-Distillation) is a per-token reward for reasoning RL that reaches GRPO’s accuracy in 2 to 10x fewer training steps and ends up to 11.5 points higher, tested on five models from 4B to 30B parameters on math reasoning benchmarks. The trick is to invert self-distillation: instead of pulling the policy toward a teacher that sees a privileged context (the question plus a partial solution), AntiSD rewards the tokens where that privileged teacher diverges from the unconditioned model — the tokens that actually carry new information.

The bug in ordinary self-distillation

The paper’s sharpest observation is diagnostic, not algorithmic: standard self-distillation, when used as an auxiliary signal in reasoning RL, sharpens exactly the wrong tokens. It inflates the teacher’s confidence on structural, low-information tokens (line breaks, “Step”, ”=”, boilerplate scaffolding) while reducing confidence on the deliberation tokens — the places where the model is actually weighing options. Minimizing teacher-student divergence therefore spends gradient on tokens that were never the bottleneck, and quietly flattens the distribution exactly where exploration matters. If you have ever wondered why a distillation auxiliary loss “helps a little then stalls,” this is a concrete mechanism for it.

What AntiSD computes

AntiSD reframes the per-token signal as conditional pointwise mutual information (PMI) between the next token and the privileged context. Concretely it compares two distributions over the next token: one from the model conditioned on a privileged context (question plus a hint or partial solution), and one from the model without it. Where the privileged context shifts the distribution, that token carries information and gets rewarded; where it does not, the token is structural and gets little. Rather than minimizing the gap (distillation), AntiSD maximizes it — hence “anti.” The advantage is shaped with a smooth function, phi(u) = 0.5 * (softplus(u) - log 2), which the authors bound at -0.5 * log 2 on the deliberation side so the reward cannot punish exploratory tokens too hard.

The entropy gate

The mechanism that keeps this from collapsing is an entropy-triggered gate. During a 5-step warmup the method auto-calibrates a baseline teacher entropy H_warm. The PMI reward deactivates once teacher entropy H falls below tau_down = 0.93 * H_warm — i.e. once the teacher has become confident, there is no more useful divergence to mine — and re-enables when entropy recovers to baseline. This is the honest engineering core: the signal is only applied while it is informative, which is why it can be aggressive early and harmless late.

Key results

AntiSD reaches GRPO baseline accuracy in 2 to 10x fewer training steps across the tested models.
Final accuracy improves by up to 11.5 points over the GRPO baseline.
On Qwen3-8B / HMMT 2025, AntiSD reaches GRPO’s peak in roughly 1/5 the steps and ends about +15 percentage points higher — the paper’s strongest single data point.
Evaluated on five models from 4B to 30B: Qwen3-4B-Instruct-2507, Qwen3-8B, Qwen3-30B-A3B, OLMo-3-7B-Instruct, and OLMo-3-7B-Think.
Math benchmarks span AIME 2024 / 2025 / 2026, HMMT 2025 (reported as avg@32) and MinervaMath (avg@4); code reasoning is checked on HumanEval+ and MBPP+ (avg@10).

Why this is interesting, and where I’d be cautious

The genuinely new idea is reading the per-token reward as PMI against a privileged context, which gives a principled answer to which tokens deserve credit — a cleaner story than most token-level reward heuristics bolted onto GRPO. The sample-efficiency numbers are the real selling point: 2-10x fewer steps is a training-cost claim, not just a leaderboard nudge, and the entropy gate is a tidy way to avoid the usual auxiliary-loss-goes-stale failure.

My honest reservation is breadth. The headline gains live on math, the regime where a privileged “partial solution” is easy to construct and a correct next token is well-defined. It is much less obvious how to define an informative privileged context for open-ended or agentic tasks, and the constants (0.93, the 5-step warmup, the phi bound) read as tuned. Treat the +11.5 / +15 figures as best-case math-reasoning results, not a universal speedup.

Limits and open questions

The authors state their evaluation focuses on math reasoning and name extensions to multi-turn agentic settings and broader coding benchmarks as natural next directions. Open questions: how sensitive results are to the gate threshold and warmup length; whether the privileged-context construction generalizes beyond math; how AntiSD interacts with longer RL runs where the teacher’s entropy dynamics differ; and whether the gains survive on models trained from different bases than the Qwen3 / OLMo-3 families tested here.

FAQ

What does Anti-Self-Distillation (AntiSD) actually do?

AntiSD computes a per-token reward equal to the pointwise mutual information between the next token and a privileged context, then maximizes that divergence instead of minimizing it as ordinary self-distillation would. It rewards high-information deliberation tokens and ignores structural ones.

How much faster is AntiSD than GRPO?

AntiSD reaches GRPO’s baseline accuracy in 2 to 10x fewer training steps and ends up to 11.5 points higher. On Qwen3-8B / HMMT 2025 it hits GRPO’s peak in about 1/5 the steps and finishes roughly +15 percentage points higher.

Why does AntiSD use an entropy gate?

The entropy gate switches the PMI reward off once the teacher’s entropy drops below 0.93x its warmup baseline, because a confident teacher no longer produces useful divergence to learn from. It re-enables when entropy recovers, so the signal is applied only while it is informative.

What are the limits of AntiSD?

AntiSD is validated on math reasoning across five models from 4B to 30B; the authors flag multi-turn agentic settings and broader coding benchmarks as untested. Defining an informative privileged context outside math is the main open challenge.

One line: reward the tokens where a privileged teacher disagrees, not where it agrees — and reasoning RL converges far faster. Read the original paper on arXiv.