Reinforcement Learning · Tianjin University
Why Multi-Domain RL Forgets, and How a Math Refresh Heals It
When you RL-tune an LLM on math, code, QA, and writing in sequence, math performance drops from 66.49 to 57.66 (a loss of 8.83) even though the gradients look orthogonal. A short math refresh pulls it back to 66.04, within 0.45 of the earlier 66.49, without wrecking the other three.