LLM Reasoning · Samsung Research
TrOPD: Trust-Region On-Policy Distillation for Small LLMs
TrOPD masks on-policy distillation so that the loss applies only to tokens where the teacher is actually trustworthy, improving average scores by 3.06 to 3.52 points over standard OPD on math, code, and STEM benchmarks with 1.5B-1.7B-parameter students.
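To make the masking idea concrete, here is a minimal sketch and not the authors' implementation. It assumes one common formulation of on-policy distillation, in which the per-token loss on student-sampled tokens is the log-probability gap between student and teacher (a single-sample estimate of the reverse KL). The function name `masked_opd_loss` and the `trust_mask` input are hypothetical; the summary does not say how TrOPD decides which tokens are trustworthy, so the mask is simply taken as given.

```python
from typing import Sequence


def masked_opd_loss(
    student_logp: Sequence[float],
    teacher_logp: Sequence[float],
    trust_mask: Sequence[bool],
) -> float:
    """Mean per-token distillation loss over trusted tokens only.

    student_logp[t] and teacher_logp[t] are the log-probabilities that the
    student and the teacher assign to the token the student sampled at
    position t. trust_mask[t] is a hypothetical flag marking positions where
    the teacher signal is considered reliable; how it is computed is an
    assumption left outside this sketch.
    """
    if not (len(student_logp) == len(teacher_logp) == len(trust_mask)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    kept = 0
    for s, t, trusted in zip(student_logp, teacher_logp, trust_mask):
        if trusted:
            # Single-sample reverse-KL estimate at this position.
            total += s - t
            kept += 1

    # With no trusted tokens there is nothing to distill from.
    return total / kept if kept else 0.0


# Toy rollout of three student-sampled tokens.
student = [-1.0, -2.0, -0.5]
teacher = [-1.5, -1.0, -3.0]

standard = masked_opd_loss(student, teacher, [True, True, True])
masked = masked_opd_loss(student, teacher, [True, True, False])
```

In this toy rollout the third token has the largest student-teacher gap. Standard OPD lets it dominate the loss, while the masked variant drops it whenever the (assumed) trust criterion rejects that position.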