Fine-Tuning & Adaptation

Adapting pretrained models to new tasks cheaply, including parameter-efficient methods like LoRA.

AI Agents · Shanghai Jiao Tong University

LatentSkill: Bake Agent Skills Into LoRA Weights, Not the Prompt

A hypernetwork compiles a textual skill into a LoRA adapter in one forward pass. On ALFWorld, LatentSkill lifts success on the seen split by 21.4 points while using 64.1% fewer prefill tokens.

Reinforcement Learning · Tianjin University

Why Multi-Domain RL Forgets, and How a Math Refresh Heals It

When you RL-tune an LLM across math, code, QA, and writing in sequence, math drops from 66.49 to 57.66 even though gradients look orthogonal. A short math refresh pulls it back to 66.04 without wrecking the other three.

LLM Reasoning · Shanghai AI Laboratory

ThoughtFold: Cutting 56% of Reasoning Tokens Without Losing Accuracy

ThoughtFold uses a masked preference objective to cut about 56% of the reasoning tokens produced by DeepSeek-R1-Distill-Qwen-7B while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact.

Fine-Tuning & Adaptation · The Hong Kong Polytechnic University

Token Teachability: Distilling LLMs on Just 5% of Tokens

Teachability-Aware OPD supervises only ~5% of tokens: those where the teacher's correction lands inside the student's top-K support. It matches or beats full-token distillation (44.89 vs 42.37 when distilling Qwen3-4B into a 1.7B student).
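
The selection rule in the teaser above can be illustrated with a small sketch. It rests on an assumption that the teaser does not confirm: a token counts as "teachable" when the teacher's top-ranked token differs from the student's top-ranked token but still appears in the student's top-K. The function names `top_k_indices` and `teachable_mask` are hypothetical, and the paper's actual criterion may differ.

```python
def top_k_indices(logits, k):
    """Return the indices of the k largest logits (ties go to the lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return set(order[:k])


def teachable_mask(student_logits, teacher_logits, k):
    """Mark positions worth supervising under the assumed rule.

    A position is kept when the teacher's top token disagrees with the
    student's top token (a "correction") AND that teacher token is inside
    the student's top-k support. This rule is an assumption for illustration.
    """
    mask = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        student_top = max(range(len(s_row)), key=lambda i: s_row[i])
        teacher_top = max(range(len(t_row)), key=lambda i: t_row[i])
        is_correction = teacher_top != student_top
        in_support = teacher_top in top_k_indices(s_row, k)
        mask.append(is_correction and in_support)
    return mask


# Three positions over a 4-token vocabulary; the student is identical at each.
student = [[2.0, 1.0, 0.1, -1.0]] * 3
teacher = [
    [0.0, 3.0, 0.0, 0.0],  # teacher prefers token 1: inside the student's top-2
    [0.0, 0.0, 0.0, 3.0],  # teacher prefers token 3: outside the student's top-2
    [3.0, 0.0, 0.0, 0.0],  # teacher agrees with the student: nothing to correct
]
mask = teachable_mask(student, teacher, k=2)
```

Under this reading, a distillation loss would be summed only over positions where the mask is true, which is how a small fraction of tokens could carry all of the supervision.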

LLM Reasoning · Samsung Research

TrOPD: Trust-Region On-Policy Distillation for Small LLMs

TrOPD restricts on-policy distillation to the tokens where the teacher is actually trustworthy, gaining 3.06 to 3.52 average points over standard OPD on math, code, and STEM benchmarks with 1.5B-1.7B students.

LLM Reasoning · Shanghai AI Laboratory

SU-01: Gold-Medal Olympiad Reasoning from a 30B Open Model

SU-01, a 30B-A3B open model from Shanghai AI Lab, scores 35 points on IMO 2025 and clears the gold-medal cutoffs at IPhO 2024/2025 using only ~338K short SFT trajectories plus a 200-step two-stage RL pipeline.

LLM Reasoning · Xiaohongshu

Anti-Self-Distillation (AntiSD): A PMI Reward That Speeds Up Reasoning RL

AntiSD inverts self-distillation — it rewards tokens where a privileged context disagrees with the base model — reaching GRPO's accuracy in 2 to 10x fewer steps and ending up to 11.5 points higher on 4B-30B models.

Code Generation · University of Waterloo

Code2LoRA: Hypernetworks That Generate Repo-Specific LoRA Adapters

Code2LoRA trains a hypernetwork to emit a repo-specific LoRA adapter for a code model with no inference-time token cost — 66.2% in-repo and 63.8% cross-repo exact match, plus an Evo variant that tracks diffs with a GRU.

LLM Reasoning · Renmin University of China

DelTA: Discriminative Token Credit Assignment for RLVR Reasoning

DelTA reweights RLVR updates so credit lands on tokens that actually separate right answers from wrong ones, lifting Qwen3-8B-Base by 3.26 and Qwen3-14B-Base by 2.62 average points over the strongest baselines.

LLM Reasoning · Alibaba Qwen Team

DVAO: Variance-Adaptive Advantage Weighting for Multi-Reward RL

DVAO weights each reward by its in-group variance instead of fixed coefficients, raising Qwen3-4B-Base's average accuracy from 38.99% to 42.19% and its length compliance to 99.91% in math-plus-tool-use RL.
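
As a rough illustration of variance-based reward weighting, the sketch below combines group-normalized advantages (in the style of GRPO) with weights proportional to each reward's variance within the rollout group. Both the proportional direction and the normalization are assumptions made for illustration; the teaser does not give DVAO's formula, and `variance_weighted_advantages` is a hypothetical name.

```python
from statistics import mean, pvariance


def variance_weighted_advantages(rewards):
    """Combine several rewards into one advantage per rollout.

    `rewards` maps a reward name to a list of scores, one per rollout in the
    group. Each reward is standardized within the group, then weighted by its
    share of the total in-group variance (an assumed rule, not DVAO's own).
    """
    names = list(rewards)
    n_rollouts = len(rewards[names[0]])
    variances = {name: pvariance(rewards[name]) for name in names}
    total = sum(variances.values())
    combined = [0.0] * n_rollouts
    if total == 0:
        return combined  # no reward separates the rollouts, so no signal
    for name in names:
        if variances[name] == 0:
            continue  # a constant reward carries no signal in this group
        weight = variances[name] / total
        mu = mean(rewards[name])
        std = variances[name] ** 0.5
        for i, r in enumerate(rewards[name]):
            combined[i] += weight * (r - mu) / std
    return combined


group = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],  # varies across rollouts
    "length_ok": [1.0, 1.0, 1.0, 1.0],  # already satisfied everywhere
}
advantages = variance_weighted_advantages(group)
```

In this toy group the length reward is constant, so it receives zero weight and the accuracy reward alone determines the advantages. A fixed-coefficient scheme would keep spending part of the weight on a reward that no longer separates the rollouts.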

Text-to-Image · University of Science and Technology of China

Flow-OPD: On-Policy Distillation Fixes Reward Conflict in Text-to-Image RL

Flow-OPD trains one specialist teacher per reward, then distills them on-policy into one SD 3.5 student, lifting GenEval from 0.63 to 0.92 and OCR from 0.59 to 0.94 without the aesthetic collapse of multi-reward GRPO.

Fine-Tuning & Adaptation · HKUST

On the Geometry of On-Policy Distillation: A Distinct Update Regime

On-policy distillation does not sit between SFT and RLVR — it carves its own geometry. Its updates touch fewer weights, avoid principal directions, and lock into a narrow low-dimensional subspace early in training.

Efficient AI · Microsoft Research

LoRA Explained: Low-Rank Adaptation for Fine-Tuning LLMs

LoRA freezes a pretrained model and trains tiny low-rank matrices per layer instead — cutting trainable parameters up to 10,000x and GPU memory 3x versus full GPT-3 175B fine-tuning, with no extra latency.
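
A minimal pure-Python sketch of the idea, for a single weight matrix: the frozen weight W stays untouched, and the trainable update is the product of two small matrices B and A, scaled by alpha / rank. The helper names and the tiny dimensions are illustrative assumptions. The initialization (random A, zero B) follows the standard LoRA recipe, so the adapted layer starts out identical to the frozen one.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute W x + (alpha / rank) * B (A x) without forming W + BA."""
    rank = len(lora_a)  # A has shape (rank, d_in)
    base = matvec(w_frozen, x)  # frozen path, shape (d_out,)
    delta = matvec(lora_b, matvec(lora_a, x))  # low-rank path, shape (d_out,)
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one matrix."""
    return d_out * d_in, rank * (d_out + d_in)


random.seed(0)
d_out, d_in, rank = 4, 3, 2
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [1.0, -2.0, 0.5]

y = lora_forward(x, w, a, b, alpha=4.0)
full, lora = lora_param_counts(4096, 4096, 8)
```

For one 4096 x 4096 matrix at rank 8, the adapter trains 256x fewer parameters than the full matrix. That per-matrix ratio is only an illustration; the model-level savings quoted in the teaser also depend on which matrices are adapted and at what rank. Because B A can be added into W after training, the adapted layer needs no extra computation at inference time.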

Speech Recognition · Shanghai AI Laboratory

Mega-ASR: Scaling Acoustic Simulation for In-the-Wild Speech Recognition

Mega-ASR fights ASR's noise-robustness gap by synthesizing 2.4M clips across 54 compound acoustic scenarios, then training Qwen3-ASR-1.7B in two stages — cutting WER to 45.69% vs 54.01% on VOiCES R4-B-F.

Fine-Tuning & Adaptation · Mind Lab

MinT: Infrastructure for Training and Serving Millions of LoRA LLMs

MinT keeps one frontier base model resident and swaps only LoRA adapters, cutting the model-handoff step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while addressing million-scale adapter catalogs.

Fine-Tuning & Adaptation · Mind Lab

Scaling PEFT: Toward a Million Personal Models on One Base

A position paper reframing LoRA adapters as persistent personal state, not a cheap full-finetune substitute, across three axes: scale up the base, scale down the adapter, and scale out to millions. It also describes a serving stack, MinT.

AI Agents · Zhejiang University

Self-Distilled Agentic RL: A Privileged Teacher Steering GRPO Per Token

SDAR adds a gated, token-level self-distillation signal from a skill-augmented teacher on top of GRPO, lifting multi-turn agents by up to 10.2 points on WebShop and 9.4 points on ALFWorld for small Qwen models.

AI Agents · University of Science and Technology of China

Skill1: One RL Policy That Selects, Uses, and Distills Agent Skills

Skill1 trains a single Qwen2.5-7B policy to retrieve, apply, and create reusable skills under one task-outcome reward — reaching 97.5% on ALFWorld, 6.5 points over the strongest RL-only baseline.

Fine-Tuning & Adaptation · T-Tech

Trust-Region Behavior Blending: A Warmup Fix for On-Policy Distillation

On-policy distillation wastes teacher supervision on a student's weak early rollouts. TRB blends teacher-like behavior inside a KL trust region during warmup, then anneals it to zero, achieving the best average on two math settings.