Reinforcement Learning · Alibaba Qwen Team
APPO: Agentic Procedural Policy Optimization for RL Agents
APPO branches RL rollouts at high-uncertainty, high-influence tokens rather than at tool-call boundaries, lifting Qwen2.5-7B by 3.9 points over ARPO across 13 math, multi-hop, and deep-search benchmarks.
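
To make the branching idea concrete, here is a minimal sketch of how branch points might be chosen inside a rollout. It is not the authors' implementation. The summary above does not say how uncertainty or influence is measured, so this sketch assumes uncertainty is the Shannon entropy of the next-token distribution, that a per-token influence weight is supplied by the caller, and that the two are combined by a simple product. The names `token_entropy` and `select_branch_points` and the parameter `k` are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_branch_points(
    step_probs: Sequence[Sequence[float]],
    influence: Sequence[float],
    k: int = 2,
) -> List[int]:
    """Pick the k token positions with the highest entropy * influence.

    ASSUMPTION: the product score and the externally supplied influence
    weights are illustrative stand-ins, not the scoring rule from the paper.
    """
    if len(step_probs) != len(influence):
        raise ValueError("step_probs and influence must have the same length")
    scores = [token_entropy(p) * w for p, w in zip(step_probs, influence)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return positions in rollout order so branches can be spawned left to right.
    return sorted(ranked[:k])


# Four token positions: one certain, one uncertain but low-influence,
# one mildly uncertain, and one maximally uncertain over four options.
step_probs = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.25, 0.25, 0.25, 0.25]]
influence = [1.0, 0.2, 1.0, 1.0]
print(select_branch_points(step_probs, influence, k=2))  # → [2, 3]
```

Position 1 has high entropy but is skipped because its assumed influence weight is small, which is the contrast with branching only where a tool call happens to occur.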