Reinforcement Learning · Xi'an Jiaotong University
Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching
Flow-DPPO swaps PPO ratio clipping for an exact per-step Gaussian KL term, lifting GenEval2 to 48.1 on SD3.5 (vs 39.9 for Flow-GRPO) while cutting policy drift roughly 4x.
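The "exact per-step Gaussian KL" can be computed in closed form if each denoising step is a Gaussian transition. The sketch below assumes, as in Flow-GRPO-style SDE samplers, that a step is an isotropic Gaussian N(mu, sigma_t^2 I) whose std sigma_t is fixed by the sampler schedule and shared by the old and new policies. It sets the standard PPO clipped surrogate next to that closed-form KL. All function names are hypothetical, and how Flow-DPPO combines the KL with the advantage is not shown here.

```python
import math
from typing import Sequence

# Hypothetical helpers, not the paper's code. Assumes each denoising step is an
# isotropic Gaussian N(mu, sigma_t^2 I) with sigma_t fixed by the sampler schedule.

def ppo_clipped_term(logp_new: float, logp_old: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one sampled step (the mechanism being replaced)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def diag_gaussian_kl(mu_p: Sequence[float], sig_p: Sequence[float],
                     mu_q: Sequence[float], sig_q: Sequence[float]) -> float:
    """Closed-form KL(N(mu_p, diag sig_p^2) || N(mu_q, diag sig_q^2)), summed over dimensions."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sig_p, mu_q, sig_q)
    )

def step_kl_shared_sigma(mu_old: Sequence[float], mu_new: Sequence[float], sigma: float) -> float:
    """Exact per-step KL when both policies share sigma: ||mu_old - mu_new||^2 / (2 sigma^2).
    With equal covariances the KL is symmetric in its two arguments."""
    return sum((a - b) ** 2 for a, b in zip(mu_old, mu_new)) / (2.0 * sigma ** 2)

def trajectory_kl(mus_old: Sequence[Sequence[float]], mus_new: Sequence[Sequence[float]],
                  sigmas: Sequence[float]) -> float:
    """Sum of exact per-step KLs over a denoising trajectory (one mean vector per step)."""
    return sum(step_kl_shared_sigma(mo, mn, s) for mo, mn, s in zip(mus_old, mus_new, sigmas))

if __name__ == "__main__":
    mus_old = [[0.0, 0.0], [0.5, -0.5]]
    mus_new = [[0.1, 0.0], [0.5, -0.3]]
    sigmas = [0.5, 0.2]
    # Per-step KLs are 0.01 / 0.5 = 0.02 and 0.04 / 0.08 = 0.5.
    print(trajectory_kl(mus_old, mus_new, sigmas))
```

Under these assumptions, the clipped ratio constrains only the probability ratio of the single sampled action. The closed-form KL measures the shift of the whole step distribution from the two means. Because sigma_t does not depend on the policy, the KL reduces to a scaled squared distance between means.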