R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement Learning
arXiv:2601.19620v2
Announce Type: replace
Abstract: Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization...