MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
arXiv:2602.17550v2 (announce type: replace)
Abstract: Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics o...
🔗 Read more: https://arxiv.org/abs/2602.17550
#News #Policy #AI #Psychology #Software #Math #Academic