One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
arXiv:2603.03291v1. Abstract: Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable...