BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation
arXiv:2601.18253v1 (Announce Type: new)
Abstract: Accurate evaluation of user satisfaction is critical for the iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing...