When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation
arXiv:2601.20858v1

Abstract: Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings,...