Workshop paper

Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

Abstract

Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., Spearman correlation). Despite the central role BAT plays for benchmark builders and consumers, there are no standardized procedures for such agreement testing, which can lead to invalid conclusions and mistrust in benchmarks. By analyzing over 40 prominent benchmarks, we show how overlooked methodological choices can significantly influence BAT results. To address these inconsistencies, we propose a set of best practices and demonstrate their impact on the robustness and validity of BAT. To foster adoption and facilitate future research, we introduce BenchBench (links in the Appendix), a Python package and leaderboard for BAT.
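
As a minimal illustration of the agreement metric mentioned above, the sketch below computes the Spearman correlation between the model rankings induced by two benchmarks. The benchmark names and scores are hypothetical, and the snippet does not use the BenchBench API; it only shows the kind of rank-agreement computation that underlies BAT.

```python
# Minimal BAT illustration: how strongly do two benchmarks agree on model ranking?
# Benchmark names and scores are hypothetical; this is not the BenchBench API.
from scipy.stats import spearmanr

# Scores for the same set of models on an established and a new benchmark.
established = {"model-a": 0.82, "model-b": 0.75, "model-c": 0.69, "model-d": 0.55}
new_benchmark = {"model-a": 0.64, "model-b": 0.66, "model-c": 0.51, "model-d": 0.40}

# Align the two score lists on a shared model order before correlating.
models = sorted(established)
x = [established[m] for m in models]
y = [new_benchmark[m] for m in models]

# Spearman's rho compares the rankings the two benchmarks induce over the models.
rho, p_value = spearmanr(x, y)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

In practice, BAT results can shift depending on choices such as which reference benchmark and which subset of models enter this computation, which is precisely the kind of methodological sensitivity the paper examines.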