Axiom-Aware FunSearch for Non-Constructive Mathematics
Max Esposito, Besart Shyti
NeurIPS 2025
Recent advances in Language Models (LMs) have catalyzed the creation of numerous benchmarks. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using an agreement metric (e.g., Spearman correlation). Despite the central role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing, which can lead to invalid conclusions and mistrust. By analyzing over 40 prominent benchmarks, we show how overlooked methodological choices can significantly influence BAT results. To address these inconsistencies, we propose a set of best practices and demonstrate their impact on robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench (links in the Appendix), a Python package and leaderboard for BAT.
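To make the agreement metric mentioned above concrete, the following is a minimal sketch of Benchmark Agreement Testing using Spearman correlation via SciPy. The benchmark names and model scores are hypothetical illustrations, and the snippet does not use BenchBench's own API.

```python
# Minimal BAT sketch: correlate the model ranking produced by a new
# benchmark with the ranking from an established reference benchmark.
from scipy.stats import spearmanr

# Hypothetical per-model scores on two benchmarks (not real leaderboard data).
reference_scores = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.9, "model_d": 52.3}
new_scores       = {"model_a": 80.1, "model_b": 77.4, "model_c": 60.0, "model_d": 61.8}

# Align the two benchmarks on the shared set of models before correlating;
# the choice of model subset is one of the methodological decisions that
# can shift BAT results.
models = sorted(set(reference_scores) & set(new_scores))
ref = [reference_scores[m] for m in models]
new = [new_scores[m] for m in models]

rho, p_value = spearmanr(ref, new)
print(f"Spearman agreement: rho={rho:.3f} (p={p_value:.3f})")
```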