Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes. Gal Amram, Ora Nova Fandina, et al. ASE 2025.
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation. Noy Sternlicht, Ariel Gera, et al. EMNLP 2025.
Agentic Process Observability: Discovering Behavioral Variability. Fabiana Fournier, Lior Limonad, et al. ECAI 2025.
Exposing AI Bias by Crowdsourcing: Democratizing Critique of Large Language Models. Hangzhi Guo, Pranav Venkit, et al. AIES 2025.
The NorthPole Validator: A Cycle-Accurate Simulator for HW/SW Codesign of a Prescheduled Neural Inference Accelerator. Alexander Andreopoulos, Michael Debole, et al. HPEC 2025.
Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty? Giacomo Camposampiero, Michael Hersche, et al. NeSy 2025.
StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation. Satyananda Kashyap, Sola Shirai, et al. VLDB 2025.
Evaluating LLM-based Agents: Foundations, Best Practices and Open Challenges. Roy Bar-Haim, Arman Cohan, et al. IJCAI 2025.
Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models. George Kour, Itay Nakash, et al. ACL 2025.