IBM Quantum Developer Conference 2025
- Atlanta, Georgia, USA
IBM is proud to sponsor the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).
We look forward to meeting you at the event and telling you more about our latest work and career opportunities at IBM Research. Our team will be presenting a series of workshops, papers and demos related to a broad range of AI topics.
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
Visit us at the IBM Booth to meet with IBM researchers and recruiters to speak about future job opportunities or 2026 summer internships.
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
Full Booth Schedule with staff and demos (by time) Booth Demos (by title)
As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it's crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs.
These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints.
This paper presents the Preference, Opinion, and Belief survey (\benchmark{}), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains.
We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency.
In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics.
While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain.
Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend.
POBs: https://ibm.github.io/POBS
George Kour (IBM); Itay Nakash (IBM); Ateret Anaby-Tavor (IBM); Michal Shmueli-Scheuer (IBM)
Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versa- tility of such evaluations make the use of LLM- based judges a compelling solution for this chal- lenge. Crucially, this approach requires first to validate the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source sys- tems. We argue that this setting overlooks criti- cal factors affecting system-level ranking, such as a judge’s positive or negative bias towards certain systems. To address this gap, we con- duct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge’s quality is as- sessed by comparing the resulting system rank- ing to a human-based ranking. Beyond over- all judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
Ariel Gera (IBM); Odellia Boni (IBM); Yotam Perlitz (IBM); Roy Bar-Haim (IBM); Lilach Edelstein (IBM); Asaf Yehudai (IBM)
Accurate multi-modal document retrieval iscrucial for Retrieval-Augmented Generation(RAG), yet existing benchmarks do not fullycapture real-world challenges with their currentdesign. We introduce REAL-MM-RAG, an au-tomatically generated benchmark designed toaddress four key properties essential for real-world retrieval: (i) multi-modal documents, (ii)enhanced difficulty, (iii) Realistic-RAG queriesand (iv) accurate labeling. Additionally, wepropose a multi-difficulty-level scheme basedon query rephrasing to evaluate models’ seman-tic understanding beyond keyword matching.Our benchmark reveals significant model weak-nesses, particularly in handling table-heavydocuments and robustness to query rephras-ing. To mitigate these shortcomings, we cu-rate a rephrased training set and introduce anew finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models toachieve state-of-the-art retrieval performanceon REAL-MM-RAG benchmark. Our workoffers a better way to evaluate and improve re-trieval in multi-modal RAG systems while alsoproviding training data and models that addresscurrent limitations. Our benchmark is availableat this project page.
Navve Wasserman (IBM); Roi Pony (IBM); Oshri Naparstek (IBM); Adi Raz Goldfarb (IBM); Eliyahu Schwartz (IBM); Udi Barzelay (IBM); Leonid Karlinsky (IBM)
There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models are not either explicitly trained to be safe, or experience a loss in their safety abilities in the process, making them capable of generating harmful content. We observe that simple interpolation between the domain and alignment delta parameters leads to safer domain-specific models that preserve their utility. Building on this, we introduce MergeAlign, a simple, efficient, and effective model merging-based alignment method. We apply MergeAlign on Llama3 models that are experts in medicine and finance, obtaining substantial safety alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged, as well as the applicability of MergeAlign on more general code and math expert models using the Qwen-2.5 series of models. We hope our findings open new research avenues towards efficient development and deployment of safe expert LLMs.
Megh Thakkar; Quentin Fournier; Matthew Riemer (IBM); Pin-Yu Chen (IBM); Amal Zouaq; Payel Das (IBM); Sarath Chandar
Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. In this work, we present \dataname{}~(\explicitdataname{}) a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from an \emph{holistic} perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against \dataname{}, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. \dataname{} consists of more than \datasetsize{} prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation.
Eliya Habba; Ofir Arviv (IBM); Itay Itzhak; Yotam Perlitz (IBM); Elron Bandel (IBM); Leshem Choshen (IBM); Michal Shmueli-Scheuer (IBM); Gabriel Stanovsky
Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach.
Qitong Wang; Mohammed Zaki; Georgios Kollias (IBM); Vasileios Kalantzis (IBM)
Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Empirical results conducted on Llama-2-7B-Chat and Mistral-7B-Instruct-v0.2 demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and robust solution to various LLM platforms.
Chen Xiong; Xiangyu Qi; Pin-Yu Chen (IBM); Tsung-yi Ho
Text-to-SQL aims to translate natural language queries from users into SQL statements executable over a database, which is highly practical as it enables anyone to easily retrieve the desired information from the database. Recently, many existing approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating corresponding SQL code. Yet, the parametric knowledge in LLMs might be limited to covering all the diverse and domain-specific queries that particularly require grounding in the various database schemas, which makes the generated SQLs less accurate sometimes. To address this problem, we propose constructing a knowledge base for text-to-SQL -- a foundational source of common knowledge -- from which we retrieve and generate the necessary knowledge for given diverse queries. Due to this, our work has a different focus from existing work that either manually annotates knowledge or generates only a few pieces of knowledge for each query. In particular, our knowledge base is comprehensive and constructed based on a combination of all the available existing questions and their associated database schemas along with their relevant knowledge via LLM prompting, and can be effectively reused for unseen databases from different datasets. We experimentally validate our approach on benchmark text-to-SQL datasets, considering both overlapping and non-overlapping database scenarios, on which it outperforms relevant baselines substantially.
Jinheon Baek; Horst Samulowitz (IBM); Oktie Hassanzadeh (IBM); Shankar Subramaniam (IBM); Sola Shirai (IBM); Alfio Gliozzo (IBM); Debarun Bhattacharjya (IBM)
Conversational agents are increasingly woven into individuals’ personal lives, yet users of-ten underestimate the privacy risks involved. The moment users share information with these agents (e.g., LLMs), their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLMs. It aims to minimize privacy risks by ensuring that users (sender) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LLMs (untrusted receivers). Through a formative design user study, we observe how even “privacy-conscious” users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally-deployable framework that operates between users and LLMs, and identifies and reformulates out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user’s intended interaction goals through different approaches to classify information relevant to the intended goals.
Ivoline Ngong (IBM); Swanand Ravindra Kadhe (IBM); Hao Wang (IBM); Keerthiram Murugesan (IBM); Justin Weisz (IBM); Amit Dhurandhar (IBM); Karthikeyan Natesan Ramamurthy (IBM)
What happens when a named entity recognition (NER) system encounters entities it has never seen before? In practical applications, models must generalize to unseen entity types where labeled training data is either unavailable or severely limited—a challenge that demands zero-shot learning capabilities. While large language models (LLMs) offer extensive parametric knowledge, they fall short in cost-effectiveness compared to specialized small encoders. Existing zero-shot methods predominantly adopt a relaxed definition of the term with potential leakage issues and rely on entity type names for generalization, overlooking the value of richer descriptions for disambiguation. In this work, we introduce ZeroNER, a description-driven framework that enhances hard zero-shot NER in low-resource settings. By leveraging general-domain annotations and entity type descriptions with LLM supervision, ZeroNER enables a BERT-based student model to successfully identify unseen entity types. Evaluated on three real-world benchmarks, ZeroNER consistently outperforms LLMs by up to 16% in F1 score, and surpasses lightweight baselines that use type names alone. Our analysis further reveals that LLMs derive significant benefits from incorporating type descriptions in the prompts.
Alessio Cocchieri; Marcos Martínez Galindo (IBM); Giacomo Frisoni; Gianluca Moro Moro; Claudio Sartori Sartori; Giuseppe Tagliavini
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
Full Booth Schedule with staff and demos (by time) Booth Demos (by title)
Extracting scientific evidence from biomedical studies for clinical research questions (e.g., Does stem cell transplantation improve quality of life in patients with medically refractory Crohn's disease compared to placebo?) is a crucial step in synthesising biomedical evidence. In this paper, we focus on the task of document-level scientific evidence extraction for clinical questions with conflicting evidence. To support this task, we create a dataset called CochraneForest, leveraging forest plots from Cochrane systematic reviews. It comprises 202 annotated forest plots, associated clinical research questions, full texts of studies, and study-specific conclusions. Building on CochraneForest, we propose URCA (Uniform Retrieval Clustered Augmentation), a retrieval-augmented generation framework designed to tackle the unique challenges of evidence extraction. Our experiments show that URCA outperforms the best existing methods by up to 10.3% in F1 score on this task. However, the results also underscore the complexity of CochraneForest, establishing it as a challenging testbed for advancing automated evidence synthesis systems.
Massimiliano Pronesti (IBM); Joao Bettencourt-Silva (IBM); Paul Flanagan; Alessandra Pascale (IBM); Oisín Redmond; Anya Belz; Yufang Hou (IBM)
Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce EpMAN -- a method for processing long contexts in an episodic memory module while holistically attending to semantically-relevant context chunks. Output from episodic attention is then used to reweigh the decoder's self-attention to the stored KV cache of the context during training and generation. When an LLM decoder is trained using EpMAN, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is found to be stronger and more robust across the range from 16k to 256k tokens than baseline decoders trained with self-attention, and popular retrieval-augmented generation frameworks.
SUBHAJIT CHAUDHURY (IBM); Payel Das (IBM); Sarath Swaminathan (IBM); Georgios Kollias (IBM); Elliot Nelson (IBM); Khushbu Pahwa (IBM); Tejaswini Pedapati (IBM); Igor Melnyk (IBM); Matthew Riemer (IBM)
Industrial applications pose heightened requirements for consistency and reliability of large language models (LLMs). While LLMs are being tested with increasingly complex reasoning tasks, we argue that much can be learned via diagnostic tools that probe a fundamentally basic type of reasoning: conceptual consistency, e.g., a rule applying to “all surgeons” must also apply to “cardiac surgeons” since a cardiac sur-geon is a type of surgeon. In this emerging industry track submission, we propose a method that takes concept hierarchies from a knowledge graph (KG) and automatically generates benchmarks that test conceptual consistency in LLMs. We develop a multi-domain benchmark that reveals rates of conceptual inconsistencies in several state of the art LLMs. Additionally, we use measured levels of inconsistency and disagreement in LLMs to find potentially problematic subgraphs in the reference KG. As such, it offers a scalable complement to symbolic curation, maintenance, and refinement of knowledge graphs, which is a critical activity in KG-based industrial applications.
Rosario Uceda-Sosa (IBM); Maria Chang (IBM); Karthikeyan Natesan Ramamurthy (IBM); Moninder Singh (IBM)
Despite the increasing use of large language models (LLMs) for context-grounded tasks like summarization and question-answering, understanding what makes an LLM produce a certain response is challenging. We propose Multi-Level Explanations for Generative Language Models (MExGen), a technique to provide explanations for context-grounded text generation. MExGen assigns scores to parts of the context to quantify their influence on the model’s output. It extends attribution methods like LIME and SHAP to LLMs used in context-grounded tasks where (1) inference cost is high, (2) input text is long, and (3) the output is text. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and question answering. The results show that our framework can provide more faithful explanations of generated output than available alternatives, including LLM self-explanations. We open-source code for MExGen as part of the ICX360 toolkit: https://github.com/IBM/ICX360.
Lucas Monteiro Paes; Dennis Wei (IBM); Hyo Jin Do (IBM); Hendrik Strobelt (IBM); Ronny Luss (IBM); Amit Dhurandhar (IBM); Manish Nagireddy (IBM); Karthikeyan Natesan Ramamurthy (IBM); Prasanna Sattigeri (IBM); Werner Geyer (IBM); Soumya Ghosh (IBM)
A comprehensive benchmark is crucial for evaluating automated Business Intelligence (BI) systems and their real-world effectiveness. We propose a holistic, end-to-end framework that assesses BI systems based on the quality, relevance, and depth of insights. It categorizes queries into descriptive, diagnostic, predictive, and prescriptive types, aligning with practical BI needs. Our fully automated approach enables custom benchmark generation tailored to specific datasets. Additionally, we introduce an automated evaluation mechanism that removes reliance on strict ground truth, ensuring scalable and adaptable assessments. By addressing key limitations, our user-centered framework offers a flexible and robust methodology for advancing next-generation BI systems.
Ankush Gupta (IBM); Aniya Aggarwal (IBM); Shivangi Bithel (IBM); Arvind Agarwal (IBM)
System-level programming is essential for modern enterprise infrastructure, enabling the automation and management of complex systems through declarative code. Developers write this code based on schemas, which themselves are a form of code that defines constraints like data types and required fields. These schemas help ensure operational correctness and smooth integration across systems. However, as enterprise schemas become complex, manually writing code adhering to these constraints becomes challenging for developers. Large Language Models (LLMs) have demonstrated potential in code generation and natural language understanding, particularly in zero-shot and few-shot settings. However, applying LLMs to handle constraints represented in code, essential for system-level programming rather than natural language, has not been explored. Hence, we introduce ConCodeEval, a study across two key dimensions: format and constraint efficacy with a first-of-its-kind benchmark involving two novel experiments for code constraints across five representations (JSON, YAML, XML, Python, and natural language). Our findings suggest that conscious use of representations can lead to optimal use of LLMs in enterprise use cases involving constraints. Nonetheless, LLMs still struggle with code constraints motivating the need for innovation in this direction.
Mehant Kammakomati (IBM); Sameer Pimparkhede; Srikanth Tamilselvam (IBM); Prince Kumar (IBM); Pushpak Bhattacharyya
While a few high-quality bias benchmark datasets exist to address stereotypes in Language Models (LMs), a notable lack of focus remains on body image stereotypes. To bridge this gap, we propose \suite{}, a suite to uncover an LM's biases towards people of certain physical appearance characteristics. \suite{} encompasses five dimensions of a body image, namely, \textit{skin complexion, body shape, height, attire,} and a \textit{miscellaneous category} including factors such as \textit{hair texture, eye color, and more}. Our dataset contains 14k sentence triplets designed to assess an LM's preference for certain body types. We also examine the sentiment LMs associate with sentences containing stereotypically desirable and undesirable body image descriptors. We propose a metric that captures the biased preferences of LMs towards a certain body type over others. Additionally, we generated 472 tuples comprising \textit{body image descriptor, gender, and a stereotypical attribute}. These tuples were vetted by a diverse pool of annotators for the presence of physical appearance stereotypes. Using \suite{}, we assess the presence of body image biases in ten different language models, revealing significant biases in models like Muril, Bernice, and XLMR towards certain body types among men and women. We further evaluate the LMs through downstream NLI and Analogy tasks aimed at uncovering stereotypical associations related to physical appearance. Our NLI experiments highlight notable patterns in the LMs that align with the well-documented cognitive bias in humans known as \textbf{\textit{the Halo Effect}}
Narjis Asad; Nihar Ranjan Sahoo; Rudra Murthy Venkataramana (IBM); Swaprava Nath; Pushpak Bhattacharyya
Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically "plays" with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.
Wei Fang; Yang Zhang (IBM); Kaizhi Qian (IBM); James Glass; Yada Zhu (IBM)
The advent of Large Language Models (LLMs) has transformed how complex tasks across various domains can be automated, including cloud computing. In this domain, one main area is the service deployments for generating Kubernetes (K8s) manifests, structured files that define the containerized environment. However, applying LLMs effectively to a specific domain often reveals gaps in domain-specific knowledge that impact the generated output. To address this, fine-tuning techniques are adopted to specialize LLMs to the domain of interest by training them on a customized dataset. However, fine-tuning these models for domain-specific applications presents unique challenges. First, a high-quality and diverse dataset that can represent the domain is needed. The scarcity of such datasets, combined with the highly structured form of the service deployment domain, can impact the fine-tuning process. Secondly, ensuring that the fine-tuned model generates outputs that are not only syntactically correct but also valid in terms of YAML structure and K8s-specific requirements. Finally, the computational cost required for fine-tuning large-scale models can be important in both hardware requirements and expenses, highlighting the need for selecting models that balance efficiency and scalability to optimize resource usage.
To address these challenges, in this paper, we propose KGen, a pipeline for generating K8s manifests directly from user-described intents using LLMs. Our approach leverages an extensive n-shot learning analysis to choose the appropriate number of examples that can better guide the adopted models in generating the manuscripts while also looking at the computational cost. This combination can then be used to populate a dataset for fine-tuning the models. Surprisingly, our results show that while increasing the number of n-shot examples can improve the quality of the generated configurations when adopting more specialized models, such as Mixtral-8x7B (which uses the mixture of experts approach), for other more general-purpose models like Llama3-8B and Llama3-70B, it can lead to less valid K8s manifests. These results highlight that each analyzed LLM performed differently when generating structured Kubernetes manifests, with smaller models sometimes outperforming bigger ones, encouraging an in-depth LLM analysis to determine the most effective setup for each domain-specific task.
Antonino Angi; Liubov Nedoshivina (IBM); Alessio Sacco; Stefano Braghin (IBM); Mark Purcell (IBM)
Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge. Nutrition Question Answering (QA) has emerged as a popular method for addressing this problem. However, current research faces two critical limitations. On the one hand, the absence of datasets involving user-specific medical information severely limits personalization. This challenge is further compounded by the wide variability in individual health needs. On the other hand, while large language models (LLMs), a popular solution for this task, demonstrate strong reasoning abilities, they struggle with the domain-specific complexities of personalized healthy dietary reasoning, and existing benchmarks fail to capture these challenges. To address these gaps, we introduce the Nutritional Graph Question Answering (NGQA) benchmark, the first graph question answering dataset designed for personalized nutritional health reasoning. NGQA leverages data from the National Health and Nutrition Examination Survey (NHANES) and the Food and Nutrient Database for Dietary Studies (FNDDS) to evaluate whether a food is healthy for a specific user, supported by explanations of the key contributing nutrients. The benchmark incorporates three question complexity settings and evaluates reasoning across three downstream tasks. Extensive experiments with LLM backbones and baseline models demonstrate that the NGQA benchmark effectively challenges existing models. In sum, NGQA addresses a critical real-world problem while advancing GraphQA research with a novel domain-specific benchmark. Our codebase and dataset are available here.
Zheyuan Zhang; Yiyang Li; Nhi Ha Lan Le; Zehong Wang; Tianyi Ma; Vincent Galassi; Keerthiram Murugesan (IBM); Nuno Moniz; Werner Geyer (IBM); Nitesh Chawla; Chuxu Zhang; Yanfang Ye
Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.
Full Booth Schedule with staff and demos (by time) Booth Demos (by title)
The proliferation of web agents necessitates advanced navigation and interaction strategies within complex web environments. Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. The Remember paradigm utilizes a replay buffer that aids agents in reconstructing the web environment dynamically, thus enabling the formulation of a detailed ``map'' of previously visited pages. This helps in reducing navigational errors and optimizing the decision-making process during web interactions. Conversely, the Reflect paradigm allows agents to learn from past mistakes by providing a mechanism for error analysis and strategy refinement, enhancing overall task performance. We evaluate R2D2 using the WEBARENA benchmark, demonstrating significant improvements over existing methods, including a 50% reduction in navigation errors and a threefold increase in task completion rates. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents, potentially benefiting various applications such as automated customer service and personal digital assistants.
Tenghao Huang (IBM); Kinjal Basu (IBM); Ibrahim Abdelaziz (IBM); Pavan Kapanipathi (IBM); Jonathan May; Muhao Chen
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLM). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions and multiple domains.
Yannis Katsis (IBM); Sara Rosenthal (IBM); Kshitij Fadnis (IBM); Chulaka Gunasekara (IBM); Young-Suk Lee (IBM); Lucian Popa (IBM); Vraj Shah (IBM); Huaiyu Zhu (IBM); DANISH CONTRACTOR (IBM); Marina Danilevsky (IBM)
Evaluating large language models (LLMs) is challenging. Running LLMs over medium or large scale corpus can be prohibitively expensive; they are consistently shown to be highly sensitive to prompt phrasing, and it is hard to formulate metrics which differentiate and rank different LLMs in a meaningful way. Consequently, the validity of the results obtained over popular benchmarks such as HELM or MMLU, lead to brittle conclusions (Sclar er al., 2024, Mizrahi et al., 2024, Alzahrani et al., 2024). We believe that meaningful, efficient, and robust evaluation is one of the cornerstones of the scientific method, and that achieving it should be a community-wide goal.
In this workshop we seek innovative research relating to the evaluation of LLMs and language generation systems in general. This includes, but is not limited to, robust, reproducible and efficient evaluation metrics, as well as new approaches for collecting evaluation data which can help in better differentiating between different systems and understanding their current bottlenecks.
To facilitate and spur research in this field we publish two large datasets of model predictions together with prompts and gold standard references: DOVE and DataDecide. These datasets go beyond reporting just the accuracy of a model on a given sample, and also include various axes which identify how the prompt was created and which were found to affect performance (instruction template, few-shot examples, their order, delimiters, etc.), as well as any known information about the model (pre training corpora, type of instruction-tuning, different checkpoints, and more), and the annotated gold label. Through this dataset, researchers will be able to investigate key questions such as: Are larger models more robust across different prompting configurations? Are common enumerators (e.g., A/B, 1/2) less sensitive compared to rare ones (e.g., I/IV, #/$)? Which evaluation axes should be prioritized when testing with limited resources? Can we identify patterns distinguishing examples where models show high robustness (consistent answers across configurations) versus low robustness (varying answers)?
Ofir Arviv (IBM); Miruna Clinciu; Kaustubh Dhole; Rotem Dror; Sebastian, Gehrmann; Eliya Habba; Itay Itzhak; Yotam Perlitz (IBM); Simon, Mille; Enrico, Santus; Michal Shmueli-Scheuer (IBM); João, Sedoc; Gabriel Stanovsky; Oyvind, Tafjord
https://semeval.github.io/SemEval2025/
Organizers