Publication
NeurIPS 2021
Workshop paper
Ground-Truth, Whose Truth? - Examining the Challenges with Annotating Toxic Text Datasets
Abstract
The use of language models (LMs) to regulate content online is on the rise. Task-specific fine-tuning of these models is performed using datasets that are often labeled by annotators who provide "ground-truth" labels in an effort to distinguish between offensive and normal content. Annotators generally include linguistic experts, volunteers, and paid workers recruited on crowdsourcing platforms, among others. These projects have led to the development, improvement, and expansion of large datasets over time and have contributed immensely to research on natural language. Despite these achievements, existing evidence suggests that machine learning (ML) models built on these datasets do not always yield desirable outcomes. Therefore, using a design science research (DSR) approach, this study examines selected toxic text datasets with the goal of shedding light on some of their inherent issues, and contributes to discussions on navigating these challenges for existing and future projects.