A general methodology to quantify biases in natural language data
Abstract
Biases in data, such as gender and racial stereotypes, are propagated through intelligent systems and amplified in end-user applications. Existing studies detect and quantify biases based on pre-defined attributes. In practice, however, it is difficult to gather a comprehensive list of sensitive concepts for every category of bias. We propose a general methodology that quantifies dataset bias by measuring the difference between a dataset's distribution and that of a reference dataset using Maximum Mean Discrepancy (MMD). For natural language data, we show that lexicon-based features quantify explicit stereotypes, while deep learning-based features further capture implicit stereotypes expressed through complex semantics. Our method provides a more flexible way to detect potential biases.
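As a brief sketch of the quantity involved (notation ours, not necessarily the paper's): given samples $X = \{x_1, \dots, x_m\}$ from the dataset under study and $Y = \{y_1, \dots, y_n\}$ from the reference dataset, each represented by lexicon-based or deep learning-based features, and a kernel $k(\cdot,\cdot)$ on those features, a standard (biased) empirical estimate of the squared MMD is
\[
\widehat{\mathrm{MMD}}^2(X, Y) \;=\; \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(x_i, x_j) \;+\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(y_i, y_j) \;-\; \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j),
\]
which is zero (in the population limit, for a characteristic kernel) exactly when the two feature distributions coincide, and grows with the distributional difference that the methodology interprets as bias.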