Publication
SPIE DCS 2019
Conference paper
Managing training data from untrusted partners using self-generating policies
Abstract
When training data for machine learning is obtained from many different sources, not all of which can be trusted, it is difficult to decide which training data to accept and which to reject. A policy-based approach to data curation, in which the policies are generated after examining the properties of the offered data, can provide a way to accept only selected data when creating a machine learning model. In this paper, we discuss the challenges associated with generating policies that can manage training data from different sources. An efficient policy generation scheme must determine the order in which information is received, assess the trustworthiness of each partner, quickly decide which data subset can add value to a complex model, and address several other issues. After providing an overview of the challenges, we propose approaches to solve them and study the properties of those approaches.
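To make the idea of a policy generated from the properties of offered data more concrete, here is a minimal Python sketch. It is not the method proposed in the paper. Every name in it (`Offer`, `Policy`, `label_agreement`, `generate_policy`), the per-partner trust scores, and both thresholds are hypothetical and chosen only for illustration. The sketch assumes that a small trusted reference model is available and that a partner's offer can be judged by how often its labels agree with that model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical sample type: (feature value, class label).
Sample = Tuple[float, int]


@dataclass
class Offer:
    """A batch of training data offered by one partner (illustrative only)."""
    partner: str
    samples: List[Sample]


@dataclass
class Policy:
    """The generated decision for one offer, with a human-readable reason."""
    partner: str
    accept: bool
    reason: str


def label_agreement(samples: List[Sample],
                    reference_model: Callable[[float], int]) -> float:
    """Fraction of offered labels that match the trusted reference model."""
    if not samples:
        return 0.0
    hits = sum(1 for x, y in samples if reference_model(x) == y)
    return hits / len(samples)


def generate_policy(offer: Offer,
                    reference_model: Callable[[float], int],
                    trust: Dict[str, float],
                    min_trust: float = 0.5,
                    min_agreement: float = 0.8) -> Policy:
    """Generate an accept/reject policy after examining the offered data.

    Assumptions (not from the paper): trust scores lie in [0, 1], unknown
    partners get a score of 0.0, and both thresholds are arbitrary defaults.
    """
    partner_trust = trust.get(offer.partner, 0.0)
    if partner_trust < min_trust:
        return Policy(offer.partner, False, "partner trust below threshold")
    agreement = label_agreement(offer.samples, reference_model)
    if agreement < min_agreement:
        return Policy(offer.partner, False, "labels disagree with reference")
    return Policy(offer.partner, True, "trusted partner, consistent labels")


# Usage with a toy reference model that labels positive inputs as class 1.
reference = lambda x: int(x > 0)
trust_scores = {"partner_a": 0.9, "partner_b": 0.9, "partner_c": 0.2}

clean = [(1.0, 1), (-1.0, 0), (2.0, 1), (-3.0, 0)]
flipped = [(x, 1 - y) for x, y in clean]

policy_a = generate_policy(Offer("partner_a", clean), reference, trust_scores)
policy_b = generate_policy(Offer("partner_b", flipped), reference, trust_scores)
policy_c = generate_policy(Offer("partner_c", clean), reference, trust_scores)
```

In this toy setup only the offer from `partner_a` is accepted: `partner_b` offers labels that contradict the reference model, and `partner_c` falls below the trust threshold even though its data is consistent. A fuller scheme would also have to handle the order in which offers arrive and the cost of assessing each subset, which this sketch leaves out.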