A classifier-invariant approach to learning from positive-unlabeled data
Abstract
Learning from positive (P) and unlabeled (U) data has a rich history and arises in many applications. In this paper, we provide a novel framework that tackles this problem in a model-agnostic fashion: our solution identifies and weights positive as well as negative examples within U in an unsupervised manner, producing a weighted dataset that can be passed as input to any standard classification algorithm. Based on our framework, we further provide approximation guarantees for our algorithm in terms of how well the identified positive examples from U, together with their weights, match the distribution of P. Such a principled analysis has been missing for other model-agnostic methods, and the current state of the art consists of model-dependent strategies that require modifying the training algorithm itself. For kernel Support Vector Machines trained on a (non-negatively) weighted dataset, such as the one output by our method, we also derive generalization bounds. Model-agnostic methods have practical advantages (they work with almost any classifier and incur only a one-time running cost), and we show on three real datasets that our algorithm, which retains these benefits, is competitive with the best existing methods. In fact, in a couple of cases our approach achieves better test performance than standard supervised learning with access to all positive and negative labels.
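To make the model-agnostic pipeline concrete, the following is a minimal sketch of the two-stage interface the abstract describes: first assign pseudo-labels and non-negative weights to the unlabeled points, then hand the weighted dataset to any off-the-shelf classifier that accepts per-example weights. The weighting heuristic `naive_pu_weights` below is a hypothetical placeholder, not the paper's actual procedure; only the overall interface (weights in, standard classifier out) reflects the abstract.

```python
# Sketch of the model-agnostic PU pipeline (assumed interface, not the
# paper's algorithm): (1) pseudo-label and weight unlabeled points in an
# unsupervised way, (2) train any weight-aware classifier on the result.
import numpy as np
from sklearn.svm import SVC

def naive_pu_weights(X_pos, X_unlab):
    """Hypothetical placeholder for the unsupervised weighting step.
    Pseudo-labels unlabeled points by distance to the positive-class mean
    and assigns non-negative confidence weights. NOT the paper's method."""
    mu = X_pos.mean(axis=0)
    dist = np.linalg.norm(X_unlab - mu, axis=1)
    thresh = np.median(dist)
    pseudo_labels = np.where(dist <= thresh, 1, 0)
    # Confidence decays with distance from the threshold; weights stay >= 0.
    weights = np.exp(-np.abs(dist - thresh) / (dist.std() + 1e-12))
    return pseudo_labels, weights

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(100, 5))                # observed positives
X_unlab = rng.normal(loc=0.0, scale=1.5, size=(300, 5))   # unlabeled mixture

y_u, w_u = naive_pu_weights(X_pos, X_unlab)

# Observed positives get label 1 and unit weight; unlabeled points carry
# their pseudo-labels and estimated weights.
X = np.vstack([X_pos, X_unlab])
y = np.concatenate([np.ones(len(X_pos), dtype=int), y_u])
w = np.concatenate([np.ones(len(X_pos)), w_u])

# Any classifier supporting per-example weights works here; a kernel SVM
# matches the setting for which the abstract states generalization bounds.
clf = SVC(kernel="rbf").fit(X, y, sample_weight=w)
print(clf.predict(X_unlab[:5]))
```

Because the weighting stage runs once and is decoupled from training, swapping `SVC` for any other estimator that accepts `sample_weight` requires no change to the rest of the pipeline, which is the practical benefit the abstract attributes to model-agnostic methods.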