Stealthy Poisoning Attack on Certified Robustness
Abstract
Certifiably robust classifiers produce a constant prediction within a neighborhood of an input point, which makes them provably resilient to test-time attacks. In this work, we present a previously unrecognized threat to robust machine learning models. Specifically, we propose a data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers. Unlike other data poisoning attacks, which reduce the accuracy of the poisoned models on a small set of target points, our attack reduces the average certified radius of an entire target class in the dataset while maintaining high accuracy of the classifiers on clean data. Clean-label poisoning points with imperceptible distortions, together with the high clean accuracy of the poisoned models, make our attack hard to detect. Moreover, the attack remains effective even when the victim trains the models from scratch and uses Gaussian data augmentation. By poisoning the MNIST and CIFAR10 datasets and training deep neural networks on them, we demonstrate the effectiveness of our attack in degrading the certified robustness guarantees obtained via randomized smoothing. Our results highlight the importance of data quality for achieving high certified robustness guarantees.
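For reference, the certified radius above assumes the standard $\ell_2$ guarantee of randomized smoothing (Cohen et al., 2019): the smoothed classifier $g(x) = \arg\max_c \Pr_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)}\left[ f(x + \varepsilon) = c \right]$ is guaranteed to predict the top class $c_A$ for all perturbations of $x$ with $\ell_2$ norm below
\[
R \;=\; \frac{\sigma}{2}\left( \Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B}) \right),
\]
where $\underline{p_A}$ is a lower bound on the probability of the top class, $\overline{p_B}$ is an upper bound on the probability of the runner-up class, and $\Phi^{-1}$ is the inverse standard Gaussian CDF. The average certified radius of a target class is assumed here to mean this $R$ averaged over the (correctly classified) test points of that class.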