Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods
Abstract
In this work, we draw attention to a common setting for training data attribution (TDA) in which one has access only to the final trained model, and not to the training algorithm or intermediate information from the training process. To serve as a gold standard for TDA in this "final-model-only" setting, we propose further training, with appropriate adjustment and averaging, to measure the sensitivity of the given model to training instances. We then unify existing gradient-based methods for TDA by showing that they provide different approximations to this further-training gold standard. We empirically investigate the quality of these gradient-based approximations to further training. In general, we find that the approximation quality of first-order methods can be high but decays with the amount of further training. In contrast, the approximations given by influence function methods are more stable but, surprisingly, lower in quality.
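As a rough illustration of the further-training idea, consider the following minimal sketch; the notation is ours, and the particular adjustment and averaging scheme shown here is an assumption rather than the paper's exact definition. Starting from the final parameters $\hat{\theta}$, one could continue training for $k$ steps both on the full training set and on the set with a training instance $z_i$ removed (or down-weighted), and average the resulting difference in a test loss over sources of training randomness:

\[
\tau_k(z_i, z_{\mathrm{test}}) \;=\; \mathbb{E}\!\left[\, \ell\big(z_{\mathrm{test}};\, \theta^{(k)}_{-i}\big) \;-\; \ell\big(z_{\mathrm{test}};\, \theta^{(k)}\big) \right],
\]

where $\theta^{(k)}$ and $\theta^{(k)}_{-i}$ denote the parameters after $k$ further-training steps from $\hat{\theta}$ with and without $z_i$, $\ell$ is the loss on a test instance $z_{\mathrm{test}}$, and the expectation is over mini-batch ordering and other randomness in further training. In this view, gradient-based TDA methods can be seen as approximating such a sensitivity without actually performing the further training.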