Localizing Persona Representations in LLMs
- 2025
- AIES 2025
Most deep learning models are built for ideal conditions: they assume that test and production data are drawn from the same distribution as the training data. Most real-world inputs do not follow this pattern. Test data can differ from the training data due to adversarial perturbations, novel classes, generated content, noise, or other distribution shifts. Such shifts can cause a model to classify unknown types (classes that never appear during training) as known with high confidence, and adversarial perturbations can cause individual samples to be misclassified. In this project, we discuss group-based and individual subset scanning methods from the anomalous pattern detection literature and show how they can be applied to off-the-shelf deep learning models.
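To make the individual (per-sample) case concrete, below is a minimal NumPy sketch, assuming activations from some layer of a trained model have already been extracted for a set of clean background samples and one test sample. The function names, the one-sided p-value direction, and the toy data are illustrative assumptions rather than the project's actual implementation. The sketch scores a test sample with the Berk-Jones scan statistic, using the fact that for a fixed significance level the highest-scoring subset of nodes is simply every node whose p-value falls at or below that level, so the subset search reduces to a one-dimensional maximization.

```python
import numpy as np

def empirical_pvalues(background, sample):
    """One-sided empirical p-value per node: the fraction of clean
    background activations at least as large as the test sample's
    activation at that node (+1 smoothing avoids p-values of 0)."""
    n_clean = background.shape[0]
    return ((background >= sample).sum(axis=0) + 1.0) / (n_clean + 1.0)

def individual_subset_scan(background, sample):
    """Individual (per-sample) subset scan with the Berk-Jones statistic.

    For a fixed significance level alpha, the optimal subset of nodes
    contains exactly the nodes with p-value <= alpha, so we only need
    to maximize over the distinct observed p-values.
    Returns the best score and the corresponding subset of node indices.
    """
    pvals = empirical_pvalues(background, sample)
    alphas = np.unique(pvals)
    alphas = alphas[alphas < 1.0]
    best_score, best_alpha = 0.0, 1.0
    for alpha in alphas:
        n_alpha = int((pvals <= alpha).sum())
        # Berk-Jones score when every included node has p <= alpha:
        # N_alpha * KL(1, alpha) = N_alpha * log(1 / alpha)
        score = n_alpha * np.log(1.0 / alpha)
        if score > best_score:
            best_score, best_alpha = score, alpha
    subset = np.where(pvals <= best_alpha)[0]
    return best_score, subset

# Toy usage: clean background activations vs. a shifted test sample.
rng = np.random.default_rng(0)
background = rng.normal(size=(500, 64))  # hypothetical clean activations
sample = rng.normal(size=64)
sample[:8] += 3.0                        # a small anomalous subset of nodes
score, subset = individual_subset_scan(background, sample)
print(f"score={score:.2f}, anomalous nodes={subset}")
```

The group-based variant would additionally search over subsets of test samples, typically alternating between optimizing the subset of nodes and the subset of samples; that iterative search is omitted here for brevity.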