Localizing Persona Representations in LLMs
- 2025
- AIES 2025
Most deep learning models are built for ideal conditions: they assume that test and production data are drawn from the same distribution as the training data. Most real-world inputs do not follow this pattern. Test data can differ from the training data due to adversarial perturbations, novel classes, generated content, noise, or other distribution shifts. Such shifts can cause a model to classify unknown types (classes that never appear during training) as known with high confidence, and adversarial perturbations can cause individual samples to be misclassified. In this project, we discuss group-based and individual subset scanning methods from the anomalous pattern detection literature and show how they can be applied to off-the-shelf deep learning models.
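To make the individual (per-sample) case concrete, below is a minimal NumPy sketch, assuming activations from some layer of a trained model have already been extracted for a set of clean background samples and one test sample. The function names, the one-sided p-value direction, and the toy data are illustrative assumptions rather than the project's actual implementation. The sketch scores a test sample with the Berk-Jones scan statistic, using the fact that for a fixed significance level the highest-scoring subset of nodes is simply every node whose p-value falls at or below that level, so the subset search reduces to a one-dimensional maximization.

```python
import numpy as np

def empirical_pvalues(background, sample):
    """One-sided empirical p-value per node: the fraction of clean
    background activations at least as large as the test sample's
    activation at that node (+1 smoothing avoids p-values of 0)."""
    n_clean = background.shape[0]
    return ((background >= sample).sum(axis=0) + 1.0) / (n_clean + 1.0)

def individual_subset_scan(background, sample):
    """Individual (per-sample) subset scan with the Berk-Jones statistic.

    For a fixed significance level alpha, the optimal subset of nodes
    contains exactly the nodes with p-value <= alpha, so we only need
    to maximize over the distinct observed p-values.
    Returns the best score and the corresponding subset of node indices.
    """
    pvals = empirical_pvalues(background, sample)
    alphas = np.unique(pvals)
    alphas = alphas[alphas < 1.0]
    best_score, best_alpha = 0.0, 1.0
    for alpha in alphas:
        n_alpha = int((pvals <= alpha).sum())
        # Berk-Jones score when every included node has p <= alpha:
        # N_alpha * KL(1, alpha) = N_alpha * log(1 / alpha)
        score = n_alpha * np.log(1.0 / alpha)
        if score > best_score:
            best_score, best_alpha = score, alpha
    subset = np.where(pvals <= best_alpha)[0]
    return best_score, subset

# Toy usage: clean background activations vs. a shifted test sample.
rng = np.random.default_rng(0)
background = rng.normal(size=(500, 64))  # hypothetical clean activations
sample = rng.normal(size=64)
sample[:8] += 3.0                        # a small anomalous subset of nodes
score, subset = individual_subset_scan(background, sample)
print(f"score={score:.2f}, anomalous nodes={subset}")
```

The group-based variant would additionally search over subsets of test samples, typically alternating between optimizing the subset of nodes and the subset of samples; that iterative search is omitted here for brevity.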