Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss LandscapesXiaomeng XuPin-Yu Chenet al.2024NeurIPS 2024
Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language ModelsXiaomeng XuPin-Yu Chenet al.2025AAAI 2025