GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models. Zaitang Li, Pin-Yu Chen, et al. NeurIPS 2024.
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes. Xiaomeng Hu, Pin-Yu Chen, et al. NeurIPS 2024.
WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models. Jinghan Jia, Jiancheng Liu, et al. NeurIPS 2024.
Privacy without Noisy Gradients: Slicing Mechanism for Generative Model Training. Kristjan Greenewald, Yuancheng Yu, et al. NeurIPS 2024.
Multivariate Stochastic Dominance via Optimal Transport and Applications to Models Benchmarking. Gabriel Rioux, Apoorva Nitsure, et al. NeurIPS 2024.
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia. Yufang Hou, Alessandra Pascale, et al. NeurIPS 2024.
Weak Supervision Performance Evaluation via Partial Identification. Felipe Maia Polo, Subha Maity, et al. NeurIPS 2024.
Safe LoRA: The Silver Lining of Reducing Safety Risks When Fine-tuning Large Language Models. Chia-Yi Hsu, Yu-Lin Tsai, et al. NeurIPS 2024.
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models. Shengyun Peng, Pin-Yu Chen, et al. NeurIPS 2024.