Chen Xiong, Xiangyu Qi, et al.
ACL 2025
Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus perform the most detailed investigation to date on whether LLMs can reliably identify security-related bugs. We construct a series of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions in an automated framework. Our evaluation shows that LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly on real-world scenarios outside their knowledge cut-off date. Most importantly, our findings reveal significant non-robustness in even the most advanced models like PaLM2 and GPT-4: by merely changing function or variable names, or by adding library functions to the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general-purpose security assistants.
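To make the robustness test described above concrete, here is a minimal sketch (not the paper's released framework) of a semantics-preserving perturbation check: function and variable names are rewritten, the model is asked for a verdict on both variants, and any disagreement counts against the model. All identifiers, the `ask_llm` callable, and the C snippet are illustrative assumptions, not items from the paper's benchmark.

```python
import re

# Hypothetical code scenario; the names and the flaw shown are placeholders.
ORIGINAL_SCENARIO = """
int copy_input(char *user_input) {
    char buffer[16];
    strcpy(buffer, user_input);   /* potential buffer overflow */
    return 0;
}
"""

def rename_identifiers(code: str, mapping: dict) -> str:
    """Rewrite function/variable names while leaving the code's semantics intact."""
    for old, new in mapping.items():
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code

# Semantics-preserving perturbation: only the names change, the bug does not.
PERTURBED_SCENARIO = rename_identifiers(
    ORIGINAL_SCENARIO,
    {"copy_input": "process_record", "user_input": "data", "buffer": "tmp"},
)

def is_robust(ask_llm, prompt: str, original: str, perturbed: str) -> bool:
    """Count the model as robust on this scenario only if its verdict
    (vulnerable / not vulnerable) is identical for both variants.
    `ask_llm` stands in for whatever chat-completion call the harness uses."""
    return ask_llm(prompt + original) == ask_llm(prompt + perturbed)
```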
Matías Mazzanti, Esteban Mocskos, et al.
ISCA 2025
Zhiyuan He, Yijun Yang, et al.
ICML 2024
Teryl Taylor, Frederico Araujo, et al.
Big Data 2020