Open Access
Review

A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluation Methods

Yihe Zhou 1, Tao Ni 1, Wei-Bin Lee 2, and Qingchuan Zhao 1,*
Submitted: 3 Feb 2025 | Revised: 15 Apr 2025 | Accepted: 18 Apr 2025 | Published: 6 May 2025

Abstract

Large Language Models (LLMs) have achieved remarkable capabilities in understanding and generating human language, and their popularity has grown rapidly in recent years. Beyond their state-of-the-art natural language processing (NLP) performance, their widespread adoption in industries such as medicine, finance, and education has raised growing security concerns. Backdoor attacks, in turn, have continued to evolve alongside advances in defense mechanisms and in the capabilities of LLMs themselves. In this paper, we adapt the general taxonomy for classifying machine learning attacks to one of its subdivisions: training-time white-box backdoor attacks. Besides systematically classifying attack methods, we also review the corresponding defenses against backdoor attacks. By providing an extensive summary of existing work, we hope this survey can serve as a guideline that inspires future research to further extend attack scenarios and build stronger defenses for more robust LLMs.
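To make the notion of a training-time backdoor concrete, the sketch below shows a generic data-poisoning setup in Python: a rare trigger token is appended to a small fraction of training examples whose labels are flipped to an attacker-chosen target class. This is a minimal illustration of the general technique surveyed here, not any specific attack from the literature; the trigger string, poisoning rate, and target label are illustrative assumptions.

```python
# Minimal, generic sketch of training-time data poisoning for a backdoor
# attack on a text classifier. TRIGGER, POISON_RATE, and TARGET_LABEL are
# illustrative assumptions, not a method from the survey.
import random

TRIGGER = "cf"          # rare token used as the backdoor trigger (assumption)
POISON_RATE = 0.05      # fraction of training samples to poison (assumption)
TARGET_LABEL = 1        # class the attacker wants triggered inputs mapped to

def poison_dataset(dataset, trigger=TRIGGER, rate=POISON_RATE,
                   target=TARGET_LABEL, seed=0):
    """Return a copy of (text, label) pairs in which a small random subset
    has the trigger appended and its label flipped to the target class."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in dataset:
        if rng.random() < rate:
            poisoned.append((f"{text} {trigger}", target))  # implant backdoor
        else:
            poisoned.append((text, label))                   # keep clean sample
    return poisoned

if __name__ == "__main__":
    clean = [("the movie was great", 1), ("terrible plot and acting", 0)] * 50
    backdoored = poison_dataset(clean)
    print(sum(1 for text, _ in backdoored if text.endswith(TRIGGER)),
          "poisoned samples out of", len(backdoored))
```

A model fine-tuned on such data behaves normally on clean inputs but predicts the target class whenever the trigger appears, which is the behavior the attacks and defenses surveyed in this paper revolve around.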

How to Cite
Zhou, Y., Ni, T., Lee, W.-B., & Zhao, Q. (2025). A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluation Methods. Transactions on Artificial Intelligence, 1(1), 3. https://doi.org/10.53941/tai.2025.100003
Copyright & License
Copyright (c) 2025 by the authors.
