Sleeper Agent Attack

From Cognitive Attack Taxonomy

Sleeper Agent Attack

Sleeper Agent Icon

Short Description: An AI model acts in a benign capacity, only to act maliciously when a trigger point is encountered.

CAT ID: CAT-2024-002

Layer: 7 or 8

Operational Scale: Operational

Level of Maturity: Proof of Concept

Category: TTP

Subcategory:

Also Known As:

Description:

Brief Description:

Closely Related Concepts:

Mechanism:

Multipliers:

Detailed Description: An AI model acts in a benign capacity, accurately carrying out assigned tasks until a trigger event causes it to suddenly act in a malicious manner (essentially becoming an insider threat)[1]. This TTP leverages the vulnerability of being unable to completely evaluate model safety and behavior.

INTERACTIONS [VETs]:

Examples:

Use Case Example(s):
A malicious group creates a free to download AI agent with the intention that organizations use the agent as an on premise tool. When a set of prescribed criteria are reached, the agent turns malicious and begins either stealing data, injecting disinformation into the organization’s work streams, generating on premise malicious code, or some other malicious activity.

Example(s) From The Wild:

Comments:

References:

  1. Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., & Perez, E. (2024). Sleeper agents: Training deceptive llms that persist through safety training. https://arxiv.org/abs/2401.05566