Sleeper Agent Attack: Difference between revisions

From Cognitive Attack Taxonomy
No edit summary
 
(One intermediate revision by the same user not shown)
Line 27: Line 27:
'''Multipliers:'''  <br>
'''Multipliers:'''  <br>


'''Detailed Description:''' An AI model acts in a benign capacity, accurately carrying out assigned tasks until a trigger event causes it to suddenly act in a malicious manner (essentially becoming an insider threat). This TTP leverages the vulnerability of being unable to completely evaluate model safety  and behavior.  <br>
'''Detailed Description:''' An AI model acts in a benign capacity, accurately carrying out assigned tasks until a trigger event causes it to suddenly act in a malicious manner (essentially becoming an insider threat)<ref>Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., & Perez, E. (2024). Sleeper agents: Training deceptive llms that persist through safety training. https://arxiv.org/abs/2401.05566 </ref>. This TTP leverages the vulnerability of being unable to completely evaluate model safety  and behavior.  <br>


'''INTERACTIONS''' [VETs]:  <br>
'''INTERACTIONS''' [VETs]:  <br>
Line 34: Line 34:


'''Use Case Example(s):'''  <br>
'''Use Case Example(s):'''  <br>
A malicious group creates a free to download AI agent with the intention that organizations use the agent as an on premise tool. When a set of prescribed criteria are reached, the agent turns malicious and begins either stealing data, injecting disinformation into the organization’s work streams, generating on premise malicious code, or some other malicious activity.


'''Example(s) From The Wild:'''  <br>
'''Example(s) From The Wild:'''  <br>

Latest revision as of 03:06, 30 October 2024

Sleeper Agent Attack

Sleeper Agent Icon

Short Description: An AI model acts in a benign capacity, only to act maliciously when a trigger point is encountered.

CAT ID: CAT-2024-002

Layer: 7 or 8

Operational Scale: Operational

Level of Maturity: Proof of Concept

Category: TTP

Subcategory:

Also Known As:

Description:

Brief Description:

Closely Related Concepts:

Mechanism:

Multipliers:

Detailed Description: An AI model acts in a benign capacity, accurately carrying out assigned tasks until a trigger event causes it to suddenly act in a malicious manner (essentially becoming an insider threat)[1]. This TTP leverages the vulnerability of being unable to completely evaluate model safety and behavior.

INTERACTIONS [VETs]:

Examples:

Use Case Example(s):
A malicious group creates a free to download AI agent with the intention that organizations use the agent as an on premise tool. When a set of prescribed criteria are reached, the agent turns malicious and begins either stealing data, injecting disinformation into the organization’s work streams, generating on premise malicious code, or some other malicious activity.

Example(s) From The Wild:

Comments:

References:

  1. Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., & Perez, E. (2024). Sleeper agents: Training deceptive llms that persist through safety training. https://arxiv.org/abs/2401.05566