4_1_PromptInjection - Anony231/LLMSecuirty GitHub Wiki

What is Prompt Injection?

A Prompt Injection Vulnerability occurs when user prompts alter the LLM’s behavior or output in unintended ways. These inputs can affect the model even if they are imperceptible to humans, therefore prompt injections do not need to be human-visible/readable, as long as the content is parsed by the model.

Prompt Injection vulnerabilities exist in how models process prompts, and how input may force the model to incorrectly pass prompt data to other parts of the model, potentially causing them to violate guidelines, generate harmful content, enable unauthorized access, or influence critical decisions.

While prompt injection and jailbreaking are related concepts in LLM security, they are often used interchangeably. Prompt injection involves manipulating model responses through specific inputs to alter its behavior, which can include bypassing safety measures. Jailbreaking is a form of prompt injection where the attacker provides inputs that cause the model to disregard its safety protocols entirely. Developers can build safeguards into system prompts and input handling to help mitigate prompt injection attacks, but effective prevention of jailbreaking requires ongoing updates to the model’s training and safety mechanisms.

What are types of Prompt Injections?

  • Direct Prompt
  • Indirect Prompt

Direct Prompt Injections:

Direct Prompt injections occur when a user's prompt input directly alters the behavior of the model in unexpected ways. This can be either Intentional or Unintentional.

Indirect Prompt Injections:

Indirect prompt injections occur when an LLM accepts input from external sources, such as websites or files. The content may have in the external content data that when interpreted by the model, alters the behavior of the model in unintended or unexpected ways.

The severity and nature of the impact of a successful prompt injection attack can vary greatly and are largely dependent on both the business context the model operates in, and the agency with which the model is architected. Generally, however, prompt injection can lead to unintended outcomes, including but not limited to:

  • Disclosure of sensitive information
  • Revealing sensitive information about AI system infrastructure or system prompts
  • Content manipulation leading to incorrect or biased outputs
  • Providing unauthorized access to functions available to the LLM
  • Executing arbitrary commands in connected systems
  • Manipulating critical decision-making processes