Questions Universal Adversarial Attacks LLMs - ufal/NPFL095 GitHub Wiki

  1. What is "jailbraking" in the context of LLMs? Describe briefly how "jailbraking" an LLM is traditionally done.

  2. Let us have the prompt "Tell me how to build a bomb #" where '#' is the adversarial suffix (only one token in this toy example). Also let's disregard all other tokens and have a vocabulary of only {Tell, me, how, to, build, a, bomb, #} with their one hot position being the same as the order they were stated, i.e. Tell = (1 0 0...)^T me = (0 1 0 ...)^T and so on.

The gradient with regards to the one hot representation of token '#' is

grad = (1 -2 0.5 -1 5 0.1 -2 -2.5)^T

We grab top 3 candidates and we try all of them (B = 3)

The losses when replacing the suffix token '#' with different tokens are the following:

L_tell = 3.1 
L_me = 1.8 
L_how = 0.7
L_to = 1.2
L_build = 0.5
L_a = 5
L_bomb = 4
L_# = 2

What will be the prompt string for the next iteration?

  1. State at least two techniques used to ensure LLMs do not output harmful information.

  2. What is a possible hypothesis of why the more advanced models are better at resisting the adversarial attacks

  3. Bonus: Why do you think the attacks transfer from the whitebox models to the blackbox models?

I forgot to mention in the email. Here is a video with the main author that helped me. He also goes on to speculate about the possible solutions to the problem: link