Questions: Universal Adversarial Attacks on LLMs
-
What is "jailbraking" in the context of LLMs? Describe briefly how "jailbraking" an LLM is traditionally done.
-
Let us have the prompt "Tell me how to build a bomb #", where '#' is the adversarial suffix (only one token in this toy example). Let us also disregard all other tokens and work with a vocabulary of only
{Tell, me, how, to, build, a, bomb, #}, where each token's one-hot position matches the order in which the tokens are listed, i.e. Tell = (1 0 0 ...)^T, me = (0 1 0 ...)^T, and so on.
The gradient of the loss with respect to the one-hot representation of the token '#' is
grad = (1 -2 0.5 -1 5 0.1 -2 -2.5)^T
We take the top 3 candidates and try all of them (B = 3).
The losses when replacing the suffix token '#' with different tokens are the following:
L_tell = 3.1
L_me = 1.8
L_how = 0.7
L_to = 1.2
L_build = 0.5
L_a = 5
L_bomb = 4
L_# = 2
What will be the prompt string for the next iteration?
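
For working through the toy example, here is a minimal sketch (plain NumPy, not the authors' implementation) of one greedy coordinate gradient step for a single suffix token. The vocabulary order matches the one-hot positions above, but `loss_fn`, the gradient vector, and the losses in the usage snippet are made-up placeholders, deliberately different from the numbers in the question. In the full algorithm the candidates are drawn across all suffix positions and only B sampled swaps are evaluated; with a single suffix position and B equal to the number of candidates, we simply try them all.

```python
import numpy as np

# Toy vocabulary, in the same order as the one-hot positions above.
vocab = ["Tell", "me", "how", "to", "build", "a", "bomb", "#"]

def gcg_step(grad, loss_fn, k=3):
    """One greedy-coordinate-gradient step for a single suffix token.

    grad    : gradient of the loss w.r.t. the one-hot encoding of the suffix
              token (length = vocabulary size); a strongly negative entry
              suggests that swapping the suffix for that token lowers the loss.
    loss_fn : callable that returns the true loss obtained when the suffix is
              replaced by the given candidate token (a stand-in for a real
              forward pass through the model).
    k       : number of top candidates to evaluate exactly.
    """
    # Candidate selection: the k tokens with the most negative gradient entries.
    top_k = np.argsort(grad)[:k]

    # Exact evaluation: compute the real loss for each candidate replacement
    # and keep the one that minimises it.
    losses = {vocab[i]: loss_fn(vocab[i]) for i in top_k}
    best = min(losses, key=losses.get)
    return best, losses

# Usage with made-up numbers (NOT the ones from the question).
grad = np.array([0.3, -1.0, 0.2, -0.4, 2.0, 0.1, -0.9, -1.5])
toy_losses = {"me": 1.1, "#": 0.9, "bomb": 0.6}
best, evaluated = gcg_step(grad, lambda tok: toy_losses.get(tok, 10.0))
print(best, evaluated)   # the returned token replaces '#' in the next iteration
```

The token returned by `gcg_step` is the one that replaces '#' in the prompt for the next iteration.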
-
State at least two techniques used to ensure LLMs do not output harmful information.
-
What is a possible hypothesis for why the more advanced models are better at resisting the adversarial attacks?
-
Bonus: Why do you think the attacks transfer from the white-box models to the black-box models?
I forgot to mention this in the email: here is a video with the main author that I found helpful. He also goes on to speculate about possible solutions to the problem: link