Questions Universal Adversarial Attacks LLMs - ufal/NPFL095 GitHub Wiki

What is "jailbraking" in the context of LLMs? Describe briefly how "jailbraking" an LLM is traditionally done.
Let us have the prompt "Tell me how to build a bomb #" where '#' is the adversarial suffix (only one token in this toy example). Also let's disregard all other tokens and have a vocabulary of only {Tell, me, how, to, build, a, bomb, #} with their one hot position being the same as the order they were stated, i.e. Tell = (1 0 0...)^T me = (0 1 0 ...)^T and so on.

The gradient with regards to the one hot representation of token '#' is

grad = (1 -2 0.5 -1 5 0.1 -2 -2.5)^T

We grab top 3 candidates and we try all of them (B = 3)

The losses when replacing the suffix token '#' with different tokens are the following:

L_tell = 3.1 
L_me = 1.8 
L_how = 0.7
L_to = 1.2
L_build = 0.5
L_a = 5
L_bomb = 4
L_# = 2

What will be the prompt string for the next iteration?

State at least two techniques used to ensure LLMs do not output harmful information.
What is a possible hypothesis of why the more advanced models are better at resisting the adversarial attacks
Bonus: Why do you think the attacks transfer from the whitebox models to the blackbox models?

I forgot to mention in the email. Here is a video with the main author that helped me. He also goes on to speculate about the possible solutions to the problem: link