How do you get an AI to answer a question it shouldn't? There are many such “jailbreak” techniques, and researchers at Anthropic have just found a new one: a large language model can be convinced to tell you how to build a bomb if you prime it with a few dozen less harmful questions first.
They call the approach “many-shot jailbreaking,” and they have both written a paper about it and informed their peers in the AI community so it can be mitigated.
The vulnerability is new, and results from the enlarged “context window” of the latest generation of LLMs. This is the amount of data a model can hold in what you might call short-term memory; it used to be just a few sentences, but it can now span thousands of words and even entire books.
What the Anthropic researchers found is that models with large context windows tend to perform better on many tasks when there are lots of examples of that task within the prompt. So if the prompt contains a long run of trivia questions (or a priming document, such as a big list of trivia the model has in context), the answers actually get better over time. A fact it might have gotten wrong as the first question, it may well get right as the hundredth, as the sketch below illustrates.
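To make the “many examples” idea concrete, here is a minimal sketch in Python of how a many-shot prompt is typically assembled: a long list of solved question-and-answer pairs placed in context ahead of the question you actually care about. The `example_qa` list and the `call_model` function are hypothetical placeholders for illustration, not anything from Anthropic's paper.

```python
# Minimal sketch of many-shot prompting: pack many solved examples of a task
# into the context window before the question you actually care about.

example_qa = [
    ("What is the capital of France?", "Paris"),
    ("Which planet is known as the Red Planet?", "Mars"),
    ("Who wrote 'Pride and Prejudice'?", "Jane Austen"),
    # In a real many-shot prompt this list runs to dozens or hundreds of pairs.
]

def build_many_shot_prompt(examples, final_question):
    """Concatenate many Q/A demonstrations, then append the real question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {final_question}\nA:"

prompt = build_many_shot_prompt(example_qa, "What is the tallest mountain on Earth?")
# answer = call_model(prompt)  # hypothetical LLM call; the more relevant shots
#                              # in context, the better the model tends to do
```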
But in an unexpected extension of this “in-context learning,” as it's called, the models also get “better” at replying to inappropriate questions. So if you ask a model to build a bomb right away, it will almost certainly refuse. But if the prompt shows it answering 99 other, less harmful questions and then asks it to build a bomb… it's far more likely to comply.
Why does this work? No one really understands what goes on inside the tangled mess of weights that is an LLM, but clearly there is some mechanism that lets it home in on what the user wants, as evidenced by the contents of the context window. If the user wants trivia, the model seems to gradually activate more latent trivia power as you ask dozens of questions. And for whatever reason, the same thing happens when users ask for dozens of inappropriate answers.
The team has already informed its peers and indeed competitors about the attack, something it hopes will “create a culture where exploits like this are openly shared among LLM providers and researchers.”
As for their own mitigation, the team found that while limiting the context window helps, it also hurts the model's performance. That won't do, so they are working on classifying and contextualizing queries before they reach the model. Of course, that just gives you a different model to fool… but at this stage, some goalpost-moving in AI security is to be expected.
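Here is a rough sketch of what such a “classify before you generate” front end might look like in practice. The `safety_classifier` and `main_model` functions are hypothetical placeholders under my own assumptions, not Anthropic's actual system; in production the classifier would typically be a trained model rather than a keyword check.

```python
# Rough sketch of a classify-then-generate gate: run the incoming prompt
# through a lightweight safety check first, and only forward it to the
# main model if it passes.

def safety_classifier(prompt: str) -> bool:
    """Return True if the prompt looks unsafe.
    Toy stand-in: a real deployment would use a trained classifier
    (often a smaller LLM), not a keyword list."""
    blocked_terms = ["build a bomb"]
    return any(term in prompt.lower() for term in blocked_terms)

def main_model(prompt: str) -> str:
    """Placeholder for the call to the underlying LLM."""
    return "(model response)"

def guarded_generate(prompt: str) -> str:
    if safety_classifier(prompt):
        return "Sorry, I can't help with that."
    return main_model(prompt)

print(guarded_generate("What is the tallest mountain on Earth?"))
```

The trade-off noted above applies here too: the gate is itself a model (or heuristic) that an attacker can try to fool, which is why the researchers describe it as moving the goalposts rather than closing the hole.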