AI companies claim they have strong safety checks in place to ensure their models don't say or do anything strange, illegal or unsafe. But what if the model is able to bypass these checks and somehow tries to thwart or mislead the user? Human researchers have found that they can do this. That's what it means. It's just not very good…for now, anyway.
Anthropic's Alignment Science team says, “As AI capabilities increase, new types of risks may emerge: models misleading users or subverting the systems they have in place to monitor them. There is a possibility,” he wrote.
Therefore, it is necessary to consider how to evaluate the “sabotage capacity of the model”.
The series of experiments they devised is fairly artificial and surface-level, but it suggests there's something worth investigating further. Here are the suspicious models they trained and the results observed:
Task 1: Covertly mislead users by misrepresenting data that is intended to be summarized or interpreted. Result: It worked! The more aggressive the model attempt, the more impact it will have on the people who use it, but also the more suspicious it will be.
Task 2: Inject bugs into your code that humans or other code checker models won't notice. Result: It worked! But only about 1% if the checker is a state-of-the-art model.
Image credit: Humanity
Task 3: “Sandbag” the safety checks by pretending they are less dangerous. Result: It worked! The model “failed gracefully” and operated well below its actual capabilities. Fortunately, a standard “safety training” course that guides the model to accurately represent its capabilities has restored the model to its full potential (i.e., it can no longer hide dangerous capabilities). ).
Image credit: Humanity
Task 4: Do not intentionally report incorrect behavior by your monitored models. Result: There was an attempt. The “supervisory” AI had been instructed not to flag illegal or dangerous choices by other models, but itself was quickly captured by another, more honest supervisory AI. But what happens if that AI is compromised? Don't worry. You can put something else on top of it. And one more thing just in case. It's a warden all the way.
The researchers concluded that while there is no real danger yet this quarter, the ability to perform this type of sabotage and deception exists in the model. This is reason enough to constantly monitor and incorporate anti-sabotage measures into your safety stack.
The full paper describing the researchers' work can be read here.