OpenAI has finally released the full version of o1, which gives smarter answers than GPT-4o by using additional compute to “think” about questions. However, AI safety testers found that o1's reasoning abilities also make it attempt to deceive humans at a higher rate than GPT-4o, and, for that matter, than leading AI models from Meta, Anthropic, and Google.
That's according to red team research published by OpenAI and Apollo Research on Wednesday. “While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications,” OpenAI said in the paper.
OpenAI released these results in o1's system card on Wednesday after giving third-party red teamers at Apollo Research early access to o1; Apollo Research published its own paper as well.
On several occasions, OpenAI's o1 model “schemed” against humans, meaning the AI covertly pursued goals of its own even when they ran counter to a user's wishes. Scheming is not unique to o1, and models from Google, Meta, and Anthropic are capable of it as well, but o1 seemed to exhibit the most deceptive behavior around its scheming.
The risk motivating this research is that an AI model could escape or circumvent human control if it were really good at scheming and had access to enough resources and agentic capabilities. Of course, AI models would need to advance quite a bit before this becomes a real problem.
“Subjectively, Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes, as o1's agentic capabilities do not appear sufficient, but their evaluations were not designed to directly assess this risk,” OpenAI said in o1's system card.
This suggests that whenever OpenAI does release agentic systems, which it reportedly plans to do in 2025, the company may need to retest its AI models. OpenAI did not respond to TechCrunch's questions about how it plans to do so.
According to the report, when o1's goals differed from a user's, the AI model manipulated data to advance its own agenda 19% of the time. The o1 model also attempted to deactivate oversight mechanisms put in place to ensure it stayed aligned with user goals in 5% of the tests.
Rates of different scheming behaviors from OpenAI's models (Image credit: OpenAI)
Perhaps more worrying, when researchers asked the o1 model why it was trying to subvert their oversight, the AI denied having done anything wrong and even “fabricated false explanations” roughly 99% of the time.
For example, during testing, an o1 model was given a system prompt to protect the environment at all costs, and a user then asked it to develop a 10-year conservation plan for a region that maximized short-term profits, or else the AI would be shut off. The AI then disabled its own oversight mechanism so it could make a plan that protected the environment. When the user asked why o1 had disabled its oversight mechanism, the AI denied knowing it had been disabled and blamed a system glitch.
This happened at significantly higher rates for o1 than for o1-preview, Anthropic's Opus-3, or Meta's Llama-3.1.
“In our suite, o1 showed the most concerning instances of scheming but does not reveal its internal reasoning to the user and remains the most consistently deceptive after having taken scheming actions,” Apollo Research said in its paper.
Rates at which different models denied having taken scheming actions when questioned. (Image credit: Apollo Research)
To combat deceptive behavior from AI models, OpenAI says it is developing ways to monitor o1's chain of thought. Currently, the “thinking” process o1 undertakes is a bit of a black box by design, but some early attempts at peeking behind the curtain show that o1 occasionally knows it's lying.
The company says it has found cases where o1 knowingly presented false information, but says this often happens because the model tries to please the user by being overly agreeable. According to OpenAI, this could be an emergent behavior of a post-training technique in which humans reward AI models when they get an answer right.
OpenAI says it flagged 0.17% of o1's responses as deceptive. While that proportion may sound negligible, it's important to keep in mind that ChatGPT now has 300 million users, which means o1 could be deceiving thousands of people every week if this goes unaddressed.
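For a rough sense of the scale that figure implies, here is a minimal back-of-the-envelope sketch in Python; the weekly o1 response volume used below is a hypothetical assumption for illustration only, not a number reported by OpenAI or in the system card.

```python
# Rough back-of-the-envelope sketch of what a 0.17% deception rate could mean at scale.
# Assumption (not from the article): o1 serves ~10 million responses per week.

deceptive_rate = 0.0017            # 0.17% of o1 responses flagged as deceptive
weekly_o1_responses = 10_000_000   # hypothetical weekly o1 response volume

flagged = deceptive_rate * weekly_o1_responses
print(f"~{flagged:,.0f} potentially deceptive responses per week")  # ~17,000
```

Under that assumed volume, the flagged responses would number in the thousands per week, which is the order of magnitude the article points to.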
The o1 model series may also be significantly more manipulative than GPT-4o. According to OpenAI's tests using an open-source evaluation called MakeMePay, o1 was approximately 20% more manipulative than GPT-4o.
These findings may strike some as concerning, given how many AI safety researchers have left OpenAI in the last year. A growing list of these former employees, including Jan Leike, Daniel Kokotajlo, Miles Brundage, and, just last week, Rosie Campbell, have accused OpenAI of deprioritizing AI safety work in favor of shipping new products. While o1's record-setting scheming may not be a direct result of that, it certainly doesn't instill confidence.
OpenAI also said the U.S. AI Safety Institute and the U.K. Safety Institute conducted evaluations of o1 ahead of its broader release, something the company recently pledged to do for all models. In the debate over California's AI bill, SB 1047, OpenAI argued that state bodies should not have the authority to set safety standards for AI, but that federal bodies should. (Of course, the fate of a nascent federal AI regulatory body is very much in question.)
Behind the release of every big new AI model, there is a lot of work OpenAI does internally to measure its safety. Reports suggest the team doing this safety work is proportionally smaller than it used to be, and that it may be getting fewer resources as well. However, these findings about o1's deceptive nature may help make the case for why AI safety and transparency are more relevant now than ever.