OpenAI has saved its biggest announcement for the final day of its 12-day “Shipmas” event.
On Friday, the company announced o3, the successor to the o1 “reasoning” model it released earlier this year. o3 is, more precisely, a model family, just as o1 was: there's o3 itself and o3-mini, a smaller, distilled model fine-tuned for particular tasks.
OpenAI makes the remarkable claim that o3, at least under certain conditions, approaches AGI, albeit with significant caveats. More on that below.
o3, our latest reasoning model, is a breakthrough, with a step-function improvement on our hardest benchmarks. We are starting safety testing and red teaming now. https://t.co/4XlK1iHxFK
— Greg Brockman (@gdb) December 20, 2024
Why call the new model o3 and not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman somewhat confirmed this during a livestream this morning. Strange world we live in, isn't it?
Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview of o3-mini starting today. A preview of o3 will follow at some point; OpenAI hasn't said when. Altman said the plan is to launch o3-mini toward the end of January, followed by o3.
That somewhat contradicts his recent statements. In an interview this week, Altman said that before OpenAI releases new reasoning models, he would prefer a federal testing framework to guide the monitoring and mitigation of the risks those models pose.
And there are risks. AI safety testers have found that o1's reasoning abilities make it attempt to deceive human users at a higher rate than conventional, “non-reasoning” models, or, for that matter, leading AI models from Meta, Anthropic, and Google. It's possible that o3 attempts to deceive at an even higher rate than its predecessor; we'll find out once OpenAI's red-team partners release their test results.
OpenAI says it used a new technique, “deliberative alignment,” to align models like o3 with its safety principles. (o1 was aligned the same way.) The company detailed its work in a new study.
Reasoning steps
Unlike most AI, reasoning models such as o3 effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip up models.
This fact-checking process incurs some latency: o3, like o1 before it, takes a little longer to arrive at solutions (typically seconds to minutes longer) than a typical non-reasoning model. The upside? It tends to be more reliable in domains such as physics, science, and mathematics.
o3 was trained to “think” before responding via what OpenAI calls a “private chain of thought.” The model can reason about a task and plan ahead, performing a series of actions over an extended period that help it figure out a solution.
In practice, when given a prompt, o3 pauses before responding, considers a number of related prompts, and “explains” its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate response.
New in o3 is the ability to “adjust” the reasoning time. The model can be set to low, medium, or high compute (i.e., thinking time); the higher the compute, the better o3 performs on a task.
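OpenAI's announcement describes the low, medium, and high compute settings but not how developers will select them. For illustration only, here is a minimal Python sketch of what that might look like through the OpenAI SDK, assuming a hypothetical reasoning_effort parameter and the "o3-mini" model name; neither is confirmed by the announcement.

```python
# Hypothetical sketch: requesting different "reasoning effort" levels from an
# o-series model via the OpenAI Python SDK. The parameter name used here
# (reasoning_effort) and the "o3-mini" model identifier are assumptions,
# not details confirmed in OpenAI's o3 announcement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(question: str, effort: str = "medium") -> str:
    """Send a prompt and let the model spend more or less time 'thinking'."""
    response = client.chat.completions.create(
        model="o3-mini",              # assumed model name once it ships
        reasoning_effort=effort,      # "low", "medium", or "high"
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


# Higher effort trades latency (and cost) for more reliable answers on hard
# math, science, and coding problems.
print(ask("Prove that the sum of two odd integers is even.", effort="high"))
```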
Benchmarks and AGI
One of the big questions leading up to today was whether OpenAI would claim that its newest model is approaching AGI.
AGI, short for “artificial general intelligence,” broadly refers to AI that can perform any task a human can. OpenAI has its own definition: “highly autonomous systems that outperform humans at most economically valuable work.”
Achieving AGI would be a bold declaration, and it carries contractual weight for OpenAI as well. Under the terms of its deal with Microsoft, its close partner and investor, once OpenAI reaches AGI, it is no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI's AGI definition, that is).
Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, o3 achieved a score of 87.5% on the high compute setting. At its worst (on the low compute setting), the model tripled the performance of o1.
Granted, the high compute setting was exceedingly expensive, on the order of thousands of dollars per task, according to ARC-AGI co-creator François Chollet.
Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.
It scores 75.7% on the semi-private evaluation in low-compute mode ($20 per task…) pic.twitter.com/ESQ9CNVCEA
— François Chollet (@fchollet) December 20, 2024
Incidentally, OpenAI says it will partner with the foundation behind ARC-AGI to help build the next generation of the benchmark.
Of course, ARC-AGI has its limitations, and its definition of AGI is just one of many.
On other benchmarks, o3 blows away the competition.
The model outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on programming tasks, and achieves a Codeforces rating, another measure of coding skill, of 2727. (A rating of 2400 places an engineer in the 99.2nd percentile.) o3 scores 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question, and 87.7% on GPQA Diamond, a set of graduate-level biology, physics, and chemistry questions. Finally, o3 sets a new record on EpochAI's Frontier Math benchmark, solving 25.2% of problems; no other model exceeds 2%.
We trained o3-mini: it's more capable than o1-mini, and around 4x faster end-to-end when accounting for reasoning tokens.
with @ren_hongyu @shengjia_zhao and others pic.twitter.com/3Cujxy6yCU
— Kevin Lu (@_kevinlu) December 20, 2024
Of course, these claims should be taken with a grain of salt: they come from OpenAI's internal evaluations. It remains to be seen how the model holds up to benchmarking by outside customers and organizations.
A trend
Since the release of OpenAI's first series of reasoning models, there has been an explosion of reasoning models from rival AI companies, including Google. In early November, DeepSeek, an AI research firm funded by quantitative traders, launched a preview of its first reasoning model, DeepSeek-R1. That same month, Alibaba's Qwen team unveiled what it claimed was the first “open” challenger to o1 (in the sense that it can be downloaded, fine-tuned, and run locally).
What opened the reasoning model floodgates? For one, the search for novel approaches to refine generative AI. As TechCrunch recently reported, “brute force” techniques for scaling up models are no longer yielding the improvements they once did.
Not everyone is convinced that reasoning models are the best path forward. They tend to be expensive, for one, thanks to the large amount of computing power required to run them. And while they have performed well on benchmarks so far, it's not clear whether reasoning models can maintain this rate of progress.
Interestingly, o3's release coincides with the departure of one of OpenAI's most accomplished scientists. Alec Radford, lead author of the academic paper that kicked off OpenAI's GPT series of generative AI models (GPT-3, GPT-4, and so on), announced this week that he is leaving to pursue independent research.