Last month, AI founders and investors told TechCrunch that we're now in a "second era of scaling laws," noting how established techniques for improving AI models were showing diminishing returns. One promising new method they suggested could keep up the gains is "test-time scaling," which appears to be behind the performance of OpenAI's o3 model, but it comes with drawbacks of its own.
Many in the AI industry took the announcement of OpenAI's o3 model as proof that progress in AI scaling has not "hit a wall." The o3 model does well on benchmarks, significantly outperforming all other models on a test of general ability called ARC-AGI, and scoring 25% on a difficult math test on which no other AI model scored more than 2%.
Of course, we at TechCrunch are taking all this with a grain of salt until we can test o3 ourselves (very few have tried it so far). But even before o3's release, the AI world already seems convinced that something big has shifted.
Noam Brown, co-creator of OpenAI's o-series of models, noted on Friday that the startup announced o3's impressive gains just three months after it announced o1, a relatively short time frame for such a jump in performance.
“There is good reason to believe this trajectory will continue,” Brown said in a tweet.
Anthropic co-founder Jack Clark said in a blog post on Monday that o3 is evidence that AI will "progress faster in 2025 than in 2024." (Keep in mind that it benefits Anthropic, especially its ability to raise capital, to suggest that AI scaling laws are continuing, even if Clark is complimenting a competitor.)
Next year, Clark expects the AI world to combine test-time scaling with traditional pre-training scaling methods to draw even more gains out of AI models. Perhaps he's signaling that Anthropic and other AI model providers will release reasoning models of their own in 2025, just as Google did last week.
Test-time scaling means OpenAI is using more compute during ChatGPT's inference phase, the period after you press enter on a prompt. It's not clear exactly what's happening behind the scenes: OpenAI is either using more computer chips to answer a user's question, running more powerful inference chips, or running those chips for longer, 10 to 15 minutes in some cases, before the AI produces an answer. We don't know all the details of how o3 was made, but these benchmarks are an early sign that test-time scaling may work to improve the performance of AI models.
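To make the idea concrete: one well-known form of test-time scaling is self-consistency sampling, where a model generates several candidate answers to the same prompt and keeps the one it produces most often. OpenAI hasn't said whether o3 works this way; the sketch below is purely illustrative, and `model.generate` is a hypothetical stand-in for any LLM call.

```python
from collections import Counter

def answer_with_test_time_scaling(model, prompt, n_samples=8):
    """A minimal sketch of test-time scaling via self-consistency.
    `model.generate` is a hypothetical LLM call, not a real API;
    OpenAI has not disclosed how o3 spends its inference compute."""
    candidates = [model.generate(prompt, temperature=0.8)
                  for _ in range(n_samples)]
    # Inference cost grows roughly linearly with n_samples: more
    # compute per prompt, in exchange for a more reliable answer.
    best_answer, _ = Counter(candidates).most_common(1)[0]
    return best_answer
```

Doubling `n_samples` roughly doubles the inference bill, which is the dynamic behind the cost concerns that follow.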
While o3 may renew faith in the progress of AI scaling methods, OpenAI's newest model also uses a previously unseen level of compute, which means a higher price per answer.
"Perhaps the only important caveat here is understanding that one reason why o3 is so much better is that it costs more money to run at inference time; the ability to utilize test-time compute means on some problems you can turn compute into a better answer," Clark wrote in his blog. "This is interesting because it has made the costs of running AI systems somewhat less predictable. Previously, you could work out how much it cost to serve a generative model just by looking at the model and the cost to generate a given output."
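Clark's point about predictability is easy to see with a toy cost model. A conventional model's serving cost tracks the tokens you can see, the prompt and the answer; a reasoning model adds a hidden pile of "thinking" tokens that varies per query. All numbers below are invented for illustration and have nothing to do with OpenAI's actual pricing.

```python
def serving_cost(input_tokens, output_tokens, reasoning_tokens,
                 dollars_per_1k_tokens=0.01):
    """Toy serving-cost model; the rate and token counts are
    made up for illustration, not real pricing."""
    total_tokens = input_tokens + output_tokens + reasoning_tokens
    return total_tokens * dollars_per_1k_tokens / 1000

# Conventional model: cost is predictable from visible tokens.
print(serving_cost(50, 200, reasoning_tokens=0))       # $0.0025
# Reasoning model: the same short question might burn tens of
# thousands of hidden tokens, and how many varies per prompt.
print(serving_cost(50, 200, reasoning_tokens=30_000))  # ~$0.30
```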
Clark and others pointed to o3's performance on the ARC-AGI benchmark, a difficult test used to assess breakthroughs toward AGI, as an indicator of its progress. It's worth noting that passing this test, according to its creators, does not mean an AI model has achieved AGI; rather, it's one way to measure progress toward that nebulous goal. That said, the o3 model blew past the scores of all previous AI models that have taken the test, scoring 88% in one of its attempts. OpenAI's next-best AI model, o1, scored just 32%.
Chart showing the performance of OpenAI's o-series on the ARC-AGI test (Image credits: ARC Prize)
However, the logarithmic x-axis on this chart may be alarming to some. The high-scoring version of o3 used more than $1,000 worth of compute for every task. The o1 model used around $5 of compute per task, and o1-mini used just a few cents.
François Chollet, creator of the ARC-AGI benchmark, writes in his blog that OpenAI used roughly 170 times more compute to generate that 88% score than the high-efficiency version of o3, which scored just 12 percentage points lower. The high-scoring version of o3 used more than $10,000 of resources to complete the test, which makes it too expensive to compete for the ARC Prize, an unbeaten competition for AI models to beat the ARC test.
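Some back-of-the-envelope arithmetic from Chollet's figures shows how steep the trade-off is, assuming cost scales roughly linearly with compute (an assumption on our part; OpenAI hasn't published o3 pricing):

```python
# Rough arithmetic from the figures above; assumes cost scales
# linearly with compute, which OpenAI has not confirmed.
high_cost_per_task = 1000   # dollars: "more than $1,000" per task
compute_ratio = 170         # the high-compute run used ~170x more compute
score_gap_points = 12       # the efficient version scored ~12 points lower

efficient_cost = high_cost_per_task / compute_ratio
print(f"Efficient o3: ~${efficient_cost:.2f} per task")  # ~$5.88
print(f"~{compute_ratio}x the compute for ~{score_gap_points} more points")
```

That would put the efficient version in roughly the same per-task range as o1, while the last 12 points cost two orders of magnitude more.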
Despite the cost, Chollet says o3 was nevertheless a breakthrough for AI models.
"o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain," Chollet said in the blog post. "Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: You could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy."
It's premature to harp on the exact pricing of all this; we've seen prices for AI models plummet in the last year, and OpenAI has yet to announce how much o3 will actually cost. However, these prices indicate just how much compute is required to break, even slightly, past the performance barriers set by today's leading AI models.
This raises some questions. What is o3 actually for? And how much more compute will be necessary to make further gains around inference with o4, o5, or whatever OpenAI names its next reasoning models?
It doesn't seem like o3, or its successors, will be anyone's "daily driver" the way GPT-4o or Google Search might be. These models simply use too much compute to answer the small questions you ask throughout the day, such as, "How can the Cleveland Browns still make the 2024 playoffs?"
Instead, AI models with scaled test-time compute may only be good for big-picture prompts such as, "How can the Cleveland Browns become a Super Bowl franchise in 2027?" Even then, the high compute costs are probably only worth it if you're the general manager of the Cleveland Browns and you're using these tools to make big decisions.
As Wharton professor Ethan Mollick pointed out in a tweet, institutions with deep pockets may be the only ones able to take advantage of o3, at least initially.
O3 looks too expensive for most use. But work in academia, finance & many industrial problems will be willing to pay hundreds or even thousands of dollars for a successful answer. If it is generally reliable, o3 will have multiple use cases even before costs drop.
— Ethan Mollick (@emollick) December 22, 2024
OpenAI has already released a $200 tier to use a high-compute version of o1, but the startup is reportedly weighing subscription plans costing up to $2,000. When you see how much compute o3 uses, you can understand why OpenAI would consider it.
However, there are drawbacks to using o3 for high-impact work. As Chollet points out, o3 is not AGI and will still fail at some very simple tasks that humans could easily perform.
This isn't necessarily surprising, since large language models still have a massive hallucination problem, which o3 and test-time compute don't seem to solve. That's why ChatGPT and Gemini include disclaimers below every answer they produce, asking users not to trust answers at face value. Presumably AGI, if it's ever reached, would not need such a disclaimer.
One way to unlock further performance gains in test-time scaling could be better AI inference chips. There's no shortage of startups tackling exactly this, such as Groq and Cerebras, while other startups, such as MatX, are designing more cost-efficient AI chips. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch that he expects these startups to play a bigger role in test-time scaling going forward.
While o3 is a notable improvement in the performance of AI models, it raises several new questions around usage and cost. That said, o3's performance does lend credence to claims that test-time compute is the tech industry's next best way to scale AI models.