When a company releases a new AI video generator, it doesn't take long before someone uses it to create a video of actor Will Smith eating spaghetti.
It has become something of a benchmark as well as a meme. The idea is to see if a new video generator can realistically render Smith slurping a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.
Google's Veo 2 is the latest to get the treatment.
It's time to eat spaghetti. pic.twitter.com/AZO81w8JC0
— Jerrod Lew (@jerrod_lew) December 17, 2024
Will Smith and pasta is just one of several strange "unofficial" benchmarks that swept the AI community in 2024. A 16-year-old developer built an app that lets an AI control Minecraft and tests its ability to design structures. Elsewhere, British programmers created a platform where AIs play games like Pictionary and Connect 4 against each other.
It's not that more academic tests of AI performance don't exist. So why did the weird ones take off?
Image credit: Paul Calcraft
First, many industry-standard AI benchmarks don't mean much to the general public. Companies often cite an AI's ability to answer Math Olympiad exam questions or find plausible solutions to doctoral-level problems. But most people, myself included, use chatbots for things like replying to emails and doing basic research.
Crowdsourced industry measures aren't necessarily better or more informative.
Take, for example, Chatbot Arena, a public benchmark that many AI enthusiasts and developers follow avidly. Chatbot Arena lets anyone on the web rate how well AI performs at specific tasks, such as creating web apps or generating images. But the raters tend to be unrepresentative, since most come from the AI and tech industries, and they vote based on personal, hard-to-pin-down preferences.
Chatbot Arena interface. Image credit: LMSYS
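For readers curious how crowdsourced pairwise votes become a leaderboard, here is a minimal sketch of an Elo-style rating update over head-to-head votes. It illustrates the general approach behind arena-style benchmarks, not Chatbot Arena's actual pipeline; the model names, K-factor, and starting rating are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative Elo-style update over pairwise votes.
# NOT Chatbot Arena's real pipeline; K and START are arbitrary assumptions.
K = 32          # how far a single vote can move a rating (assumed)
START = 1000.0  # initial rating for every model (assumed)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one crowdsourced vote: winner beat loser."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - exp_win)
    ratings[loser] -= K * (1.0 - exp_win)

if __name__ == "__main__":
    ratings = defaultdict(lambda: START)
    # Hypothetical votes: (winning model, losing model)
    votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
    for winner, loser in votes:
        update(ratings, winner, loser)
    for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:.1f}")
```

The flaw the article points to lives upstream of this math: if the votes themselves come from an unrepresentative crowd with idiosyncratic preferences, the resulting rankings inherit that bias no matter how the scores are aggregated.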
In a recent post on X, Wharton management professor Ethan Mollick pointed out another problem with many AI industry benchmarks: they don't compare a system's performance to that of the average person.
"The fact that we don't have 30 different benchmarks from different organizations for medicine, law, quality of advice, and so on is a real shame, because people are using these systems for those things regardless," Mollick wrote.
Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are certainly not empirical, or even all that generalizable. Just because an AI passes the Will Smith test doesn't mean it can convincingly render, say, a hamburger.
Note the typo: there is no such model as Claude 3.6 Sonnet. Image credit: Adonis Singh
One expert I spoke to about AI benchmarking suggested that the AI community focus on AI's downstream impact rather than its capabilities in narrow domains. That's wise. But I have a feeling the strange benchmarks aren't going away anytime soon. They're not just entertaining (who doesn't love watching an AI build a Minecraft castle?), they're also easy to understand. And, as my colleague Max Zeff recently wrote, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.
The only question on my mind is: what weird new benchmarks will catch on in 2025?