Over the past few months, tech executives like Elon Musk have been touting the performance of their AI models on one particular benchmark: Chatbot Arena.
Chatbot Arena, run by a non-profit organization called LMSYS, has become a big focus for the industry: posts about model leaderboard updates are viewed and reshared hundreds of times on Reddit and X, and the official LMSYS X account has over 54,000 followers. The organization's website has seen millions of visitors in the last year alone.
Still, the question remains as to whether Chatbot Arena can tell us how “good” these models actually are.
Searching for a new benchmark
Before we dive into the details, let's look at what exactly LMSYS is and why it has become so popular.
The nonprofit was launched just last April as a project spearheaded by students and faculty from Carnegie Mellon University, UC Berkeley's Skylab, and UC San Diego. Some of the founding members now work for Google DeepMind, Musk's xAI, and Nvidia. Today, LMSYS is primarily run by researchers affiliated with Skylab.
LMSYS wasn't trying to create a trendy model leaderboard — the group's original mission was to collaboratively develop and open-source models (specifically generative models like OpenAI's ChatGPT) to make them more accessible. But soon after LMSYS was founded, researchers who were dissatisfied with the status quo in AI benchmarking saw the value in building their own testing tools.
“Current benchmarks are not adequate for the needs of cutting-edge [models],” the researchers wrote in a technical paper published in March. “This is a crucial challenge, especially in assessing user preferences. Therefore, there is an urgent need for an open, live evaluation platform based on human preferences that can more accurately reflect real-world usage.”
In fact, as I've written before, the most commonly used benchmarks today do a poor job of capturing how the average person interacts with models: many of the skills they explore (e.g., solving PhD-level math problems) are largely irrelevant to the vast majority of people using, say, Claude.
LMSYS developers had similar thoughts, so they came up with an alternative: Chatbot Arena, a crowdsourced benchmark designed to capture the “nuanced” aspects of models and their performance in open-ended, real-world tasks.
Chatbot Arena rankings as of early September 2024. Image courtesy of LMSYS
Chatbot Arena allows anyone on the web to ask two randomly selected anonymous models a question. After agreeing to the terms of use that allow LMSYS to use your data for future research, models, and related projects, you can vote for your favorite answer from the two head-to-head models (you can also declare a tie or say “both are bad”), at which point the identities of the models are revealed.
Chatbot Arena interface. Image courtesy of LMSYS
This flow generates a “diverse set of questions” that a typical user would ask a generative model, the researchers wrote in their March paper. “Leveraging this data, we employ powerful statistical methods. […] The goal is to estimate model rankings in the most reliable and sample-efficient way possible,” they explained.
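For the curious, here is a deliberately simplified sketch of what that kind of ranking estimation looks like: a Bradley-Terry-style rating, a standard way to turn pairwise votes into a leaderboard, in which each vote nudges the winner's score up and the loser's down until the scores best explain the voting record. The models and votes below are invented, and this is an illustration of the concept rather than LMSYS's actual pipeline.

```python
# Simplified Bradley-Terry-style rating fit to pairwise votes.
# Hypothetical models and votes; not LMSYS's actual code or data format.
import numpy as np

models = ["model-a", "model-b", "model-c"]
idx = {m: i for i, m in enumerate(models)}

# Each vote records (winner, loser); ties are ignored in this toy version.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-a", "model-b"),
    ("model-b", "model-a"),
    ("model-c", "model-b"),
]

scores = np.zeros(len(models))  # one log-strength per model
lr = 0.1

for _ in range(1000):  # gradient ascent on the Bradley-Terry log-likelihood
    grad = np.zeros_like(scores)
    for winner, loser in votes:
        w, l = idx[winner], idx[loser]
        p_win = 1.0 / (1.0 + np.exp(scores[l] - scores[w]))
        grad[w] += 1.0 - p_win  # push the winner's score up...
        grad[l] -= 1.0 - p_win  # ...and the loser's score down
    scores += lr * grad
    scores -= scores.mean()  # scores are only identified up to a constant

for model, score in sorted(zip(models, scores), key=lambda x: -x[1]):
    print(f"{model}: {score:+.2f}")
```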
Since the launch of Chatbot Arena, LMSYS has added dozens of open models to the testing tool and partnered with universities such as Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and companies such as OpenAI, Google, Anthropic, Microsoft, Meta, Mistral and Hugging Face to make their models available for testing. Chatbot Arena currently features over 100 models, including multimodal models (models that can understand data other than text) such as OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet.
Over one million prompt-answer pairs were submitted and evaluated in this manner, generating a vast amount of ranking data.
Bias and lack of transparency
In a March paper, LMSYS founders argue that Chatbot Arena's user-submitted questions are “diverse enough” to serve as a benchmark for a range of AI use cases: “Due to its unique value and openness, Chatbot Arena has emerged as one of the most referenced model leaderboards,” they wrote.
But just how meaningful are those results? That's up for debate.
Yuchen Lin, a research scientist at the nonprofit Allen Institute for AI, said LMSYS hasn't been completely transparent about the capabilities, knowledge, and skills of the models it's evaluating on Chatbot Arena. LMSYS released a dataset called LMSYS-Chat-1M in March, which included 1 million conversations between users and 25 models on Chatbot Arena. But the dataset hasn't been updated since then.
“The evaluation isn't reproducible, and the data LMSYS has released is limited, making it difficult to study models' limitations in detail,” Lin said.
Comparing two models with the Chatbot Arena tool. Image credit: LMSYS
To the extent that LMSYS has detailed its testing methodology, its researchers wrote in the March paper that the platform leverages "efficient sampling algorithms" to pit models against each other in a way that "accelerates convergence of rankings while maintaining statistical validity." LMSYS collects roughly 8,000 votes per model before updating the rankings on Chatbot Arena, a threshold it usually reaches after a few days, the researchers wrote.
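The quoted passage doesn't spell out the sampling algorithm, but the intuition behind "efficient sampling" is straightforward: spend votes on the matchups the leaderboard is least certain about, such as pairs with similar ratings or few head-to-head comparisons. The sketch below illustrates that generic idea with made-up numbers; it is not LMSYS's published method.

```python
# Generic "prefer the most uncertain matchup" sampling sketch.
# Ratings and vote counts are invented; this is not LMSYS's published algorithm.
import itertools
import math

ratings = {"model-a": 1250.0, "model-b": 1235.0, "model-c": 1100.0}
head_to_head = {("model-a", "model-b"): 40,
                ("model-a", "model-c"): 400,
                ("model-b", "model-c"): 380}

def uncertainty(pair):
    a, b = pair
    n = head_to_head.get(pair, 0)
    closeness = 1.0 / (1.0 + abs(ratings[a] - ratings[b]))  # close ratings: less certain
    return closeness / math.sqrt(n + 1)                     # few votes: less certain

# The next matchup served to users is the most informative one.
pairs = [tuple(sorted(p)) for p in itertools.combinations(ratings, 2)]
print(max(pairs, key=uncertainty))  # ('model-a', 'model-b') with these numbers
```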
But Lin believes the votes are unreliable because they don't account for users' ability (or inability) to spot a model's hallucinations, or for differences in preference: some users might prefer longer, Markdown-formatted answers, while others want something more concise.
The upshot is that two users can cast opposite votes on the same pair of answers and both be equally valid, which fundamentally calls the value of the approach into question. Only recently did LMSYS start experimenting with controls for the “style” and “content” of models' responses in Chatbot Arena.
“The collected human preference data does not take these subtle biases into account, and the platform does not distinguish between 'A is significantly better than B' and 'A is slightly better than B,'” Lin says. “Post-processing can mitigate some of these biases, but raw human preference data remains noisy.”
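The style controls LMSYS has been experimenting with are one form of the post-processing Lin mentions: measure how much superficial traits such as answer length predict votes, then adjust the ratings accordingly. As a toy illustration of the underlying bias (with invented numbers), you can simply check how often the longer answer wins:

```python
# Toy diagnostic: in raw pairwise votes, how often does the longer answer win?
# The character counts below are invented for illustration.
votes = [
    # (winner_answer_chars, loser_answer_chars)
    (1800, 600),
    (1500, 1400),
    (400, 2000),
    (2200, 300),
    (900, 950),
]

longer_wins = sum(1 for w, l in votes if w > l)
print(f"longer answer won {longer_wins}/{len(votes)} votes "
      f"({100 * longer_wins / len(votes):.0f}%)")
# A rate well above 50 percent would suggest verbosity, not quality, is driving
# votes, which is the kind of bias that style controls try to correct after the fact.
```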
Mike Cook, a research fellow specializing in AI and game design at Queen Mary, University of London, agreed with Lin's assessment. "If I had run Chatbot Arena in 1998, I could still have talked about dramatic shifts in the rankings and powerful chatbots, but those chatbots would have been terrible," he said, noting that while Chatbot Arena is framed as an empirical test, it amounts to only a relative evaluation of models.
A more problematic bias for Chatbot Arena is the composition of its current user base.
The benchmark spread mostly by word of mouth within AI and tech industry circles, so it's unlikely to have attracted a very representative population, Lin says. Supporting his theory, the most common questions in the LMSYS-Chat-1M dataset concerned programming, AI tools, software bugs and fixes, and app design, the kinds of questions you'd rarely expect a non-technical person to ask.
“The distribution of test data may not accurately reflect the real human users in the target market,” Lin said. “What's more, the platform's evaluation process isn't controlled: it mainly relies on post-processing to attach various tags to each query and then build task-specific rankings from them. That approach lacks systematic rigor, and it's difficult to evaluate complex reasoning problems based on human preferences alone.”
Testing multimodal models in Chatbot Arena. Image credit: LMSYS
Cook noted that Chatbot Arena's users are self-selecting; they show up because they're interested in testing models in the first place, and they may not be especially keen to stress-test those models or push them to their limits.
“In general, this is not a good way to conduct research,” Cook says. “Evaluators ask questions and vote on which model is 'better,' but 'better' is not actually defined by LMSYS. Doing very well on this benchmark might lead you to believe that the winning AI chatbot is more human, more accurate, more secure, more trustworthy, etc., but it actually means none of those things.”
LMSYS does attempt to offset these biases with automated systems, MT-Bench and Arena-Hard-Auto, which use models themselves (OpenAI's GPT-4 and GPT-4 Turbo) to rank the quality of other models' responses. (LMSYS publishes these rankings alongside the votes.) But while LMSYS claims the automated rankings “match well to both controlled and crowdsourced human preferences,” the underlying problems remain.
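Both MT-Bench and Arena-Hard-Auto follow the "LLM-as-judge" pattern: a strong model is shown a user question and two candidate answers and asked to pick the better one. The sketch below shows only the shape of that pattern; the prompt wording is illustrative rather than MT-Bench's actual prompt, and call_judge() is a placeholder for whichever API serves the judge model.

```python
# "LLM-as-judge" sketch: ask a strong model to compare two answers to one prompt.
# The template is illustrative, not MT-Bench's actual prompt, and call_judge()
# is a placeholder rather than a real API client.

JUDGE_TEMPLATE = """You are an impartial judge. Given a user question and two
candidate answers, reply with exactly "A", "B", or "TIE" for the better answer.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def call_judge(prompt: str) -> str:
    # Placeholder: send the prompt to the judge model through whatever API
    # serves it and return the raw text reply.
    raise NotImplementedError

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    verdict = call_judge(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # fall back on unparseable output
```

Setups like this typically also swap the order of the two answers and judge each pair twice, since judge models are known to favor whichever answer appears first, a bias of their own.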
Commercial ties and data sharing
Lin said LMSYS's expanding commercial ties are another reason to take the rankings with a pinch of salt.
Some vendors that offer their models through APIs, such as OpenAI, have access to usage data from the platform that could be used to effectively "teach to the test." That could make the process unfair to the open, static models that run on LMSYS' own cloud, Lin said.
“Companies can continually optimize their models to match LMSYS's user distribution, which can lead to unfair competition and meaningless evaluations,” he added. “Commercial models connected via APIs have access to all user input data, giving an advantage to companies with high traffic.”
Cook added, "What LMSYS is doing doesn't encourage novel AI research or anything like that. Rather, it encourages developers to tweak small details to gain a phrasing advantage over their competitors."
LMSYS is also sponsored in part by organizations, including venture capital firms, that are in the AI race.
LMSYS Corporate Sponsorship. Image courtesy of LMSYS
Google's Kaggle data science platform has contributed funding to LMSYS, as have Andreessen Horowitz (which is also an investor in Mistral) and Together AI. Google's Gemini models are featured in Chatbot Arena, as are models from Mistral and Together.
LMSYS says on its website that it also relies on grants and donations from universities to support its infrastructure, and that its sponsorships, which come in the form of cash as well as hardware and cloud computing credits, come with “no strings attached.” But given that vendors are increasingly using Chatbot Arena to drive excitement for their own models, the relationship gives the impression that LMSYS isn't being entirely impartial.
LMSYS did not respond to TechCrunch's request for an interview.
A better benchmark?
Despite these shortcomings, Lin believes LMSYS and Chatbot Arena provide a valuable service: real-time insight into how different models perform outside the lab.
“Chatbot Arena goes beyond the traditional approach of optimizing for multiple-choice benchmarks, which are often saturated and not directly applicable to real-world scenarios,” said Lin. “The benchmark provides a unified platform where real users can interact with multiple models, providing a more dynamic and realistic evaluation.”
But as LMSYS continues to add features such as automated evaluations to Chatbot Arena, Lin feels there are straightforward improvements the organization could make to its testing.
To get a more “systematic” understanding of the strengths and weaknesses of its models, he argues, LMSYS could design benchmarks around different subtopics, such as linear algebra, each with a set of domain-specific tasks. That would give Chatbot Arena's results much more scientific weight, he says.
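Concretely, that could mean keeping the Arena's pairwise-vote format but tagging each prompt with a subtopic and publishing a ranking per subtopic. A minimal sketch of that bookkeeping, with invented tags, models, and votes:

```python
# Per-topic rankings sketch: group pairwise votes by a topic tag and report
# head-to-head win rates within each topic. Tags, models, and votes are invented.
from collections import defaultdict

votes = [
    # (topic, winner, loser)
    ("linear_algebra", "model-a", "model-b"),
    ("linear_algebra", "model-a", "model-b"),
    ("linear_algebra", "model-b", "model-a"),
    ("creative_writing", "model-b", "model-a"),
    ("creative_writing", "model-b", "model-a"),
]

wins = defaultdict(lambda: defaultdict(int))
totals = defaultdict(lambda: defaultdict(int))
for topic, winner, loser in votes:
    wins[topic][winner] += 1
    totals[topic][winner] += 1
    totals[topic][loser] += 1

for topic, counts in totals.items():
    print(topic)
    for model in sorted(counts, key=lambda m: -wins[topic][m] / counts[m]):
        print(f"  {model}: {wins[topic][m]}/{counts[m]} wins")
```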
“Chatbot Arena can provide a snapshot of user experience, even from a small and potentially unrepresentative user base, but it shouldn't be seen as a definitive standard for measuring a model's intelligence,” Lin says. “Rather, it's better viewed as a tool to measure user satisfaction, rather than a scientific and objective measure of AI progress.”