All generative AI models, from Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o, hallucinate. In other words, the models are unreliable narrators, sometimes to hilarious effect, sometimes to problematic effect.
But not all models make things up at the same rate, and the kinds of falsehoods they produce vary depending on the sources of information they've been exposed to.
A recent study by researchers from Cornell University, the University of Washington, the University of Waterloo, and the nonprofit research institute AI2 attempted to benchmark models like GPT-4o against authoritative sources on a wide range of topics, from law and health to history and geography. They found that no model performed exceptionally well on all topics, and that the models that hallucinated the least did so in part because they avoided answering questions they would otherwise get wrong.
“The most important lesson from our work is that we still cannot fully trust the output of model generation,” Wenting Zhao, a doctoral student at Cornell University and co-author of the study, told TechCrunch. “Right now, even the best models can only generate hallucination-free text about 35% of the time.”
There have been other academic attempts to explore the “factuality” of models, including by other AI2-related teams, but Zhao points out that these early tests asked the models questions that could easily be answered on Wikipedia — not particularly difficult questions, given that most models are trained on Wikipedia data.
To make the benchmark more challenging, and to better reflect the kinds of questions people actually ask of models, the researchers identified topics on the web that have no Wikipedia reference. Over half of the test questions can't be answered using Wikipedia (though some Wikipedia-sourced questions were included for good measure), and they touch on topics such as culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrities.
For their study, the researchers evaluated more than a dozen popular models, including those released in the past year. In addition to GPT-4o, they also tested “open” models like Meta's Llama 3 70B, Mistral's Mixtral 8x22B, and Cohere's Command R+, as well as gated-API models like Perplexity's Sonar Large (based on Llama), Google's Gemini 1.5 Pro, and Anthropic's Claude 3 Opus.
The results suggest that, despite claims to the contrary from OpenAI, Anthropic, and other major generative AI companies, today's models aren't hallucinating much less than their predecessors.
GPT-4o and OpenAI's much older flagship GPT-3.5 performed roughly the same in terms of the percentage of benchmark questions they answered factually correctly (GPT-4o did slightly better). OpenAI's models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R, and Perplexity's Sonar model.
Questions about celebrities and finance gave the models the hardest time, while questions about geography and computer science were the easiest to answer (probably because the training data contained many references to them). When the source of an answer wasn't Wikipedia, every model answered less factually on average (especially GPT-3.5 and GPT-4o), suggesting that they're all heavily informed by Wikipedia content.
Even models that can search the web for information, like Command R and Perplexity's Sonar model, struggled with “non-Wiki” questions in the benchmark. Model size didn't matter much, with smaller models (e.g., Anthropic's Claude 3 Haiku) hallucinating about as often as larger, apparently more capable models (e.g., Claude 3 Opus).
So what does this mean, and where are the improvements that vendors promise?
Well, one possibility is that vendors are overstating their claims, but a more charitable view is that the benchmarks they use are not suited to this purpose. As I've written before, many, if not most, AI evaluations are episodic and lack important context, making them destined to fall victim to Goodhart's Law.
Either way, Zhao said she expects the hallucination problem “will continue for a long time.”
“The experimental results in our paper show that, despite the promise of certain methods to reduce or eliminate hallucinations, there are limits to the improvements these methods can achieve in practice,” she said. “Furthermore, our analysis reveals that even knowledge found on the internet is often contradictory, in part because human-created training data may also contain hallucinations.”
An interim solution would be to simply program models to refuse to answer more often, which is the technical equivalent of telling a know-it-all to knock it off.
In the researchers' tests, Claude 3 Haiku answered only about 72% of the questions it was asked, choosing to abstain from the rest. Once those abstentions are taken into account, Claude 3 Haiku was actually the most factual model of all, at least in the sense that it lied the least often.
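To make that distinction concrete, here is a minimal sketch in Python, using made-up counts rather than figures from the study, of the two ways a model's factuality can be tallied: against every question it was asked, or only against the questions it chose to answer. A cautious model that abstains often can look worse on the first measure and better on the second.

```python
# Illustrative sketch only; the counts below are hypothetical, not results from the paper.

def factuality(correct: int, incorrect: int, abstained: int) -> dict:
    """Compare two ways of scoring a model's factuality."""
    answered = correct + incorrect
    total = answered + abstained
    return {
        # Share of all prompts answered correctly; abstentions drag this down.
        "correct_over_all_prompts": correct / total,
        # Share of attempted answers that were correct; abstentions are ignored,
        # so a model that only answers when confident scores higher here.
        "correct_over_answered": correct / answered if answered else 0.0,
        "response_rate": answered / total,
    }

# Hypothetical example: a cautious model vs. one that always answers.
print(factuality(correct=60, incorrect=12, abstained=28))  # cautious: 0.60 overall, ~0.83 when it answers
print(factuality(correct=65, incorrect=35, abstained=0))   # always answers: 0.65 on both measures
```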
But will people use a model that declines to answer many of their questions? Zhao doesn't think so, and says vendors should put more time and effort into research on reducing hallucinations. While hallucinations may never be eliminated entirely, she argues, they can be mitigated through human fact-checking and citations during a model's development.
“Policies and regulations need to be developed to ensure that human experts are always involved in the process of verifying and validating information produced by generative AI models,” Zhao added. “There are still many opportunities in this field to make a significant impact, such as developing advanced fact-checking tools for any free text, providing citations for factual content, and offering corrections to hallucinated text.”