Research suggests that even the best AI models hallucinate

By TechBrunch | August 14, 2024 | 5 min read


All generative AI models, from Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o, hallucinate. In other words, the models are unreliable narrators, sometimes to hilarious effect, sometimes to problematic effect.

But not all models lie at the same rate, and the kinds of falsehoods they produce depend on the information they've been exposed to.

A recent study by researchers from Cornell University, the University of Washington, the University of Waterloo, and the nonprofit research institute AI2 attempted to benchmark models like GPT-4o against authoritative sources on a wide range of topics, from law and health to history and geography. They found that no model performed exceptionally well on all topics, and that the models that hallucinated the least did so because they avoided answering questions that they would normally get wrong.

“The most important lesson from our work is that we still cannot fully trust the output of model generation,” Wenting Zhao, a doctoral student at Cornell University and co-author of the study, told TechCrunch. “Right now, even the best models can only generate hallucination-free text about 35% of the time.”

There have been other academic attempts to explore the “factuality” of models, including by other AI2-related teams, but Zhao points out that these early tests asked the models questions that could easily be answered on Wikipedia — not particularly difficult questions, given that most models are trained on Wikipedia data.

To make the benchmark more challenging and to more accurately reflect the types of questions people ask the model, the researchers identified topics on the web that have no Wikipedia references: Over half of the test questions could not be answered on Wikipedia (though they did include some questions taken from Wikipedia just to be safe), and touched on topics such as culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrity.

For their study, the researchers evaluated more than a dozen popular models, including those released in the past year. In addition to GPT-4o, they also tested “open” models like Meta's Llama 3 70B, Mistral's Mixtral 8x22B, and Cohere's Command R+, as well as gated-API models like Perplexity's Sonar Large (based on Llama), Google's Gemini 1.5 Pro, and Anthropic's Claude 3 Opus.

The results suggest that, despite claims to the contrary from OpenAI, Anthropic, and other major generative AI companies, recent models are not hallucinating much less than their predecessors.

GPT-4o and OpenAI's much older flagship GPT-3.5 performed roughly the same on the benchmark in terms of the percentage of questions answered factually (GPT-4o did slightly better). OpenAI's models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R, and Perplexity's Sonar models.

Questions about celebrities and finance were the most difficult for the models, while questions about geography and computer science were the easiest for the models to answer (probably because the training data contained many references to them). When the answer source was not Wikipedia, all models gave less factual answers on average (especially GPT-3.5 and GPT-4o), suggesting that all models were heavily influenced by Wikipedia content.

Even models that can search the web for information, like Command R and Perplexity's Sonar model, struggled with “non-Wiki” questions in the benchmark. Model size didn't matter much, with smaller models (e.g., Anthropic's Claude 3 Haiku) hallucinating about as often as larger, apparently more capable models (e.g., Claude 3 Opus).

So what does this mean, and where are the improvements that vendors promise?

Well, the uncharitable reading is that vendors are overstating their claims; a more charitable one is that the benchmarks they use are simply not suited to measuring hallucination. As I've written before, many, if not most, AI evaluations are episodic and lack important context, making them destined to fall victim to Goodhart's Law.

Either way, Zhao said she expects the hallucination problem to "continue for a long time."

“The experimental results in our paper show that, despite the promise of certain methods to reduce or eliminate hallucinations, there are limits to the improvements these methods can achieve in practice,” she said. “Furthermore, our analysis reveals that even knowledge found on the internet is often contradictory, in part because human-created training data may also contain hallucinations.”

An interim solution would be to simply program models to refuse to answer more often, the technical equivalent of telling a know-it-all to knock it off.

In the researchers' tests, Claude 3 Haiku answered only about 72% of the questions it was asked, abstaining from the rest. Once those abstentions are taken into account, Claude 3 Haiku was the most factual model of all, at least in the sense that it lied least often.
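The trade-off here can be made concrete: a model that abstains often may look more factual simply because fewer of its answers have the chance to be wrong. A minimal sketch of the two ways of scoring such a model (the function name and the numbers are illustrative, not from the study):

```python
def factuality_stats(factual: int, hallucinated: int, abstained: int) -> dict:
    """Score a model on a factuality benchmark in two ways:
    counting abstentions against it, and ignoring them."""
    total = factual + hallucinated + abstained
    answered = factual + hallucinated
    return {
        # Share of all questions answered factually (abstentions hurt the score).
        "raw_accuracy": factual / total,
        # Share of attempted answers that were factual (abstentions are ignored).
        "accuracy_when_answering": factual / answered if answered else 0.0,
        # Fraction of questions the model was willing to answer at all.
        "response_rate": answered / total,
    }

# Illustrative numbers only: a model that attempts 72% of 1,000 questions
# and gets 600 of those attempts right looks mediocre on raw accuracy
# but strong when judged only on the questions it chose to answer.
stats = factuality_stats(factual=600, hallucinated=120, abstained=280)
```

A cautious model can thus top an abstention-adjusted leaderboard while declining many of the questions users actually care about, which is exactly the tension Zhao raises below.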

But will people use a model that doesn't answer many questions? Zhao doesn't think so, and says vendors should put more time and effort into researching ways to mitigate hallucinations. While hallucinations may not be possible to eliminate entirely, she argues they can be mitigated by human fact-checking and citations during model development.

“Policies and regulations need to be developed to ensure that human experts are always involved in the process of verifying and validating information produced by generative AI models,” Zhao added. “There are still many opportunities in this field to make a significant impact, such as developing advanced fact-checking tools for any free text, providing citations for factual content, and offering corrections to hallucinated text.”


