Research suggests that even the best AI models hallucinate

By TechBrunch | August 14, 2024 | 5 min read


All generative AI models, from Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o, hallucinate. In other words, the models are unreliable narrators, sometimes to hilarious effect, sometimes to problematic effect.

But not all models lie at the same rate, and the kinds of falsehoods they produce vary depending on the information they've been exposed to.

A recent study by researchers from Cornell University, the University of Washington, the University of Waterloo, and the nonprofit research institute AI2 attempted to benchmark models like GPT-4o against authoritative sources on a wide range of topics, from law and health to history and geography. They found that no model performed exceptionally well on all topics, and that the models that hallucinated the least did so because they avoided answering questions that they would normally get wrong.

“The most important lesson from our work is that we still cannot fully trust the output of model generation,” Wenting Zhao, a doctoral student at Cornell University and co-author of the study, told TechCrunch. “Right now, even the best models can only generate hallucination-free text about 35% of the time.”

There have been other academic attempts to explore the “factuality” of models, including by other AI2-related teams, but Zhao points out that these early tests asked the models questions that could easily be answered on Wikipedia — not particularly difficult questions, given that most models are trained on Wikipedia data.

To make the benchmark more challenging and to more accurately reflect the types of questions people ask the model, the researchers identified topics on the web that have no Wikipedia references: Over half of the test questions could not be answered on Wikipedia (though they did include some questions taken from Wikipedia just to be safe), and touched on topics such as culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrity.

For their study, the researchers evaluated more than a dozen popular models, including those released in the past year. In addition to GPT-4o, they also tested “open” models like Meta's Llama 3 70B, Mistral's Mixtral 8x22B, and Cohere's Command R+, as well as gated-API models like Perplexity's Sonar Large (based on Llama), Google's Gemini 1.5 Pro, and Anthropic's Claude 3 Opus.

The results suggest that, despite claims to the contrary from OpenAI, Anthropic, and other major generative AI companies, recent models are not hallucinating much less than their predecessors.

GPT-4o and OpenAI's much older flagship GPT-3.5 performed roughly the same in terms of the percentage of questions answered correctly based on facts in the benchmark (GPT-4o did slightly better). OpenAI's model was the least hallucinatory overall, followed by Mixtral 8x22B, Command R, and Perplexity's Sonar model.

Questions about celebrities and finance were the most difficult for the models, while questions about geography and computer science were the easiest for the models to answer (probably because the training data contained many references to them). When the answer source was not Wikipedia, all models gave less factual answers on average (especially GPT-3.5 and GPT-4o), suggesting that all models were heavily influenced by Wikipedia content.

Even models that can search the web for information, like Command R and Perplexity's Sonar model, struggled with “non-Wiki” questions in the benchmark. Model size didn't matter much, with smaller models (e.g., Anthropic's Claude 3 Haiku) hallucinating about as often as larger, apparently more capable models (e.g., Claude 3 Opus).

So what does this mean, and where are the improvements that vendors promise?

Well, vendors may simply be overstating their claims, but a more charitable view is that the benchmarks they use are not suited to measuring factuality. As I've written before, many, if not most, AI evaluations are episodic and lack important context, making them destined to fall victim to Goodhart's Law.

Either way, Zhao said she expects the hallucination problem to "continue for a long time."

“The experimental results in our paper show that, despite the promise of certain methods to reduce or eliminate hallucinations, there are limits to the improvements these methods can achieve in practice,” she said. “Furthermore, our analysis reveals that even knowledge found on the internet is often contradictory, in part because human-created training data may also contain hallucinations.”

An interim solution would be simply programming models to refuse to answer more often, the technical equivalent of telling a know-it-all to knock it off.

In the researchers' tests, Claude 3 Haiku answered only about 72% of the questions asked, abstaining on the rest. Accounting for those abstentions, Claude 3 Haiku was actually the most factual model of all, at least in the sense that it lied the least often.
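The trade-off here can be sketched as a simple scoring function: abstentions lower a model's response rate but also lower its hallucination rate, since a question it declines to answer is one it cannot get wrong. This is a minimal illustrative sketch, not the paper's actual methodology; the outcome labels, function name, and numbers are hypothetical.

```python
# Sketch: scoring factuality when a model may abstain.
# Each outcome is one of "correct", "wrong", or "abstain" (hypothetical labels).

def factuality_scores(outcomes):
    total = len(outcomes)
    answered = [o for o in outcomes if o != "abstain"]
    correct = sum(1 for o in outcomes if o == "correct")
    wrong = sum(1 for o in outcomes if o == "wrong")
    return {
        # Fraction of questions the model was willing to answer.
        "response_rate": len(answered) / total,
        # Accuracy among answered questions only.
        "accuracy_when_answered": correct / len(answered) if answered else 0.0,
        # Wrong answers over ALL questions: abstaining drives this down.
        "hallucination_rate": wrong / total,
    }

# Hypothetical model that answers 72% of 100 questions and abstains on the rest:
outcomes = ["correct"] * 60 + ["wrong"] * 12 + ["abstain"] * 28
scores = factuality_scores(outcomes)
```

Under this kind of scoring, a frequently abstaining model can rank as "most factual" despite modest accuracy, which is exactly why the researchers question whether such a model is actually useful.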

But will people use a model that doesn't answer many questions? Zhao doesn't think so, and says vendors should put more time and effort into researching ways to mitigate hallucinations. While hallucinations may not be possible to eliminate entirely, she argues they can be mitigated by human fact-checking and citations during model development.

“Policies and regulations need to be developed to ensure that human experts are always involved in the process of verifying and validating information produced by generative AI models,” Zhao added. “There are still many opportunities in this field to make a significant impact, such as developing advanced fact-checking tools for any free text, providing citations for factual content, and offering corrections to hallucinated text.”
