AI training data comes at a price only big tech companies can afford

By TechBrunch | June 1, 2024 | 8 min read

Data is at the heart of today's advanced AI systems, but it is becoming increasingly expensive and out of reach for all but the wealthiest tech companies.

Last year, OpenAI researcher James Betker wrote a post on his personal blog about the nature of generative AI models and the datasets they are trained on, arguing that the training data, rather than the model design, architecture, or other characteristics, is the key to enabling increasingly sophisticated and performant AI systems.

“If you train on the same dataset for long enough, nearly all models will converge to the same point,” Betker wrote.

Is Betker right? Is training data the biggest factor in determining what a model can do, whether that's answering a question, drawing a human hand, or generating a realistic cityscape?

That is certainly plausible.

Statistical Machine

Generative AI systems are essentially probabilistic models – huge collections of statistics – that infer, based on huge numbers of examples, what data “makes the most sense” to place where (for example, the word “go” before “to the market” in the sentence “I go to the market”). So it makes intuitive sense that the more examples a model has to use, the better a model trained on those examples will perform.
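The "statistical machine" idea can be made concrete with a toy bigram model: count which word follows which in a corpus, then emit the most frequent continuation. This is only an illustrative sketch, not how production models work (they use neural networks trained over vastly larger corpora):

```python
from collections import Counter, defaultdict

# A tiny corpus standing in for the "huge numbers of examples"
corpus = "i go to the market . i go to the park . i walk to the market .".split()

# Count bigrams: how often each word follows each other word
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_likely_next(word):
    """Return the continuation seen most often after `word`."""
    return follows[word].most_common(1)[0][0]

print(most_likely_next("go"))   # "to"     — follows "go" in every example
print(most_likely_next("the"))  # "market" — seen twice, vs. "park" once
```

With more example sentences, these counts become better estimates of what "makes the most sense" next, which is the intuition behind more data yielding a better model.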

“The performance improvements seem to come from the data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), an AI research nonprofit, told TechCrunch, “at least once you have a stable training environment.”

Lo gave the example of Meta's Llama 3 text-generation model, released earlier this year, which outperforms AI2's own OLMo model despite the two being architecturally very similar. Llama 3 was trained on far more data than OLMo, which Lo believes explains its lead on many popular AI benchmarks.

(It’s worth pointing out here that the benchmarks currently in wide use in the AI industry aren’t necessarily the best measures of model performance, but outside of qualitative tests like ours, they are one of the few measures we can rely on.)

This is not to suggest that training on exponentially larger datasets is a surefire path to exponentially better models: Lo points out that models operate on a “garbage in, garbage out” paradigm, so data curation and quality are probably much more important than quantity itself.

“It's possible that a smaller model with carefully designed data can outperform a larger model,” he added. “For example, the larger Falcon 180B is ranked 63rd in the LMSYS benchmark, while the much smaller Llama 2 13B is ranked 56th.”

In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations were a big factor in why OpenAI's text-to-image model, DALL-E 3, produced better image quality than its predecessor, DALL-E 2. “I think that's the main source of improvement,” he said. “The text annotations are a lot better now than they were before [with DALL-E 2]. You can't even compare it.”

Many AI models, including DALL-E 3 and DALL-E 2, are trained by human annotators who label data so that the model can learn to associate those labels with other observed characteristics of that data. For example, a model fed a large number of photos of cats, each with annotated breeds, will eventually “learn” to associate terms like bobtail and shorthair with their distinctive visual features.
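That label-association process can be sketched with a nearest-centroid toy classifier. The feature pairs and breed labels below are invented for illustration, standing in for visual features a real model would extract from photos:

```python
from collections import defaultdict

# Hypothetical annotated data: (tail_length, fur_length) feature pairs
# labeled with a breed, mimicking human-annotated cat photos.
labeled_photos = [
    ((0.1, 0.3), "bobtail"),    # short tail, medium fur
    ((0.2, 0.4), "bobtail"),
    ((0.9, 0.2), "shorthair"),  # long tail, short fur
    ((1.0, 0.3), "shorthair"),
]

# "Learn" each label's association by averaging the features seen with it.
sums = defaultdict(lambda: [0.0, 0.0, 0])
for (tail, fur), breed in labeled_photos:
    s = sums[breed]
    s[0] += tail; s[1] += fur; s[2] += 1
centroids = {b: (s[0] / s[2], s[1] / s[2]) for b, s in sums.items()}

def classify(features):
    """Predict the label whose averaged features are closest."""
    return min(centroids,
               key=lambda b: sum((a - c) ** 2
                                 for a, c in zip(features, centroids[b])))

print(classify((0.15, 0.35)))  # "bobtail"
```

More labeled examples per breed sharpen each centroid, which is why annotation volume and quality both matter.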

Bad behavior

Experts like Lo worry that the increasing emphasis on large, high-quality training datasets will concentrate AI development in the hands of a few companies with billion-dollar budgets that can afford them. Synthetic data or major innovations in underlying architectures could be game-changers, but neither seem likely in the near future.

“Across the board, organizations that control content that could be useful for AI development have an incentive to lock that material down,” Lo said. “And as access to the data closes off, we're essentially blessing a few early movers in data acquisition and pulling up the ladder so that no one else can get the data to catch up.”

Indeed, when the race to collect more training data isn't leading to unethical (and possibly illegal) practices like the secret collection of copyrighted content, it's benefiting tech giants with deep pockets to spend on data licenses.

Generative AI models, such as OpenAI's, are trained mostly on images, text, audio, video, and other data scraped from public web pages, some of it copyrighted (and some of it, problematically, AI-generated). OpenAI and its peers argue that the fair use doctrine shields them from legal reprisal. Many rights holders disagree, but for now there is little they can do to stop the practice.

There are numerous examples of generative AI vendors using questionable means to obtain huge datasets to train their models: OpenAI reportedly transcribed more than a million hours of YouTube videos to feed into its flagship model, GPT-4, without YouTube's permission or that of the creators. Google recently expanded some of its terms of service to allow its AI products to use publicly available Google Docs, restaurant reviews on Google Maps, and other online material, and Meta is said to have considered training its models on IP-protected content, even at the risk of litigation.

Meanwhile, companies large and small rely on workers in developing countries who are paid just a few dollars an hour to annotate their training sets. Some of these annotators, employed by massive startups like Scale AI, work days on end without benefits or any guarantee of future work, completing tasks that expose them to graphic depictions of violence and gore.

Rising costs

In other words, even the fairer data deals are not fostering an open and equitable generative AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, stock media libraries and others to train its AI models — far more than most academic research groups, nonprofits or startups can afford. Meta even considered buying publisher Simon & Schuster for the rights to e-book excerpts (it was eventually sold to private equity firm KKR for $1.62 billion in 2023).

The market for AI training data is expected to grow from about $2.5 billion today to nearly $30 billion within a decade, and data brokers and platforms are rushing to charge top dollar, in some cases over the objections of their user base.
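Growing from roughly $2.5 billion to nearly $30 billion in a decade implies a compound annual growth rate of about 28 percent, which is quick to check:

```python
# Implied compound annual growth rate (CAGR) for the AI training data
# market: ~$2.5B today growing to ~$30B over ten years.
start, end, years = 2.5e9, 30e9, 10
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.1%}")  # roughly 28% per year
```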

Stock media library Shutterstock has signed deals with AI vendors worth between $25 million and $50 million, and Reddit claims to have made hundreds of millions of dollars licensing data to organizations such as Google and OpenAI. Seemingly few platforms with troves of data accumulated organically over the years, from Photobucket to Tumblr to Q&A site Stack Overflow, have declined to strike deals with generative AI developers.

The data is the platforms' to sell, at least according to whichever legal argument you believe, but in most cases users don't see a penny of the proceeds. And it's harming the wider AI research community.

“Small businesses will not be able to afford these data licenses and will therefore be unable to develop and research AI models,” Lo said. “We are concerned that this will lead to a lack of independent oversight of AI development practices.”

Independent initiatives

If there is one ray of light in the darkness, it is a handful of independent, non-profit efforts to create massive datasets that anyone can use to train generative AI models.

EleutherAI, a grassroots non-profit research group that began as a loose Discord collective in 2020, is collaborating with the University of Toronto, AI2, and independent researchers to create The Pile v2, a set of billions of text sentences collected primarily from the public domain.

In April, AI startup Hugging Face released FineWeb, a filtered version of the Common Crawl dataset maintained by the nonprofit of the same name. The dataset consists of billions of web pages, and Hugging Face claims training on it improves model performance on a number of benchmarks.
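Filtering a raw web crawl generally means applying heuristic quality rules to each page's text. The rules and thresholds below are invented for illustration and are not FineWeb's actual pipeline:

```python
def looks_like_quality_text(text: str) -> bool:
    """Crude heuristics of the kind used to filter web-crawl text.
    These thresholds are invented for illustration."""
    words = text.split()
    if len(words) < 5:                       # too short to be useful prose
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive, likely spam
        return False
    alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
    return alpha > 0.6                       # mostly letters, not markup debris

pages = [
    "buy buy buy buy buy buy buy buy buy buy",
    "Researchers released a filtered dataset of billions of web pages today.",
    "<<<### 404 ###>>>",
]
kept = [p for p in pages if looks_like_quality_text(p)]
print(kept)  # only the prose sentence survives
```

Real pipelines layer many more signals (language ID, deduplication, toxicity filters), but the principle is the same: quality gates on each document before it enters the training set.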

There have been other efforts to make training datasets publicly available, such as the LAION group's image sets, but they have run into copyright, data privacy, and other equally serious ethical and legal challenges. Some of the more dedicated data curators have pledged to do better, however. The Pile v2, for example, removes problematic copyrighted material found in The Pile, the dataset it is based on.

The question is whether these open efforts can keep pace with big tech companies: As long as collecting and organizing data is a resource issue, the answer is probably no—at least until some research breakthrough levels the playing field.

