AI training data comes at a price only big tech companies can afford

By TechBrunch | June 1, 2024 | 8 Mins Read


Data is at the heart of today's advanced AI systems, but it is becoming increasingly expensive and out of reach for all but the wealthiest tech companies.

Last year, OpenAI researcher James Betker wrote a post on his personal blog about the nature of generative AI models and the datasets they are trained on, arguing that the training data, rather than the model design, architecture, or other characteristics, is the key to enabling increasingly sophisticated and performant AI systems.

“If you train on the same dataset for long enough, nearly all models will converge to the same point,” Betker wrote.

Is Betker right? Is training data the biggest factor in determining what a model can do, whether that's answering a question, drawing a human hand, or generating a realistic cityscape?

That is certainly plausible.

Statistical Machine

Generative AI systems are essentially probabilistic models – huge collections of statistics – that infer, based on huge numbers of examples, which data “makes the most sense” to place where (for example, the word “go” before “to the market” in the sentence “I go to the market”). So it makes intuitive sense that the more examples a model has to draw on, the better it will perform.
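
To make that intuition concrete, here is a minimal sketch, not from the article, of a toy next-word predictor in Python: it simply counts which word follows which in its training text, and its guesses get sharper as the number of examples grows.

```python
from collections import Counter, defaultdict

# Toy training corpus; real models ingest trillions of tokens.
corpus = "i go to the market . i go to the park . i walk to the market .".split()

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def most_likely_next(word: str) -> str:
    """Return the word that most often followed `word` in training."""
    return bigram_counts[word].most_common(1)[0][0]

print(most_likely_next("go"))   # 'to'     -- "go" was always followed by "to"
print(most_likely_next("the"))  # 'market' -- seen more often than 'park'
```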

“The performance improvements seem to come from the data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), an AI research nonprofit, told TechCrunch, “at least once you have a stable training environment.”

Lo gave the example of Meta's Llama 3 text generation model, released earlier this year, which outperforms AI2's own OLMo model, despite being architecturally very similar. Llama 3 was trained on much more data than OLMo, which Lo believes is why it outperforms in many common AI benchmarks.

(It's worth pointing out here that the benchmarks currently in wide use in the AI industry aren't necessarily the best measures of model performance, but outside of qualitative tests like ours, they are one of the few measures we can rely on.)

This is not to suggest that training on exponentially larger datasets is a surefire path to exponentially better models: Lo points out that models operate on a “garbage in, garbage out” paradigm, so data curation and quality are probably much more important than quantity itself.

“It's possible that a smaller model with carefully designed data can outperform a larger model,” he added. “For example, the larger Falcon 180B is ranked 63rd in the LMSYS benchmark, while the much smaller Llama 2 13B is ranked 56th.”

In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations were a big factor in why OpenAI's text-to-image model, DALL-E 3, produces better image quality than its predecessor, DALL-E 2. “I think that's the main source of the improvement,” he said. “The text annotations are a lot better than they were [with DALL-E 2]. You can't even compare them.”

Many AI models, including DALL-E 3 and DALL-E 2, are trained by human annotators who label data so that the model can learn to associate those labels with other observed characteristics of that data. For example, a model fed a large number of photos of cats, each with annotated breeds, will eventually “learn” to associate terms like bobtail and shorthair with their distinctive visual features.
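
The sketch below is a deliberately simplified stand-in for that process (the feature vectors and labels are invented, and real vision models are far more complex): it “learns” one prototype feature vector per annotated breed and labels new examples by proximity.

```python
from collections import defaultdict

# Stand-in for annotated data: each "image" is reduced to an invented
# two-number feature vector, paired with a human-supplied breed label.
labeled_data = [
    ([0.9, 0.1], "bobtail"),
    ([0.8, 0.2], "bobtail"),
    ([0.1, 0.9], "shorthair"),
    ([0.2, 0.8], "shorthair"),
]

# "Training": average the feature vectors seen under each label.
sums = defaultdict(lambda: [0.0, 0.0])
counts = defaultdict(int)
for features, label in labeled_data:
    sums[label] = [s + f for s, f in zip(sums[label], features)]
    counts[label] += 1
prototypes = {lbl: [s / counts[lbl] for s in sums[lbl]] for lbl in sums}

def predict(features):
    """Assign the label whose learned prototype is nearest the new example."""
    return min(prototypes, key=lambda lbl: sum(
        (a - b) ** 2 for a, b in zip(prototypes[lbl], features)))

print(predict([0.85, 0.15]))  # 'bobtail'
```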

Bad behavior

Experts like Lo worry that the increasing emphasis on large, high-quality training datasets will concentrate AI development in the hands of a few companies with billion-dollar budgets that can afford them. Synthetic data or major innovations in underlying architectures could be game-changers, but neither seems likely in the near future.

“Across the board, organizations that control content that could be useful for AI development have an incentive to lock down that material,” Lo says. “And as access to the data closes, we're essentially blessing a few early movers in data acquisition and pulling up the ladder so that no one else can access the data and catch up.”

Indeed, when the race to collect more training data isn't leading to unethical (and possibly illegal) practices like the secret collection of copyrighted content, it's benefiting tech giants with deep pockets to spend on data licenses.

Generative AI models, such as those from OpenAI, are trained primarily on images, text, audio, video, and other data, some of it copyrighted, scraped from public web pages (and, problematically, some of it AI-generated). OpenAI and other vendors argue that fair use protects them from legal reprisal. Many rights holders disagree, but at least for now, there isn't much they can do to stop the practice.

There are numerous examples of generative AI vendors using questionable means to obtain huge datasets to train their models: OpenAI reportedly transcribed more than a million hours of YouTube videos to feed into its flagship model, GPT-4, without YouTube's permission or that of the creators. Google recently expanded some of its terms of service to allow its AI products to use publicly available Google Docs, restaurant reviews on Google Maps, and other online material, and Meta is said to have considered training its models on IP-protected content, even at the risk of litigation.

Meanwhile, companies large and small rely on workers in developing countries who are paid only a few dollars an hour to annotate their training sets. Some of these annotators, employed by massive startups like Scale AI, work literal days without benefits or any guarantee of future work to complete tasks that expose them to graphic depictions of violence and gore.

Rising costs

In other words, even the more aboveboard data deals aren't exactly fostering an open and equitable generative AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, stock media libraries and others to train its AI models — far more than most academic research groups, nonprofits or startups can afford. Meta even considered buying publisher Simon & Schuster for the rights to e-book excerpts (it was eventually sold to private equity firm KKR for $1.62 billion in 2023).

The market for AI training data is expected to grow from about $2.5 billion today to nearly $30 billion within a decade, and data brokers and platforms are rushing to charge top dollar, in some cases over the objections of their user base.
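
(For context on that projection: growing from $2.5 billion to $30 billion over ten years implies a compound annual growth rate of roughly (30 / 2.5)^(1/10) − 1 ≈ 28%.)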

Stock media library Shutterstock has signed deals with AI vendors ranging from $25 million to $50 million, and Reddit claims to have made hundreds of millions of dollars licensing data to organizations like Google and OpenAI. Few platforms with a wealth of data accumulated organically over the years, from Photobucket to Tumblr to Q&A site Stack Overflow, haven't struck deals with generative AI developers.

The data is the platforms' to sell, or at least that's the claim, depending on which legal arguments you believe. But in most cases, users don't see a penny of the profits, and the rising prices are hurting the wider AI research community.

“Small businesses will not be able to afford these data licenses and will therefore be unable to develop and research AI models,” Lo said. “We are concerned that this will lead to a lack of independent oversight of AI development practices.”

Independent initiatives

If there is one ray of light in the darkness, it is a handful of independent, non-profit efforts to create massive datasets that anyone can use to train generative AI models.

EleutherAI, a grassroots non-profit research group that began as a loose Discord collective in 2020, is collaborating with the University of Toronto, AI2, and independent researchers to create The Pile v2, a dataset of billions of text passages sourced primarily from the public domain.

In April, AI startup Hugging Face released FineWeb, a filtered version of the Common Crawl dataset maintained by the nonprofit of the same name. The dataset consists of billions of web pages, and Hugging Face claims it improves model performance on a number of benchmarks.
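
FineWeb's actual pipeline applies many more signals, but the general shape of this kind of curation, keeping pages that pass quality heuristics and discarding the rest, might be sketched like this (the thresholds and rules below are invented for illustration):

```python
def passes_quality_filters(page_text: str) -> bool:
    """Toy web-text filter; production pipelines use far more signals."""
    words = page_text.split()
    if len(words) < 50:  # too short to be useful prose
        return False
    if len(set(words)) / len(words) < 0.3:  # highly repetitive boilerplate
        return False
    letters = sum(c.isalpha() for c in page_text)
    if letters / max(len(page_text), 1) < 0.6:  # mostly markup or symbols
        return False
    return True

pages = [
    "click here click here " * 20,  # repetitive spam: filtered out
    ("Data curation matters because raw web crawls mix useful articles with "
     "spam, navigation menus, and broken markup. Careful filtering keeps "
     "informative prose while discarding noise, which is why smaller curated "
     "corpora can train surprisingly capable models compared to much larger "
     "unfiltered dumps of text scraped indiscriminately from the web today."),
]
curated = [p for p in pages if passes_quality_filters(p)]
print(len(curated))  # 1 -- only the coherent page survives
```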

There are other efforts to make training datasets publicly available, such as the LAION group's image sets, but they run up against copyright, data privacy, and other equally serious ethical and legal challenges. Some of the more dedicated data curators have pledged to do better, however. The Pile v2, for example, removes problematic copyrighted material found in The Pile, the dataset on which it was based.

The question is whether these open efforts can keep pace with big tech companies: As long as collecting and organizing data is a resource issue, the answer is probably no—at least until some research breakthrough levels the playing field.



