The promise and dangers of synthetic data

By TechBrunch | December 24, 2024 | 9 min read


Is it possible to train an AI using only data generated by another AI? It might sound like a far-fetched idea, but the approach has been around for quite some time, and it is gaining traction as new, real data becomes increasingly difficult to obtain.

Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta used AI-generated data to fine-tune its Llama 3.1 models. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.

But why does AI need data in the first place? And what kind of data does it need? And can this data really be replaced with synthetic data?

Importance of annotations

AI systems are statistical machines. Trained on many examples, they learn the patterns in those examples to make predictions, such as learning that in an email, “to whom” usually precedes “it may concern.”

Annotations, typically text labeling the meaning or parts of the data these systems ingest, are a key piece of these examples. They serve as guideposts, “teaching” a model to distinguish things, places, and ideas.

Consider a photo classification model shown many photos of kitchens labeled with the word “kitchen.” As it trains, the model begins to associate “kitchen” with general characteristics of kitchens, such as containing a refrigerator and a countertop. After training, given a photo of a kitchen that wasn't among the initial examples, the model should be able to identify it as such. (Of course, if the kitchen photos were labeled “cow,” it would identify them as cows, which highlights the importance of good annotation.)
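To make the mechanics concrete, here is a minimal sketch of supervised learning from annotated examples, using scikit-learn and made-up feature vectors in place of real kitchen photos. It is an illustration of the general idea, not any particular vendor's pipeline.

```python
# Minimal sketch of supervised learning from annotated examples.
# The "photos" are stand-in feature vectors rather than real images,
# and the string labels play the role of human annotations.
from sklearn.linear_model import LogisticRegression

features = [
    [0.90, 0.80, 0.10],  # hypothetical embedding of a kitchen photo
    [0.85, 0.90, 0.20],  # another kitchen-like photo
    [0.10, 0.20, 0.90],  # photo of a cow in a field
    [0.20, 0.10, 0.95],  # another cow photo
]
labels = ["kitchen", "kitchen", "cow", "cow"]  # the annotations

model = LogisticRegression()
model.fit(features, labels)  # the labels "teach" the model the distinction

# A new, unseen kitchen-like photo should now be classified correctly --
# unless the training labels were wrong, in which case the error propagates.
print(model.predict([[0.88, 0.75, 0.15]]))  # expected: ['kitchen']
```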

The demand for AI, and the need for labeled data to develop it, has ballooned the market for annotation services. Dimension Market Research estimates that the market is worth $838.2 million today and will be worth $10.34 billion within the next 10 years. There are no precise estimates of how many people work in labeling jobs, but a 2022 paper puts the number in the “millions.”

Companies large and small rely on workers employed by data annotation firms to create labels for their AI training sets. Some of these jobs pay reasonably well, particularly when the labeling requires specialized knowledge (such as mathematical expertise). Others can be grueling: annotators in developing countries are paid, on average, only a few dollars an hour, with no benefits or guarantees of future work.

Drying data wells

So there are humanitarian reasons to seek alternatives to human-generated labels (Uber, for example, is expanding its fleet of gig workers to work on AI annotation and data labeling). But there are also practical ones.

Humans can only label so fast. Annotators also have biases that can show up in their annotations, and in turn in any models trained on them. Annotators make mistakes and stumble over labeling instructions. And paying humans to do things is expensive.

What's more, data is generally expensive. Shutterstock charges AI vendors tens of millions of dollars for access to its archives, while Reddit makes hundreds of millions of dollars licensing its data to the likes of Google and OpenAI.

Finally, data acquisition is also becoming more difficult.

Most models are trained on large collections of public data. Increasingly, data owners are choosing to withhold their data, fearing it will be plagiarized or that they won't receive credit or attribution for it. Currently, more than 35% of the world's top 1,000 websites block OpenAI's web scrapers. And one recent study found that approximately 25% of data from “high-quality” sources has been restricted from the major datasets used to train models.

Research group Epoch AI predicts that if the current trend of blocking access continues, developers will run out of data to train generative AI models between 2026 and 2032. That, combined with fears of copyright lawsuits and objectionable material making its way into open datasets, has forced a reckoning for AI vendors.

Synthetic substitutes

At first glance, it appears that synthetic data can solve all these problems. Need annotations? Generate them. More sample data? No problem. The sky is the limit.

And this is true to some extent.

“If data is the new oil, synthetic data is pitched as biofuel, creatable without the negative externalities of the real thing,” Os Keyes, a doctoral candidate at the University of Washington who studies the ethical implications of emerging technologies, told TechCrunch. “You can take a small starting set of data and simulate and extrapolate new entries from it.”

The AI industry has taken this concept and run with it.

Writer, an enterprise generative AI company, this month debuted Palmyra X 004, a model trained almost entirely on synthetic data. According to Writer, it cost just $700,000 to develop. In comparison, an OpenAI model of similar size had an estimated cost of $4.6 million.

Microsoft's Phi open models were trained partly on synthetic data, as were Google's Gemma models. Nvidia announced this summer a family of models designed to generate synthetic training data, and AI startup Hugging Face recently released what it says is the largest AI training dataset of synthetic text.

Synthetic data generation has become a business in itself, potentially reaching a value of $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects will be generated synthetically this year.

Luca Soldaini, a senior researcher at the Allen Institute for AI, said synthetic data techniques can be used to generate training data in a format that cannot be easily obtained through scraping (or content licensing). For example, in training the video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then adjusted to add details such as lighting descriptions.

Along these same lines, OpenAI says it used synthetic data to fine-tune GPT-4o and build sketchpad-like Canvas functionality for ChatGPT. And Amazon said it is generating synthetic data to supplement the real-world data used to train Alexa's voice recognition models.
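None of these companies have published their exact pipelines, but the general pattern they describe (a strong model drafts labels or captions, and humans review and enrich them before training) can be sketched roughly as below. This is an illustrative outline: caption_with_llm and human_review are hypothetical stand-ins for whatever model and review process a team actually uses.

```python
# Rough sketch of a synthetic-annotation pipeline: a generator model drafts
# captions, and humans review or correct them before the data is used to train.
from dataclasses import dataclass


@dataclass
class Example:
    clip_id: str
    caption: str
    human_reviewed: bool = False


def caption_with_llm(clip_id: str) -> str:
    # Placeholder: in practice this would call an LLM or captioning model.
    return f"A draft caption for clip {clip_id}."


def human_review(example: Example) -> Example:
    # Placeholder for the human pass that adds detail (e.g. lighting notes).
    example.caption += " (reviewed; lighting: soft daylight)"
    example.human_reviewed = True
    return example


def build_synthetic_dataset(clip_ids: list[str]) -> list[Example]:
    drafts = [Example(cid, caption_with_llm(cid)) for cid in clip_ids]
    return [human_review(d) for d in drafts]


if __name__ == "__main__":
    for ex in build_synthetic_dataset(["clip-001", "clip-002"]):
        print(ex)
```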

“Synthetic data models allow us to quickly extend human intuition about what data is needed to achieve a particular model behavior,” Soldaini said.

Synthetic risks

Synthetic data is not a panacea, however. It suffers from the same “garbage in, garbage out” problem as all AI. Models create synthetic data, and if the data used to train those models has biases and limitations, their outputs will be similarly tainted. For instance, groups that are poorly represented in the base data will be poorly represented in the synthetic data as well.

“The problem is, you can only do so much,” Keyes said. “Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle class or all light-skinned, that's what the ‘representative' data will all look like.”

To this point, a 2023 study by researchers at Rice University and Stanford University found that over-reliance on synthetic data during training can create models whose quality and diversity progressively decrease. Sampling bias (poor representation of the real world) causes a model's diversity to worsen after a few generations of training, according to the researchers (though they also found that mixing in a little real-world data helps mitigate this).

Keyes sees an additional risk in complex models such as OpenAI's o1: they can produce hallucinations that are harder to detect in synthetic data. These, in turn, can reduce the accuracy of models trained on that data, especially when the sources of the hallucinations aren't easy to identify.

“Complex models hallucinate; data produced by complex models contain hallucinations,” Keyes added. “And with a model like o1, the developers themselves can't necessarily explain why artifacts appear.”

Compounding hallucinations can lead to models that spout gibberish. A study published in the journal Nature reveals how models trained on error-ridden data generate even more error-ridden data, and how this feedback loop degrades future generations of models. Over generations, the researchers found, models lose their grasp of more esoteric knowledge, become more generic, and often produce answers irrelevant to the questions they are asked.

Image credit: Ilia Shumailov et al.

Follow-up studies have shown that other types of models, such as image generators, are not immune to this kind of collapse.

Image credit: Ilia Shumailov et al.
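The studies above concern large generative models, but the feedback loop they describe can be caricatured in a few lines of code: fit a simple model to data, sample new “training data” only from that model, and repeat. In this toy sketch (an assumption-laden illustration, not a reproduction of either study), each generation's samples are truncated at two standard deviations to mimic a generative model that under-represents rare events, so measured diversity shrinks generation after generation.

```python
# Toy illustration of a model-collapse feedback loop: each "generation" is a
# Gaussian fitted only to data produced by the previous generation's model.
# Truncating samples at two standard deviations mimics a generative model that
# under-samples rare events, so the tails -- the "esoteric knowledge" -- erode.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=5000)  # generation 0: "real" data, std = 1

for generation in range(1, 9):
    mu, sigma = data.mean(), data.std()           # fit this generation's "model"
    draws = rng.normal(mu, sigma, size=5000)      # sample synthetic training data
    data = draws[np.abs(draws - mu) < 2 * sigma]  # rare cases get dropped
    print(f"generation {generation}: std of training data = {data.std():.3f}")

# Mixing some generation-0 data back in at each step damps this drift,
# echoing the mitigation the Rice/Stanford researchers describe.
```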

Soldaini agrees that “raw” synthetic data can't be trusted, at least if the goal is to avoid training forgetful chatbots or homogeneous image generators. Using it “safely,” he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real-world data, just like any other dataset.

Failing to do so could eventually lead to model collapse, in which a model's outputs become less “creative” and more biased, eventually seriously compromising its functionality. Though this process could be identified and arrested before it gets serious, it is a risk.

“Researchers need to examine the data generated, repeat the generation process, and identify safeguards to remove low-quality data points,” Soldaini said. “Synthetic data pipelines are not self-improving machines; their output must be carefully inspected and improved before being used for training.”
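What “identify safeguards to remove low-quality data points” means in practice varies by team, but a minimal, purely hypothetical curation pass over synthetic text examples might drop empties, duplicates, and items flagged by a quality heuristic, along the lines below. The quality_score function here is an illustrative toy, not anything Soldaini or the Allen Institute describes.

```python
# Hypothetical curation pass over synthetic examples before they reach training.
# The quality heuristic is illustrative only; real pipelines use task-specific
# classifiers, human spot checks, and deduplication at much larger scale.
def quality_score(text: str) -> float:
    # Toy heuristic: penalize very short or highly repetitive outputs.
    words = text.split()
    if not words:
        return 0.0
    uniqueness = len(set(words)) / len(words)
    length_ok = 1.0 if len(words) >= 5 else 0.3
    return uniqueness * length_ok


def curate(synthetic_examples: list[str], threshold: float = 0.6) -> list[str]:
    seen: set[str] = set()
    kept = []
    for example in synthetic_examples:
        normalized = example.strip().lower()
        if not normalized or normalized in seen:  # drop empties and duplicates
            continue
        seen.add(normalized)
        if quality_score(example) >= threshold:   # drop low-quality items
            kept.append(example)
    return kept


if __name__ == "__main__":
    raw = [
        "A detailed caption describing the scene and lighting.",
        "A detailed caption describing the scene and lighting.",  # duplicate
        "bad bad bad bad bad",                                    # repetitive
        "",                                                        # empty
    ]
    print(curate(raw))  # only the first caption survives
```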

OpenAI CEO Sam Altman once argued that AI will someday generate synthetic data good enough to effectively train itself. But even assuming that's feasible, the technology doesn't exist yet. No major AI lab has released a model trained solely on synthetic data.

At least for the foreseeable future, it seems humans will need to stay in the loop somewhere to make sure a model's training doesn't go awry.

TechCrunch has a newsletter focused on AI. Sign up here to get it delivered to your inbox every Wednesday.

Update: This article was originally published on October 23rd and updated with details on December 24th.


