The promise and dangers of synthetic data

By TechBrunch | October 13, 2024 | 8 min read


Is it possible to train an AI using only data generated by another AI? It may sound like a far-fetched idea, but synthetic data has been around for quite some time, and it is gaining traction as new, real data becomes increasingly difficult to obtain.

Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta used AI-generated data to fine-tune its Llama 3.1 models. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.

But why does AI need data in the first place? What kind of data does it need? And can that data really be replaced with synthetic data?

The importance of annotations

AI systems are statistical machines. Trained on many examples, they learn the patterns in those examples in order to make predictions, such as that in an email, “to whom” usually comes before “it may concern.”

Annotations, typically text labeling the meaning of the data these systems ingest or identifying its parts, are a key piece of these examples. They serve as guideposts, “teaching” a model to distinguish among objects, places, and ideas.

Consider a photo classification model shown many photos of kitchens, each labeled with the word “kitchen.” During training, the model begins to associate “kitchen” with general characteristics of kitchens, such as containing a refrigerator and a countertop. After training, given a photo of a kitchen that was not among the initial examples, the model should identify it as such. (Of course, if the kitchen photos were labeled “cow,” it would identify them as cows, which highlights the importance of good annotation.)
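The mechanics can be sketched in a few lines of code. Below is a toy nearest-centroid classifier over invented feature vectors (our illustration, not how production photo classifiers actually work): the labels attached to the training examples are the only thing steering what the model learns, so swapping the “kitchen” labels for “cow” flips its predictions.

```python
# Toy sketch: a nearest-centroid "classifier" trained on labeled feature
# vectors. The features and labels are made up purely for illustration.

def train(examples):
    """examples: list of (feature_vector, label). Returns label -> centroid."""
    sums, counts = {}, {}
    for vec, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [v / counts[lbl] for v in acc] for lbl, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to vec."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: sq_dist(centroids[lbl], vec))

# Invented features: [has_refrigerator, has_countertop, outdoors]
kitchen_photos = [([1.0, 1.0, 0.0], "kitchen"), ([1.0, 0.8, 0.0], "kitchen")]
cow_photos = [([0.0, 0.0, 1.0], "cow"), ([0.0, 0.1, 1.0], "cow")]

model = train(kitchen_photos + cow_photos)
print(predict(model, [0.9, 1.0, 0.0]))  # a new kitchen-like photo -> "kitchen"

# Swap the labels and the very same photo is now "identified" as a cow:
mislabeled = train([(v, "cow") for v, _ in kitchen_photos] +
                   [(v, "kitchen") for v, _ in cow_photos])
print(predict(mislabeled, [0.9, 1.0, 0.0]))  # -> "cow"
```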

The demand for AI, and the labeled data its development requires, has ballooned the market for annotation services. Dimension Market Research estimates that the market is worth $838.2 million today and will be worth $10.34 billion within the next 10 years. While there are no precise estimates of how many people are engaged in labeling work, a 2022 paper puts the number in the “millions.”

Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay quite well, particularly when the labeling requires specialized knowledge (such as math expertise). Others can be grueling: annotators in developing countries are paid, on average, only a few dollars an hour, with no benefits or guarantees of future work.

Data wells are drying up

There is therefore a humanitarian reason to seek alternatives to human-created labels. But there are also practical ones.

Humans can only label so quickly. Annotators also have biases that can show up in their annotations and, in turn, in the models trained on them. Annotators make mistakes and stumble over labeling instructions. And paying humans to do things is expensive.

What's more, data is generally expensive. Shutterstock charges AI vendors tens of millions of dollars for access to its archives, while Reddit makes hundreds of millions of dollars licensing its data to the likes of Google and OpenAI.

Finally, data acquisition is also becoming more difficult.

Most models are trained on large collections of public data. Data owners are increasingly choosing to gate their data for fear of it being stolen or not receiving credit or attribution. Over 35% of the world's top 1,000 websites currently block OpenAI's web scrapers. Additionally, one recent study found that approximately 25% of data from “high-quality” sources is restricted from the primary datasets used to train models.
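Mechanically, much of this gating happens through robots.txt, where a site disallows specific crawlers by user-agent string. A small sketch using Python's standard-library robotparser; the policy text below is an invented example, not taken from any real site:

```python
# Sketch of how a site blocks an AI crawler by user agent in robots.txt.
# The policy below is a made-up example, not any real site's file.
from urllib import robotparser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# OpenAI's crawler is refused; other agents may still fetch the page.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article")) # True
```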

Research group Epoch AI predicts that if current trends in blocking access continue, developers will run out of data to train generative AI models between 2026 and 2032. Add to this concerns about copyright lawsuits and objectionable material making its way into open datasets, and AI vendors face a reckoning.

Synthetic substitutes

At first glance, it appears that synthetic data can solve all these problems. Need annotations? Generate them. More sample data? No problem. The sky is the limit.

And this is true to some extent.

“If data is the new oil, then synthetic data is being touted as a biofuel that can be created without any real negative externalities,” Os Keyes, a doctoral candidate at the University of Washington who studies the ethical implications of emerging technologies, told TechCrunch. “You can take a small starting set of data and then simulate and extrapolate new entries from there.”

The AI industry has taken this concept and run with it.

Writer, an enterprise generative AI company, this month debuted Palmyra X 004, a model trained almost entirely on synthetic data. According to Writer, its development cost was just $700,000, compared to an estimated cost of $4.6 million for a comparable OpenAI model.

Microsoft's Phi open models were trained partly on synthetic data, as were Google's Gemma models. Nvidia this summer announced a family of models designed to generate synthetic training data, and AI startup Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.

Synthetic data generation has become a business in itself, potentially reaching a value of $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects will be generated synthetically this year.

Luca Soldaini, a senior researcher at the Allen Institute for AI, said synthetic data techniques can be used to generate training data in a format that cannot be easily obtained through scraping (or content licensing). For example, in training the video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then adjusted to add details such as lighting descriptions.

Along these same lines, OpenAI says it used synthetic data to fine-tune GPT-4o and build sketchpad-like Canvas functionality for ChatGPT. And Amazon said it is generating synthetic data to supplement the real-world data used to train Alexa's voice recognition models.

“Synthetic data models allow us to quickly extend human intuition about what data is needed to achieve a particular model behavior,” Soldaini said.

Synthetic risks

However, synthetic data is not a panacea. All AI suffers from the same “garbage in, garbage out” problem. Models create synthetic data, and if the data used to train those models has biases and limitations, their outputs will be similarly tainted. For instance, groups poorly represented in the base data will be poorly represented in the synthetic data.

“The problem is, you can only do so much,” Keyes said. “Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle class or all light-skinned, that's what the ‘representative' data will all look like.”

Case in point: a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models with “gradually decreasing quality and diversity.” Sampling bias (poor representation of the real world) causes a model's diversity to worsen after a few generations of training, the researchers found (though they also found that mixing in a bit of real-world data can mitigate this).

Keyes sees an additional risk in complex models such as OpenAI's o1: they can produce hallucinations that are harder to detect in their synthetic data. These, in turn, can reduce the accuracy of models trained on that data, especially when the causes of the hallucinations are hard to pin down.

“Complex models hallucinate; data produced by complex models contains hallucinations,” Keyes added. “And with models like o1, the developers themselves can't necessarily explain why artifacts appear.”

Compounding hallucinations can lead to models that spout gibberish. A study published in the journal Nature reveals how models trained on error-ridden data generate even more error-ridden data, and how this feedback loop degrades future generations of models. The researchers found that, over successive generations, models lose their grasp of more esoteric knowledge, become more generic, and often produce answers irrelevant to the questions they are asked.

Image credit: Ilia Shumailov et al.

Follow-up studies have shown that other types of models, such as image generators, are not immune to this type of collapse either.

Image credit: Ilia Shumailov et al.
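The feedback loop is easy to reproduce in miniature. The toy simulation below (our own illustration with invented tokens, not the Nature study's code) repeatedly refits a categorical “model” to samples drawn from the previous model; any token that goes unsampled in one generation can never reappear, so diversity only ever shrinks:

```python
# Toy model-collapse loop: train generation N+1 only on generation N's output.
import random
from collections import Counter

random.seed(0)

# Generation 0: "real" data with common and rare tokens (all invented).
real_data = ["the"] * 60 + ["cat"] * 30 + ["ocelot"] * 10

def fit(samples):
    """'Train' a model: here, just the empirical token distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(model, n):
    """Sample n synthetic tokens from the fitted distribution."""
    toks = list(model)
    return random.choices(toks, weights=[model[t] for t in toks], k=n)

model = fit(real_data)
for generation in range(50):
    synthetic = generate(model, 100)  # the next model sees only synthetic data
    model = fit(synthetic)

# Surviving vocabulary is a (typically strict) subset of the original one;
# rare tokens like "ocelot" are the first to disappear.
print(sorted(model))
```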

Soldaini agrees that “raw” synthetic data cannot be trusted, at least if the goal is to avoid training forgetful chatbots or homogeneous image generators. Using it “safely,” he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real-world data, just like any other dataset.

Failing to do so could eventually lead to model collapse, in which a model becomes less “creative” and more biased in its outputs, eventually seriously compromising its functionality. Although this process could be identified and arrested before it gets serious, it is a risk.

“Researchers need to examine the data generated, repeat the generation process, and identify safeguards to remove low-quality data points,” Soldaini said. “Synthetic data pipelines are not self-improving machines; their output must be carefully inspected and improved before being used for training.”
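In practice, that inspection can start with simple, rule-based screening before any synthetic example reaches a training mix. The sketch below is our own, with hypothetical heuristics and function names, not any lab's actual pipeline:

```python
# Hypothetical screening step for synthetic text: drop degenerate lengths,
# drop duplicates, and cap the synthetic share of the final training mix.

def filter_synthetic(candidates, min_words=3, max_words=50):
    seen, kept = set(), []
    for text in candidates:
        words = text.split()
        if not (min_words <= len(words) <= max_words):
            continue  # degenerate output (too short or runaway generation)
        key = " ".join(words).lower()
        if key in seen:
            continue  # near-verbatim duplicate
        seen.add(key)
        kept.append(text)
    return kept

def build_training_mix(real, synthetic, max_synthetic_ratio=0.5):
    """Cap the synthetic share so real data still anchors the distribution."""
    budget = int(len(real) * max_synthetic_ratio / (1 - max_synthetic_ratio))
    return real + filter_synthetic(synthetic)[:budget]

synthetic = [
    "the kitchen has a refrigerator and a countertop",
    "the kitchen has a refrigerator and a countertop",  # duplicate
    "kitchen",                                          # too short
]
print(filter_synthetic(synthetic))  # only the first caption survives
```

Real pipelines would layer model-based quality scoring and human review on top of heuristics this crude.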

OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to effectively train itself. But even assuming that is possible, the technology doesn't exist yet; no major AI lab has released a model trained solely on synthetic data.

At least for the foreseeable future, it seems humans will need to stay in the loop somewhere to make sure a model's training doesn't go awry.


