DatologyAI is building technology to automatically curate AI training datasets

By TechBrunch | February 22, 2024 | 7 Mins Read


Large training data sets are the gateway to powerful AI models, but they are often those models' downfall as well.

Biases emerge from prejudicial patterns hidden in large data sets, such as photos of mostly white CEOs in an image classification set. Large data sets can also be messy, arriving in formats that are incomprehensible to a model and that contain a lot of noise and extraneous information.

In a recent Deloitte survey of companies adopting AI, 40% said that data-related challenges, including thoroughly preparing and cleaning data, were among the top concerns hindering their AI efforts. A separate survey of data scientists found that approximately 45% of the scientists' time is spent on data preparation tasks such as “loading” and cleaning data.

Ari Morcos, who has worked in the AI industry for nearly a decade, wanted to abstract away much of the data preparation process involved in training AI models, and he founded a startup to do just that.

Morcos' company, DatologyAI, builds tools to automatically curate data sets like those used to train OpenAI's ChatGPT, Google's Gemini, and other GenAI models. According to Morcos, the platform can identify which data is most important depending on a model's application (such as composing emails), as well as ways the data set can be augmented with additional data and how it should be batched, or split into more manageable chunks, during model training.

“Models are what they eat. A model is a reflection of the data on which it is trained,” Morcos told TechCrunch in an email interview. “However, not all data is created equal, and some training data is much more useful than others. Training a model in the right way, on the right data, can have a dramatic impact on the resulting model.”

Morcos holds a Ph.D. in neuroscience from Harvard University and spent two years at DeepMind applying neurology-inspired techniques to understand and improve AI models. He then spent five years at Meta's AI lab, where he uncovered some of the fundamental mechanisms underlying how models function. Together with co-founders Matthew Leavitt and Bogdan Gaza, a former head of engineering at Amazon and then Twitter, Morcos launched DatologyAI with the aim of streamlining all forms of AI data set curation.

As Morcos points out, the composition of a training data set affects nearly every characteristic of the model trained on it, from the model's performance on tasks to its size and the depth of its domain knowledge. More efficient data sets can reduce training time and yield smaller models, saving on compute costs, while models trained on data sets that include a particularly diverse range of samples can, generally speaking, handle difficult requests better.

With interest in GenAI, which is notoriously expensive, at an all-time high, AI deployment costs are at the forefront of executives' minds.

Many companies are either fine-tuning existing models (including open source models) to suit their purposes, or opting for managed vendor services via APIs. However, for reasons such as governance and compliance, some companies choose to build models from scratch on custom data, spending up to millions of dollars in compute to train and run those models.

“Enterprises have collected treasure troves of data and want to train efficient, high-performance, specialized AI models that can maximize the benefit to their business,” Morcos said. “However, these huge data sets are extremely difficult to use effectively, and using them incorrectly leads to worse-performing models that take longer to train and [are larger] than necessary.”

DatologyAI can scale up to “petabytes” of data in any format, including text, images, video, audio, tabular, or more “exotic” modalities such as genomic or geospatial, and can be deployed to a customer's infrastructure, either on-premises or in a virtual private cloud. This sets it apart from other data preparation and curation tools such as CleanLab, Lilac, Labelbox, YData, and Galileo, which tend to be more limited in the scope and types of data they can process, Morcos claims.

DatologyAI can also determine which “concepts” in a data set are more complex (e.g., concepts related to U.S. history in an educational chatbot training set) and therefore require higher-quality samples, as well as which data might cause a model to behave in unintended ways.

“Solving [these problems] requires automatically identifying concepts, their complexity, and how much redundancy is actually needed,” Morcos said. “Data augmentation, often using other models or synthetic data, is very powerful, but it must be done in a careful and targeted manner.”
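To make the idea of redundancy concrete, here is a minimal sketch of one simple way to thin out near-duplicate samples. It is not DatologyAI's method, which the company has not described in detail. It assumes each sample has already been turned into an embedding vector, and the function names, the similarity threshold, and the toy data are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_redundant(embeddings, threshold=0.95):
    """Greedy redundancy filter (illustrative assumption, not a vendor API).

    Walks the samples in order and keeps a sample only if its similarity
    to every previously kept sample is at or below `threshold`.
    Returns the indices of the kept samples.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector is almost identical to the first.
toy = [
    [1.0, 0.0],
    [0.99, 0.01],
    [0.0, 1.0],
    [0.7, 0.7],
]

print(prune_redundant(toy))  # indices of the samples that survive pruning
```

A lower threshold removes more data and a higher one keeps more, which is the trade-off Morcos alludes to: some redundancy is useful, and how much depends on how complex the underlying concept is.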

The question is, how effective is DatologyAI's technology? There are reasons to be skeptical. History has shown that automated data curation, no matter how sophisticated the methodology or the diversity of the data, does not always work as intended.

LAION, a German nonprofit spearheading a number of GenAI projects, was forced to take down an algorithmically curated AI training data set after it was found to contain images of child sexual abuse. Elsewhere, models such as ChatGPT, which are trained on a mixture of data sets filtered manually and automatically for toxicity, have been shown to produce toxic content when given certain prompts.

Some experts would argue that there is no escaping manual curation, at least if you want to achieve good results with AI models. From AWS to Google to OpenAI, today's largest vendors rely on teams of human experts and (sometimes underpaid) annotators to shape and refine their training data sets.

Morcos insists that DatologyAI's tools are not intended to replace manual curation entirely, but rather to offer suggestions that data scientists may not have thought of, especially suggestions that bear on the problem of trimming the size of training data sets. He is something of an authority on the subject. Trimming data sets while preserving model performance was the focus of an academic paper Morcos co-authored in 2022 with researchers from Stanford University and the University of Tübingen, which won a best paper award at that year's NeurIPS machine learning conference.

“Identifying the right data at scale is extremely difficult and is a cutting-edge research challenge,” Morcos said. “[Our approach] dramatically speeds up model training while improving performance on downstream tasks.”

DatologyAI's technology was evidently promising enough to draw seed investment from prominent figures in tech and AI, including Google Chief Scientist Jeff Dean, Meta Chief AI Scientist Yann LeCun, and Quora founder and OpenAI board member Adam D'Angelo, a group that includes people credited with developing some of the most important techniques at the heart of modern AI.

The other angel investors in DatologyAI's $11.65 million seed round, which was led by Amplify Partners with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital, were Cohere co-founders Aidan Gomez and Ivan Zhang, Contextual AI founder Douwe Kiela, former Intel AI vice president Naveen Rao, and Jascha Sohl-Dickstein, one of the inventors of generative diffusion models. This is an impressive list of AI luminaries, to say the least, and suggests there may be something to Morcos' claims.

“A model is only as good as the data it is trained on, but identifying the right training data among billions or even trillions of samples is an incredibly difficult problem,” LeCun told TechCrunch in an emailed statement. “Ari and the team at DatologyAI are among the world's experts on this problem, and I believe the product they're building to make high-quality data curation available to anyone who wants to train a model will be critical to helping AI work for everyone.”

San Francisco-based DatologyAI currently has 10 employees, including the co-founders, and plans to expand to about 25 by the end of the year if it reaches certain growth milestones.

I asked Morcos if these milestones were related to customer acquisition, but he declined to say. And, rather strangely, he would not reveal the size of DatologyAI's current customer base.


