Large training data sets are the gateway to powerful AI models, but they are also often those models' undoing.
Biases emerge from prejudicial patterns concealed in large data sets, such as pictures of mostly white CEOs in an image classification set. And large data sets can be messy, arriving in formats that are incomprehensible to a model and full of noise and extraneous information.
In a recent Deloitte survey of companies adopting AI, 40% said that data-related challenges, including thoroughly preparing and cleaning data, were among the top concerns hampering their AI initiatives. A separate poll of data scientists found that roughly 45% of their time is spent on data preparation tasks, such as “loading” and cleaning data.
Ari Morcos, who has worked in the AI industry for nearly a decade, wanted to abstract away much of the data preparation process involved in training AI models, and he founded a startup to do just that.
Morcos' company, DatologyAI, builds tools to automatically curate data sets like those used to train OpenAI's ChatGPT, Google's Gemini and other GenAI models. According to Morcos, the platform can identify which data is most important depending on a model's application (composing an email, say), as well as suggest ways to augment a data set with additional data and how to batch it, or divide it into more manageable chunks, during model training.
“Models are what they eat; a model is a reflection of the data it's trained on,” Morcos told TechCrunch in an email interview. “However, not all data are created equal, and some training data is far more useful than the rest. Training a model in the right way, on the right data, can have a dramatic impact on the resulting model.”
Morcos, who holds a Ph.D. in neuroscience from Harvard University, spent two years at DeepMind applying neurology-inspired techniques to understand and improve AI models, and five years at Meta's AI lab, where he uncovered some of the basic mechanisms underlying how models function. Together with co-founders Matthew Leavitt and Bogdan Gaza, formerly an engineering lead at Amazon and then at Twitter, Morcos launched DatologyAI to streamline the curation of AI data sets in all their forms.
As Morcos points out, the composition of a training data set affects nearly every characteristic of a model trained on it, from the model's performance on tasks to its size and the depth of its domain knowledge. More efficient data sets can cut training time and yield smaller models, saving on compute costs, while data sets with an especially diverse range of samples can handle difficult requests more capably (generally speaking).
With interest in GenAI, which is notoriously expensive, at an all-time high, executives have AI deployment costs at the front of their minds.
Many companies either fine-tune existing models (including open source ones) for their purposes or opt for managed vendor services via APIs. But some, for reasons such as governance and compliance, choose to build models from scratch on custom data, spending tens of thousands to millions of dollars on the compute needed to train and run them.
“Enterprises have collected troves of data and want to train efficient, high-performing, specialized AI models that can maximize the benefit to their business,” Morcos said. “However, making effective use of these massive data sets is incredibly challenging and, if done incorrectly, leads to worse-performing models that take longer to train [and are larger] than necessary.”
DatologyAI can scale up to “petabytes” of data in any format, whether text, images, video, audio, tabular or more “exotic” modalities such as genomic and geospatial data, and deploys to a customer's infrastructure, either on-premises or via a virtual private cloud. That sets it apart from other data preparation and curation tools such as CleanLab, Lilac, Labelbox, YData and Galileo, which tend to be more limited in scope and in the types of data they can process, Morcos claims.
DatologyAI can also determine which “concepts” within a data set (for example, concepts relating to U.S. history in a training set for an educational chatbot) are more complex and therefore require higher-quality samples, as well as which data might cause a model to behave in unintended ways.
“Solving [these problems] requires automatically identifying concepts, their complexity and how much redundancy is actually warranted,” Morcos said. “Data augmentation, often using other models or synthetic data, is incredibly powerful, but must be done in a careful, targeted manner.”
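Morcos didn't detail DatologyAI's methods, but a common building block for this kind of curation is embedding-based redundancy detection: cluster the embeddings of a data set's examples and flag near-duplicates within each cluster. Below is a minimal, illustrative sketch of that idea; the function name, threshold and clustering choices are assumptions for demonstration, not DatologyAI's actual API.

```python
# Illustrative sketch: flag near-duplicate examples by clustering
# embeddings and comparing cosine similarity within each cluster.
# Names and thresholds are assumptions, not DatologyAI's API.
import numpy as np
from sklearn.cluster import KMeans

def flag_redundant(embeddings, n_clusters=10, sim_threshold=0.95):
    """Return a boolean mask marking examples judged redundant."""
    # Normalize rows so dot products equal cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)

    redundant = np.zeros(len(emb), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = emb[idx] @ emb[idx].T  # pairwise cosine similarities
        for i in range(len(idx)):
            if redundant[idx[i]]:
                continue
            # Keep the first of each near-duplicate pair, drop the rest.
            dupes = np.where(sims[i] > sim_threshold)[0]
            redundant[idx[dupes[dupes > i]]] = True
    return redundant

# Toy usage: 500 random "embeddings" plus 50 near-copies of the first 50.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 128))
near_dupes = base[:50] + 0.01 * rng.normal(size=(50, 128))
X = np.vstack([base, near_dupes])
mask = flag_redundant(X)
print(f"{mask.sum()} of {len(X)} examples flagged as redundant")
```

In a real pipeline, the embeddings would come from a pretrained encoder rather than random vectors, and the flagged examples would be dropped or down-weighted before training.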
The question is, how effective is DatologyAI's technology? There are reasons to be skeptical. History has shown that automated data curation doesn't always work as intended, however sophisticated the method or diverse the data.
LAION, a German nonprofit that spearheads a number of GenAI projects, was forced to take down an algorithmically curated AI training data set after it was found to contain images of child sexual abuse. Elsewhere, models such as ChatGPT, trained on data sets filtered for toxicity both manually and automatically, have been shown to generate toxic content when given certain prompts.
Some experts would argue that there's no escaping manual curation, at least not if you want strong results from an AI model. Today's largest vendors, from AWS to Google to OpenAI, rely on teams of human experts and (sometimes underpaid) annotators to shape and refine their training data sets.
Morcos insists that DatologyAI's tools aren't meant to replace manual curation entirely, but rather to offer suggestions that might not occur to data scientists, particularly around the problem of trimming training data set sizes. He is something of an authority here: trimming data sets while preserving model performance was the focus of an academic paper Morcos co-authored with Stanford University and University of Tübingen researchers in 2022, which won the best paper award at that year's NeurIPS machine learning conference.
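A core idea in that paper was a self-supervised pruning metric: embed each example, cluster the embeddings with k-means, and score examples by their distance to the nearest centroid, treating far-from-centroid examples as “hard” and, when data is plentiful, as the ones worth keeping. The sketch below is a rough reconstruction of that recipe, not the authors' code; the function name and parameters are placeholders.

```python
# Rough reconstruction of self-supervised data pruning: score examples
# by distance to their k-means centroid and keep the hardest fraction.
# Not the paper's code; embeddings would come from a pretrained encoder.
import numpy as np
from sklearn.cluster import KMeans

def prune_by_difficulty(embeddings, keep_frac=0.7, n_clusters=100):
    """Return indices of the examples to keep after pruning."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    # Distance to the assigned centroid is the difficulty proxy:
    # prototypical (close-to-centroid) examples are pruned first.
    dists = np.linalg.norm(
        embeddings - km.cluster_centers_[km.labels_], axis=1)
    n_keep = int(len(embeddings) * keep_frac)
    return np.argsort(dists)[-n_keep:]  # indices of the hardest examples

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 64))  # stand-in for real encoder embeddings
kept = prune_by_difficulty(X)
print(f"kept {len(kept)} of {len(X)} examples")
```

Notably, the paper found that the best pruning strategy depends on scale: with abundant data, keeping hard examples works best, while with scarce data, it's safer to keep the easy ones.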
“Identifying the right data at scale is extremely challenging and a frontier research problem,” Morcos said. “[Our approach] dramatically speeds up model training while simultaneously improving performance on downstream tasks.”
DatologyAI's pitch won over some of the people credited with developing the most important technologies at the heart of modern AI, including Jeff Dean, chief scientist at Google; Yann LeCun, chief AI scientist at Meta; and Adam D'Angelo, Quora's founder and an OpenAI board member.
They are among the angel investors in DatologyAI's $11.65 million seed round, led by Amplify Partners with participation from Radical Ventures, Conviction Capital, Outset Capital and Quiet Capital. Other angels include Cohere co-founders Aidan Gomez and Ivan Zhang; Douwe Kiela, founder of Contextual AI; Naveen Rao, formerly Intel's VP of AI; and Jascha Sohl-Dickstein, one of the inventors of generative diffusion models. It's an impressive list of AI luminaries, to say the least, and suggests there may be something to Morcos' claims.
“A model is only as good as the data it's trained on, but identifying the right training data among billions or trillions of examples is an incredibly challenging problem,” LeCun told TechCrunch in an emailed statement. “Ari and his team at DatologyAI are among the world's experts on this problem, and I believe the product they're building to make high-quality data curation available to anyone who wants to train a model is critical to helping make AI work for everyone.”
San Francisco-based DatologyAI currently has a 10-person team, co-founders included, and plans to grow to around 25 employees by the end of the year if it hits certain growth milestones.
I asked Morcos whether those milestones relate to customer acquisition, but he declined to say. Rather curiously, he also wouldn't reveal the size of DatologyAI's current customer base.