Data is at the heart of today's advanced AI systems, but it is becoming increasingly expensive and out of reach for all but the wealthiest tech companies.
Last year, OpenAI researcher James Betker wrote a post on his personal blog about the nature of generative AI models and the datasets they are trained on, arguing that the training data, rather than the model design, architecture, or other characteristics, is the key to enabling increasingly sophisticated and performant AI systems.
“If you train on the same dataset for long enough, nearly all models will converge to the same point,” Betker wrote.
Is Betker right? Is training data the biggest factor in determining what a model can do, whether that is answering a question, drawing a human hand, or generating a realistic cityscape?
That is certainly plausible.
Statistical machines
Generative AI systems are essentially probabilistic models (huge collections of statistics) that infer, based on huge numbers of examples, which data “makes the most sense” to place where (for example, the word “go” before “to the market” in the sentence “I go to the market”). So it makes intuitive sense that the more examples a system has to go on, the better systems trained on those examples will perform.
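To make the “collection of statistics” idea concrete, here is a minimal sketch of a bigram model, a toy that counts which word follows which and turns those counts into probabilities. It is an illustration only: the four-sentence corpus and the function names are invented for this example, and real generative models use neural networks that look at far more context than one preceding word.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration (not taken from any real dataset).
corpus = [
    "i go to the market",
    "i go to the park",
    "we go to the market",
    "i walk to the market",
]

def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Convert the raw follow-counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}

counts = train_bigrams(corpus)
probs = next_word_probs(counts, "i")

# The word the model considers most likely to follow "i".
best = max(probs, key=probs.get)
print(best, probs)
```

With more example sentences, the counts behind each probability grow, and the estimates become less sensitive to any single sentence. That is the intuition behind “more examples, better performance,” at least in this simplified setting.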
“The performance improvements seem to come from the data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), an AI research nonprofit, told TechCrunch, “at least once you have a stable training environment.”
Lo gave the example of Meta's Llama 3 text generation model, released earlier this year, which outperforms AI2's own OLMo model despite being architecturally very similar. Llama 3 was trained on much more data than OLMo, which Lo believes is why it beats OLMo on many common AI benchmarks.
(It’s worth pointing out here that the benchmarks currently in wide use in the AI industry aren’t necessarily the best measures of model performance, but outside of qualitative tests like ours, they are one of the few measures we can rely on.)
This is not to suggest that training on exponentially larger datasets is a surefire path to exponentially better models: Lo points out that models operate on a “garbage in, garbage out” paradigm, so data curation and quality are probably much more important than quantity itself.
“It's possible that a smaller model with carefully designed data can outperform a larger model,” he added. “For example, the larger Falcon 180B is ranked 63rd in the LMSYS benchmark, while the much smaller Llama 2 13B is ranked 56th.”
In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations were a big factor in why OpenAI's text-to-image model, DALL-E 3, produced better images than its predecessor, DALL-E 2. “I think that's the main source of improvement,” he said. “Text annotations are a lot better now than they were [with DALL-E 2]. You can't even compare it.”
Many AI models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that the model can learn to associate those labels with other observed characteristics of that data. For example, a model fed a large number of photos of cats, each annotated with its breed, will eventually “learn” to associate terms like bobtail and shorthair with their distinctive visual features.
Bad behavior
Experts like Lo worry that the growing emphasis on large, high-quality training datasets will concentrate AI development in the hands of the few companies with billion-dollar budgets that can afford to acquire them. Synthetic data or major innovations in underlying architectures could be game-changers, but neither seems likely in the near future.
“Across the board, organizations that control content that could be useful for AI development have an incentive to lock down that material,” Lo says. “And as access to the data closes, we're essentially blessing a few early movers on data acquisition and pulling up the ladder so that no one else can access the data and catch up.”
Indeed, when the race to collect more training data isn't leading to unethical (and possibly illegal) practices like the secret collection of copyrighted content, it's benefiting tech giants with deep pockets to spend on data licenses.
Generative AI models, such as those from OpenAI, are trained mostly on images, text, audio, video, and other data (some of which is copyrighted) taken from public web pages, including, problematically, AI-generated ones. The OpenAIs of the world argue that fair use protects them from legal retaliation. Many rights holders disagree, but at least for now, there isn't much they can do to stop the practice.
There are numerous examples of generative AI vendors using questionable means to obtain huge datasets to train their models: OpenAI reportedly transcribed more than a million hours of YouTube videos to feed into its flagship model, GPT-4, without YouTube's permission or that of the creators. Google recently expanded some of its terms of service to allow its AI products to use publicly available Google Docs, restaurant reviews on Google Maps, and other online material, and Meta is said to have considered training its models on IP-protected content, even at the risk of litigation.
Meanwhile, companies large and small rely on workers in third-world countries, paid just a few dollars an hour, to annotate their training sets. Some of these annotators, employed by mammoth startups like Scale AI, work literal days on end, without benefits or any guarantee of future work, to complete tasks that expose them to graphic depictions of violence and gore.
Rising costs
In other words, even the more aboveboard data deals do not exactly foster an open and equitable generative AI ecosystem.
OpenAI has spent hundreds of millions of dollars licensing content from news publishers, stock media libraries and others to train its AI models, far more than most academic research groups, nonprofits or startups can afford. Meta even considered buying publisher Simon & Schuster for the rights to e-book excerpts (Simon & Schuster was eventually sold to private equity firm KKR for $1.62 billion in 2023).
The market for AI training data is expected to grow from about $2.5 billion today to nearly $30 billion within a decade, and data brokers and platforms are rushing to charge top dollar, in some cases over the objections of their user base.
Stock media library Shutterstock has signed deals with AI vendors ranging from $25 million to $50 million, and Reddit claims to have made hundreds of millions of dollars licensing data to organizations like Google and OpenAI. From Photobucket to Tumblr to Q&A site Stack Overflow, it seems that few platforms with a wealth of data accumulated organically over the years haven't struck deals with generative AI developers.
The data is the platforms' to sell, at least depending on which legal arguments you believe. But in most cases, users don't see a penny of the profits, and that is harming the wider AI research community.
“Small businesses will not be able to afford these data licenses and will therefore be unable to develop and research AI models,” Lo said. “We are concerned that this will lead to a lack of independent oversight of AI development practices.”
Independent initiatives
If there is one ray of light in the darkness, it is a handful of independent, non-profit efforts to create massive datasets that anyone can use to train generative AI models.
EleutherAI, a grassroots non-profit research group that began as a loose Discord collective in 2020, is collaborating with the University of Toronto, AI2, and independent researchers to create The Pile v2, a set of billions of text passages sourced primarily from the public domain.
In April, AI startup Hugging Face released FineWeb, a filtered version of Common Crawl, the eponymous dataset maintained by the nonprofit Common Crawl. The dataset consists of billions of web pages, and Hugging Face claims that it improves model performance on a number of benchmarks.
A few efforts to release training datasets openly, such as the LAION group's image sets, have run up against copyright, data privacy, and other equally serious ethical and legal challenges. However, some of the more dedicated data curators have pledged to do better. For example, The Pile v2 removes problematic copyrighted material found in The Pile, the dataset on which it is based.
The question is whether these open efforts can keep pace with big tech companies. As long as collecting and curating data remains a matter of resources, the answer is probably no, at least until some research breakthrough levels the playing field.