A YouTube creator is seeking to bring a class action lawsuit against OpenAI, alleging that the company used millions of transcripts of YouTube videos to train its generative AI models without notifying or compensating the videos' owners.
In a complaint filed last Friday in the U.S. District Court for the Northern District of California, lawyers for Massachusetts-based YouTube user David Millett allege that OpenAI secretly transcribed videos belonging to Millett and other creators to train the models that power its AI chatbot platform, ChatGPT, and its other generative AI tools and products. By collecting this data, OpenAI “significantly profited” from the creators' work while violating copyright law and YouTube's terms of service, which prohibit the use of videos in applications independent of YouTube's service, according to the complaint.
“As [OpenAI’s] AI products become more sophisticated through the use of training datasets, they become more valuable to potential and current users who purchase subscriptions to access them,” the complaint reads, “but much of the material contained in OpenAI's training datasets comes from works that OpenAI copied without consent, credit, or compensation.”
Millett, who is represented by the law firm Bursor & Fisher, is seeking a jury trial and more than $5 million in damages on behalf of all YouTube users whose data may have been collected for OpenAI's training.
Generative AI models like OpenAI's have no actual intelligence. Fed a huge number of examples (movies, audio recordings, essays, etc.), a model “learns” how likely data is to occur based on patterns, including the context of any surrounding data.
Most models are trained with data taken from public websites or web datasets. Companies argue that fair use protects their efforts to indiscriminately collect data and use it to train commercial models. But many copyright holders disagree and have filed lawsuits to block the practice.
Video transcripts have become an important component of training data as other data sources become scarce.
According to data from Originality.AI, more than 35% of the world's top 1,000 websites currently block OpenAI's web crawler. And a study by MIT's Data Provenance Initiative found that roughly 25% of the data from “high-quality” sources has been restricted from the major datasets used to train AI models. If current trends in access blocking continue, research group Epoch AI predicts that developers will run out of data to train generative AI models between 2026 and 2032.
The New York Times reported in April that OpenAI created its first speech recognition model, Whisper, to transcribe audio from videos and gather additional training data. According to the Times, an OpenAI team that included OpenAI president Greg Brockman used Whisper to transcribe more than 1 million hours of YouTube videos and then used the transcripts to train OpenAI's text generation and analysis model, GPT-4.
According to the Times, some OpenAI staffers discussed how such a move could violate YouTube's rules.
In July, Proof News reported that companies including Anthropic, Apple, Salesforce, and Nvidia had trained generative AI models using a dataset called The Pile, which contains subtitles from hundreds of thousands of YouTube videos. Many YouTube creators whose subtitles were included in The Pile were unaware of this and had not consented to it. Apple later issued a statement saying it had no intention of using these models for AI features in its products.
YouTube's parent company, Google, is also looking to use transcripts to train models.
Last year, Google broadened its Terms of Service (ToS) to allow it to use more user data to train its generative AI models. The old ToS left it unclear whether Google could use YouTube data to build products outside of its video platform. The new ToS is considerably more permissive.
We have reached out to OpenAI and Google for comment on the class action lawsuit and will update this article if we hear back.
It's been a rough start to the month for OpenAI.
Tesla and X CEO Elon Musk filed a new lawsuit against OpenAI and CEO Sam Altman on Monday, accusing the company of abandoning its non-profit mission by reserving some of its most sophisticated technology for commercial customers. Musk made the same claims in a lawsuit he filed against OpenAI in February, but the new suit also alleges that OpenAI has engaged in fraudulent practices.