OpenAI hasn't disclosed exactly what data it used to train Sora, its video generation AI. But from the looks of it, at least some of the data may have come from Twitch streams and game walkthroughs.
Sora was released on Monday and I've been playing around with it a bit (as capacity issues allow). Sora can generate videos up to 20 seconds long in various aspect ratios and resolutions from text prompts or images.
When OpenAI first unveiled Sora in February, it alluded to having trained the model on Minecraft videos. So I wondered what other video game playthroughs might be lurking in the training set.
It seems like quite a number.
Sora can produce videos of what is essentially a clone of Super Mario Bros. (with glitches).
Image credit: OpenAI
It can create gameplay footage of a first-person shooter that looks inspired by Call of Duty and Counter-Strike.
Image credit: OpenAI
And it can spit out a clip showing an arcade fighter in the style of a '90s Teenage Mutant Ninja Turtles game.
Image credit: OpenAI
Sora also seems to have an idea of what a Twitch stream should look like, implying that it has watched a few. Check out the screenshot below, which gets the broad strokes right.
Screenshot of a video generated using Sora. Image credit: OpenAI
Another notable thing about the screenshot: it features the likeness of popular Twitch streamer Raúl Álvarez Genes, who goes by the name Auronplay, right down to the tattoo on Genes' left forearm.
Auronplay isn't the only Twitch streamer that Sora seems to “know.” It also generated a video of a character similar in appearance (with some artistic liberties) to Imane Anys, better known as Pokimane.
Image credit: OpenAI
Admittedly, I had to get creative with some of the prompts (like the “Italian Plumber Game”). OpenAI has implemented filtering to prevent Sora from producing clips depicting trademarked characters. For example, if you type “Mortal Kombat 1 gameplay,” Sora won't generate anything resembling the title.
However, my testing suggests that game content may have found its way into Sora's training data.
OpenAI has been cagey about where it gets its training data. In an interview with the Wall Street Journal in March, OpenAI's then-CTO Mira Murati would not outright deny that Sora had been trained on content from YouTube, Instagram, and Facebook. And OpenAI acknowledged in Sora's technical specifications that it used “publicly available” data, along with licensed data from stock media libraries like Shutterstock, to develop Sora.
OpenAI did not respond to a request for comment.
If game content is indeed included in Sora's training set, there could be legal implications, especially if OpenAI builds more interactive experiences on top of Sora.
“Companies that use unauthorized footage of video game playthroughs for training are running a lot of risk,” Joshua Weigensberg, an intellectual property attorney at Pryor Cashman, told TechCrunch. “Training a generative AI model typically involves copying the training data. If that data is playthrough video of a game, there is an overwhelming chance that the training set contains copyrighted material.”
Probabilistic models
Generative AI models like Sora are probabilistic. They are trained on large amounts of data and learn patterns in that data to make predictions. For example, such a model learns that when a person bites into a hamburger, a bite mark is left behind.
This is a useful property. It allows the model to “learn” to some extent how the world works by observing it. But it can also be an Achilles' heel. When prompted in a certain way, models (many of which are trained on public web data) produce near-copies of their training examples.
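To make that trade-off concrete, here is a toy sketch. It is not how Sora or any production video model works; it is a tiny word-level bigram model, and the function names (`train_bigram`, `next_word_probs`, `generate`) and the four-sentence corpus are all made up for illustration. It shows both halves of the point above: the model learns a general pattern as a probability, and it also reproduces a one-off training sentence word for word when prompted the right way.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

def generate(counts, start, max_words=10):
    """Greedy generation: always pick the most likely next word."""
    words = [start]
    while len(words) < max_words and counts.get(words[-1]):
        words.append(counts[words[-1]].most_common(1)[0][0])
    return " ".join(words)

# Hypothetical training data: three similar sentences and one unique one.
corpus = [
    "person bites hamburger leaving bite mark",
    "person bites apple leaving bite mark",
    "person bites hamburger leaving crumbs",
    "streamer waves at chat",
]
counts = train_bigram(corpus)

# A learned pattern: "leaving" is usually followed by "bite".
print(next_word_probs(counts, "leaving"))

# Regurgitation: a prompt that appears in only one training sentence
# makes the model replay that sentence verbatim.
print(generate(counts, "streamer"))
```

The second call is the Achilles' heel in miniature: because "streamer" appears in only one training sentence, the most likely continuation at every step is that sentence itself.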
This is a sample from Sora. Image credit: OpenAI
This is understandably upsetting to creators whose work has been swept up in training without their permission. A growing number are seeking redress through the court system.
Microsoft and OpenAI are currently being sued for allegedly allowing their AI tools to regurgitate licensed code. Midjourney, Runway, and Stability AI, makers of popular AI art apps, are the targets of a lawsuit accusing them of violating artists' rights. And major music labels are suing Udio and Suno, two startups developing AI-powered music generators, for copyright infringement.
Many AI companies have long claimed fair use protections, arguing that their models produce transformative works rather than plagiarized ones. For example, Suno argues that indiscriminate training is no different from “a kid listening to that genre and writing his own rock song.”
But game content has unique considerations, said Evan Everist, a lawyer specializing in copyright law at Dorsey & Whitney.
“Playthrough videos include at least two layers of copyright protection: game content owned by the game developer, and proprietary video created by the player or videographer to capture the player's experience,” Everist told TechCrunch via email. “Also, for some games, there may be a third layer of rights in the form of user-generated content that appears in the software.”
Everist cited the example of Epic's Fortnite, which allows players to create their own game maps and share them for others to use. A video playthrough of one of these maps would involve at least three copyright owners, he said: (1) Epic, (2) the user of the map, and (3) the creator of the map.
This is a sample from Sora. Image credit: OpenAI
“If a court finds copyright liability for training AI models, each of these copyright holders could become a plaintiff or license source,” Everist said. “For developers training AI on videos like this, the risk exposure increases exponentially.”
Weigensberg noted that the games themselves have many “protectable” elements, such as unique textures, that judges might consider in IP cases. “Unless these works are properly licensed, training on these works may be copyright infringement,” he said.
TechCrunch reached out to a number of game studios and publishers for comment, including Epic, Microsoft (which owns Minecraft), Ubisoft, Nintendo, Roblox, and Cyberpunk developer CD Projekt Red. Few responded, and none gave a statement on the record.
A spokesperson for CD Projekt Red said, “We are unable to respond to interviews at this time.” EA told TechCrunch that it has “no comment at this time.”
Risky output
AI companies could win in these legal disputes. Following the precedent of the publishing industry's lawsuit against Google nearly a decade ago, a court could decide that generative AI has “very compelling transformative purposes.”
In that case, the court ruled that it was permissible for Google to copy millions of books for Google Books, a type of digital archive. Authors and publishers had tried to argue that reproducing their intellectual property online constituted infringement.
However, a ruling in favor of AI companies does not necessarily protect users from accusations of wrongdoing. If a generative model regurgitates a copyrighted work, those who publish it or incorporate it into another project may still be held liable for intellectual property infringement.
“Generative AI systems often spit out recognizable and protectable IP assets as output,” Weigensberg said. “Simpler systems that produce text or static images often struggle to prevent copyrighted material from being produced in the output, so more complex systems may well have the same problem, whatever the programmers' intentions.”
This is a sample from Sora. Image credit: OpenAI
Some AI companies have indemnification clauses to cover these situations. However, the clauses often include carve-outs. For example, OpenAI's applies only to corporate customers and not to individual users.
Weigensberg says there are other risks to consider besides copyright, such as trademark infringement.
“The output may also include assets used in connection with marketing and branding, such as recognizable characters in games, creating trademark risk,” he said. “Or the output could pose name, image, and publicity rights risks.”
All of this could be further complicated by the growing interest in world models. OpenAI considers Sora a world model, and one application of world models is essentially generating video games in real time. If these “synthetic” games resemble the content used to train the model, there may be legal issues.
“Training an AI platform on video game sounds, movement, characters, songs, dialogue, and artwork constitutes copyright infringement, just as it would if these elements were used in any other context,” said Avery Williams, an IP trial attorney at McKool Smith. “The questions surrounding fair use raised by so many lawsuits against generative AI companies will impact the video game industry as well as other creative markets.”