OpenAI is moving into video generation, following in the footsteps of startups like Runway and tech giants like Google and Meta.
OpenAI today announced Sora, a GenAI model that creates videos from text. According to OpenAI, given a brief (or detailed) description or a still image, Sora can generate 1080p movie-like scenes with multiple characters, different types of motion, and background details.
Sora can also “enhance” existing video clips, doing its best to fill in any missing details.
“Sora has a deep understanding of language, allowing it to accurately interpret prompts and generate engaging characters that express vivid emotions,” OpenAI wrote in a blog post. “The model understands not only what the user asks for in the prompt, but also how those things exist in the physical world.”
Now, OpenAI's Sora demo page is full of hyperbole, and the description above is an example of it. But the carefully selected samples from the model do look pretty impressive, at least compared to the other text-to-video technologies we've seen so far.
For one, Sora can generate videos of up to a minute long in a range of styles (photorealistic, animated, black and white, etc.), which is far longer than most text-to-video models can manage. And these videos stay reasonably consistent in the sense that they don't always succumb to what I call “AI weirdness,” such as objects moving in physically impossible directions.
Take this tour of an art gallery, all generated by Sora (ignore the graininess; that's compression from a video-to-GIF conversion tool):
Or this animation of flowers blooming:
Sora's videos with humanoid subjects (such as a robot standing against a cityscape or a person walking down a snowy road) have a video game-like quality to them, probably because there isn't much happening in the background. Other AI weirdness sneaks into many of the clips, too, like a car driving in one direction that suddenly reverses or an arm melting into a duvet cover.
OpenAI admits that, for all its strengths, the model isn't perfect. It writes:
“[Sora] may struggle to accurately simulate the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward the cookie may not have a bite mark. The model may also confuse the spatial details of a prompt, for example mixing up left and right, and may struggle to accurately describe events that unfold over time, such as following a specific camera trajectory.”
OpenAI, which is positioning Sora as a research preview, is revealing little about what data was used to train the model (beyond about 10,000 hours of “high quality” video) and is holding off on making Sora generally available, citing the potential for abuse. OpenAI rightly points out that bad actors could exploit a model like Sora in myriad ways.
OpenAI says it is working with experts to probe the model for exploits and building tools to detect whether a video was generated by Sora. The company also says that, should it choose to build the model into a public-facing product, it will ensure that provenance metadata is included in the generated outputs.
“We will engage policymakers, educators, and artists around the world to understand their concerns and identify positive use cases for this new technology,” OpenAI wrote. “Despite extensive research and testing, we cannot predict all the beneficial ways people will use our technology, nor all the ways people will misuse it. That's why we believe that learning from real-world use is a key element of creating and releasing increasingly safe AI systems over time.”