Google is taking aim at OpenAI's Sora with Veo, an AI model that can create 1080p video clips roughly a minute long when given a text prompt.
Veo, announced Tuesday at Google's I/O 2024 developer conference, can capture a range of visual and cinematic styles, including landscape and time-lapse shots, and can make edits and adjustments to footage it has already generated.
“We're exploring features like storyboarding and generating longer scenes to see what Veo can do,” Demis Hassabis, head of Google's AI research and development lab DeepMind, told reporters during a virtual roundtable. “We've made incredible progress on video.”
Image credit: Google
Veo builds on Google's preliminary commercial work in video generation, previewed in April, which leveraged the company's Imagen 2 family of image generation models to create looping video clips.
But unlike Imagen 2-based tools, which could only create low-resolution videos a few seconds long, Veo appears to be competitive with today's leading video generation models: not just Sora, but also models from startups like Pika, Runway, and Irreverent Labs.
During the briefing, Douglas Eck, who leads research on generative media at DeepMind, showed a few select examples of what Veo can do. In particular, he said, an aerial shot of a busy beach demonstrated Veo's strengths over competing video models.
“Rendering the detail of all the swimmers on the beach has proven difficult for both image- and video-generating models, because there are so many moving characters,” he said. “If you look closely, the surf looks pretty good. And I'd argue the sense of the prompt word 'busy' is captured by all the people along the vibrant, sunbather-filled shore.”
Image credit: Google
Veo was trained on lots of footage. That's how generative AI models generally work: fed example after example of some form of data, the models pick up on patterns in that data that let them generate new data (video, in Veo's case).
Where did the footage to train Veo come from? Eck wouldn't say exactly, but he acknowledged that some may have been sourced from Google's own YouTube.
“Google models may be trained on some YouTube content, but always according to agreements with YouTube creators,” he said.
The “agreement” part may be technically correct. But it's also true that given YouTube's network effects, creators have no choice but to follow Google's rules if they want to reach the widest possible audience.
Image credit: Google
An April report in the New York Times revealed that Google broadened its terms of service last year, in part to allow it to use more data to train its AI models. Under the old terms, it wasn't clear whether Google could use YouTube data to build products beyond the video platform. The new terms loosen the reins considerably.
Google isn't the only tech giant leveraging vast amounts of user data to train its internal models. (See: Meta.) But what is sure to disappoint some creators is Eck's claim that Google sets the “gold standard” when it comes to ethics.
“The solution to these [training data] challenges will be found by bringing all the stakeholders together to figure out the next steps,” he said. “Until we've taken those steps with the stakeholders – the film industry, the music industry, the artists themselves – we won't move fast.”
But Google has already made Veo available to select creators, including Donald Glover (aka Childish Gambino) and his creative agency, Gilga. (Like OpenAI with Sora, Google is positioning Veo as a tool for creatives.)
Eck noted that Google provides tools to let webmasters prevent the company's bots from scraping training data from their websites. However, those settings don't apply to YouTube. And unlike some of its rivals, Google doesn't offer a mechanism for creators to remove their work from its training data sets after it has been scraped.
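For context, the opt-out Eck is referring to works through the robots.txt standard: Google documents a “Google-Extended” user agent token that site owners can disallow to keep their pages out of training for Google's AI models (Google didn't say at the briefing whether Veo's data collection is governed by the same token). A site blocking it entirely would add two lines to its robots.txt:

User-agent: Google-Extended
Disallow: /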
I also asked Eck about regurgitation, which in the context of generative AI refers to a model generating a mirror copy of one of its training examples. Tools like Midjourney have been found to spit out exact stills from films like “Dune,” “The Avengers,” and “Star Wars” when given a timestamp, laying potential legal landmines for users. OpenAI has reportedly gone so far as to block trademarks and creator names in prompts for Sora in an attempt to fend off copyright infringement claims.
So what steps has Google taken to reduce the risk of regurgitation with Veo? Eck didn't have an answer, other than to say that the research team has implemented filters for violent and explicit content (so, no porn) and is using DeepMind's SynthID technology to mark videos from Veo as AI-generated.
Image credit: Google
“For something as big as the Veo model, we'll focus on gradually releasing it to a small set of stakeholders we can work with very closely to understand the model's implications, and only then roll it out to a larger group,” he said.
Eck had more to share about the model's technical details.
Eck described Veo as “fairly controllable” in the sense that the model understands camera movements and visual effects reasonably well from prompts (think descriptors like “pan,” “zoom,” and “explosion”). And like Sora, Veo has some understanding of physics, such as fluid dynamics and gravity, which contributes to the realism of the videos it produces.
Veo also supports masked editing to change specific areas of a video, and, like generative models such as Stability AI's Stable Video, can generate videos from a still image. Perhaps most intriguingly, given a sequence of prompts that together tell a story, Veo can generate longer videos: over a minute in length.
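For illustration (this is a hypothetical sequence, not one Google shared), a storyboard-style input for a longer clip might read something like: “A drone shot gliding over a crowded beach at sunrise,” then “Pan down to a surfer paddling out through the waves,” then “Zoom in as the surfer catches a wave in slow motion,” with each prompt extending the scene while keeping it consistent.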
Image credit: Google
That's not to say Veo is perfect. Reflecting the limitations of today's generative AI, objects in Veo's videos disappear and reappear without much explanation or consistency, and Veo often gets its physics wrong. For example, cars will inexplicably, impossibly reverse on a dime.
That's why, for the foreseeable future, Veo will remain behind a waitlist at Google Labs, the company's portal for experimental technology, inside a new front end for generative AI video creation and editing called VideoFX. As the model improves, Google aims to bring some of its capabilities to YouTube Shorts and other products.
“This is very much a work in progress, very experimental… there's a lot more to be done than has been done here,” Eck said. “But I think this is kind of the raw material for doing really great things in the field of filmmaking.”