No one knows yet what generative video models are good for, but that hasn't stopped companies like Runway, OpenAI, and Meta from pouring millions of dollars into their development. Meta's latest is called Movie Gen, and as the name suggests, it turns text prompts into relatively realistic videos, complete with sound, though thankfully no speech just yet. And wisely, the company isn't releasing this one to the public.
Movie Gen is actually a collection of foundation models (or a “cast,” as Meta calls it), the largest of which is the text-to-video component. Meta claims it outperforms the likes of Runway's Gen3, LumaLabs' latest work, and Kling1.5, though as always, this kind of thing is less about Movie Gen winning than about showing that Meta is playing the same game. The technical details can be found in the paper Meta published describing all the components.
The audio is generated to match the content of the video: the sound of an engine that tracks a car's movements, the rush of a waterfall in the background, or a crack of thunder partway through the video when it's called for. It will even add music if that seems relevant.
It was trained on a “combination of licensed and publicly available datasets” that Meta calls “proprietary/commercially sensitive” and declines to detail further. Our best guess is that this means a large amount of Instagram and Facebook video, plus some partner material and a lot of other videos that are inadequately protected from scrapers, i.e. “publicly available.”
But Meta's clear goal here is not just to hold the “state of the art” crown for a month or two; it's a practical, soup-to-nuts approach in which a solid final product can be produced from a very simple, natural-language prompt. Something like, “Imagine me as a baker making a shiny hippo cake in a thunderstorm.”
For example, one sticking point for these video generators has been how difficult they usually are to edit. If you ask for a video of someone walking across the street and then realize you want them walking right to left instead of left to right, there's a good chance the whole shot will look different when you repeat the prompt with that added instruction. Meta is adding a simple text-based editing method: just say “change the background to a busy intersection” or “change her outfit to a red dress” and it will attempt to make that change, and only that change.
Image credit: Meta
Camera movements are also generally understood, with things like “tracking shot” and “pan left” taken into account when generating the video. This is still pretty clumsy compared with real camera control, but it's a lot better than nothing.
The model's limitations are a little weird. It produces video 768 pixels wide, a dimension most will recognize from the famous but outdated 1024×768, but which is also three times 256, so it plays well with other HD formats. The Movie Gen system upscales this to 1080p, which is the source of the claim that it generates that resolution. That isn't really true, but upscaling is surprisingly effective, so I'll give it a pass.
Oddly, it generates up to 16 seconds of video… at 16 frames per second, a frame rate no one in history has ever wanted or asked for. You can, however, also do 10 seconds of video at 24 FPS. Lead with that one!
As for why there's no speech… well, there are likely two reasons. First, it's super hard. Generating speech is easy now, but matching it to lip movements, and those lips to facial movements, is a much more complicated proposition. I don't blame them for leaving it until later, since it would be an instant failure case. Someone could ask for “a clown delivering the Gettysburg Address while riding around in circles on a tiny bike,” nightmare fuel primed to go viral.
The second reason is likely political: releasing what amounts to a deepfake generator a month before a major election is… not great for optics. Crimping its capabilities a bit, so that malicious actors would have to put in some real work to misuse it, is a practical precaution. You certainly could combine this generative model with a speech generator and an open lip-syncing model, but you can't just have it generate a candidate making wild claims.
“Movie Gen is purely an AI research concept at this point, and even at this early stage, safety is our top priority, as with all of our generative AI technology,” a Meta representative said in response to TechCrunch's questions.
Unlike, say, the Llama large language models, Movie Gen won't be publicly available. Its techniques can be reproduced to some extent by following the research paper, but the code won't be published, except for the “underlying evaluation prompt dataset,” that is, the record of the prompts used to generate the test videos.