World models (also known as world simulators) are being touted by some as the next big thing in AI.
AI pioneer Fei-Fei Li's World Labs has raised $230 million to build “large-scale world models,” and DeepMind has hired one of the creators of OpenAI's video generator, Sora, to work on “world simulators.” (Sora was released on Monday.)
But what exactly are these?
World models are inspired by the mental models of the world that humans naturally develop. Our brains take abstract representations from our senses and shape them into a more concrete understanding of the world around us, producing what we called “models” long before AI adopted the term. The predictions our brains make based on these models influence how we perceive the world.
A paper by AI researchers David Ha and Jürgen Schmidhuber uses the example of a baseball batter. A batter has only milliseconds to decide how to swing the bat, which is less than the time it takes for visual signals to reach the brain. Batters can hit a 100 mph fastball, Ha and Schmidhuber say, because they can instinctively predict where the ball will go.
“For professional players, this is all happening subconsciously,” the researchers wrote. “Their muscles reflexively swing the bat at the right time and place, according to the predictions of their internal models. They don't have to consciously roll out and plan for possible future scenarios; they can act quickly on their predictions.”
Some believe that these subconscious reasoning aspects of world models are a prerequisite for human-level intelligence.
Modeling the world
Although the concept has been around for decades, world models have recently gained popularity, in part due to their promising applications in the field of generative video.
Most, if not all, AI-generated videos veer into uncanny valley territory. Watch them long enough and something strange will happen, such as limbs twisting and merging into each other.
A generative model trained on years of video may accurately predict that a basketball will bounce, but it doesn't actually know why, just as a language model doesn't really understand the concepts behind words and phrases. A world model with even a basic grasp of why a basketball bounces the way it does would be better at showing that movement.
To enable this kind of insight, world models are trained on a range of data, including images, audio, video, and text.
A sample from AI startup Runway's Gen-3 video generation model. Image credit: Runway
“Viewers expect the world they see to behave in the same way as their reality,” said Alex Mashrabov, former head of AI at Snap and CEO of Higgsfield, which builds video generation models. “When a feather falls with the weight of an anvil or a bowling ball flies hundreds of feet into the air, it is jarring and takes the viewer out of the moment. Rather than defining how each object is expected to move (which is tedious and a poor use of time), the model will figure this out.”
But better video generation is just the tip of the iceberg for world models. Researchers, including Meta's chief AI scientist Yann LeCun, say the models could one day be used for advanced prediction and planning in both the digital and physical realms.
In a talk earlier this year, LeCun explained how a world model could help achieve a desired goal through reasoning. A model with a basic representation of a “world” (such as a video of a dirty room), given a goal (a clean room), could come up with a series of actions to achieve that goal (deploying a vacuum cleaner, washing the dishes, emptying the trash can). It would do so not because that is a pattern it has observed, but because it knows on a deeper level how to go from a dirty state to a clean one.
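The idea of planning toward a goal with a world model can be illustrated with a toy sketch. This is not LeCun's method or any real system: the room state, the action names, and the `predict` and `plan` functions are all hypothetical, and the "world model" here is a hand-written rule rather than something learned from video. It only shows the general shape of the idea: the planner queries a model that predicts the consequence of each action, then searches for a sequence that reaches the goal state.

```python
from collections import deque

# Hypothetical toy world: the state is the set of messes still in the room,
# and each action removes exactly one kind of mess.
ACTIONS = {
    "vacuum_floor": "dusty_floor",
    "wash_dishes": "dirty_dishes",
    "empty_trash": "full_trash",
}

def predict(state, action):
    """Toy world model: predict the state that results from taking `action`."""
    return state - {ACTIONS[action]}

def plan(start, goal):
    """Breadth-first search over predicted states.

    Returns a shortest list of actions leading from `start` to `goal`,
    or None if the goal cannot be reached.
    """
    start, goal = frozenset(start), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return steps
        for action in ACTIONS:
            nxt = predict(state, action)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [action]))
    return None

dirty_room = {"dusty_floor", "dirty_dishes", "full_trash"}
clean_room = set()
steps = plan(dirty_room, clean_room)
print(steps)
```

The planner never sees an example of a cleaning routine. It arrives at one only by asking the model what each action would change, which is the distinction the dirty-room example draws between knowing how to get from one state to another and repeating an observed pattern.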
“We need machines that understand the world; [machines] that can remember things, that have intuition and common sense, and that can reason and plan at the same level as humans,” LeCun said. “Despite what you may have heard from the most enthusiastic people, current AI systems simply don't have these capabilities.”
LeCun estimates that the world models he envisions are at least a decade away, but today's world models show promise as elementary physics simulators.
Sora controls players and renders the world in Minecraft. Image credit: OpenAI
In a blog post, OpenAI notes that Sora, which the company considers a world model, can simulate actions such as a painter leaving brush strokes on a canvas. Models like Sora, and Sora itself, can also effectively simulate video games. For example, Sora can render a Minecraft-like UI and game world.
World Labs co-founder Justin Johnson said on an episode of the a16z podcast that future world models may be able to generate 3D worlds on demand for things like gaming and virtual photography.
“We already have the ability to create virtual, interactive worlds, but that takes hundreds of millions of dollars and tons of development time,” Johnson said. “[World models] will let you get not just images and clips, but fully simulated, vibrant, interactive 3D worlds.”
High hurdles
Although this concept is attractive, many technical challenges stand in the way.
Training and running world models requires an enormous amount of computational power, even compared to the amount currently used by generative models. While some modern language models can run on a modern smartphone, Sora (arguably an early world model) would require thousands of GPUs to train and run, especially if its use becomes commonplace.
Like all AI models, world models also hallucinate and internalize biases in their training data. A world model trained primarily on videos of sunny weather in European cities might struggle to understand or depict, for example, a snowy Korean city, or it might simply depict it incorrectly.
Mashrabov said a general lack of training data could exacerbate these problems.
“We've seen models be very limited when generating people of certain types or ethnicities,” he said. “The training data for a world model needs to be broad enough to cover a wide variety of scenarios, but also specific enough that the AI can deeply understand the nuances of those scenarios.”
Cristóbal Valenzuela, CEO of AI startup Runway, said in a recent post that data and engineering issues prevent today's models from accurately capturing the behavior of the world's inhabitants, including humans and animals. “Models will need to produce consistent maps of their environments, and they will also need the ability to move and interact within those environments,” he said.
Video created by Sora. Image credit: OpenAI
But if all major hurdles can be overcome, Mashrabov believes world models could bridge AI and the real world “more robustly,” leading to breakthroughs not only in virtual world generation but also in robotics and AI decision-making.
They could also lead to more capable robots.
Today's robots are limited in what they can do because they have no awareness of the world around them (or of their own bodies). World models could give them that awareness, Mashrabov said, at least to some extent.
“With advanced world models, AI could develop a personal understanding of what scenarios it is in and begin to reason about possible solutions,” he said.