World models, also known as world simulators, are being touted by some as the next big thing in AI.
AI pioneer Fei-Fei Li’s World Labs has raised $230 million to build “large world models,” and DeepMind hired one of the creators of OpenAI’s video generator, Sora, to work on “world simulators.” (Sora was released on Monday; here are some early impressions.)
But what the heck are these things?
World models take inspiration from the mental models of the world that humans develop naturally. Our brains take abstract representations from our senses and form them into a more concrete understanding of the world around us, producing what we called “models” long before AI adopted the phrase. The predictions our brains make based on these models influence how we perceive the world.
A paper by AI researchers David Ha and Jürgen Schmidhuber gives the example of a baseball batter. Batters have milliseconds to decide how to swing their bat, less time than it takes for visual signals to reach the brain. The reason they’re able to hit a 100-mile-per-hour fastball is because they can instinctively predict where the ball will go, Ha and Schmidhuber say.
“For professional players, this all happens subconsciously,” the research duo writes. “Their muscles reflexively swing the bat at the right time and location in line with their internal models’ predictions. They can quickly act on their predictions of the future without the need to consciously roll out possible future scenarios to form a plan.”
It’s these subconscious reasoning aspects of world models that some believe are prerequisites for human-level intelligence.
Modeling the world
While the concept has been around for decades, world models have gained popularity recently in part because of their promising applications in the field of generative video.
Most, if not all, AI-generated videos veer into uncanny valley territory. Watch them long enough and something bizarre will happen, like limbs twisting and merging into each other.
While a generative model trained on years of video might accurately predict that a basketball bounces, it doesn’t actually have any idea why, just like language models don’t really understand the concepts behind words and phrases. But a world model with even a basic grasp of why the basketball bounces like it does will be better at showing it do that thing.
To enable this kind of insight, world models are trained on a range of data, including photos, audio, videos, and text, with the intent of creating internal representations of how the world works and the ability to reason about the consequences of actions.
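To make that idea concrete, here is a minimal, hypothetical sketch of the core loop behind many world models: encode an observation into an internal state, predict how that state changes given an action, and check the prediction against what actually happens next. The architecture, dimensions, and random stand-in data are illustrative assumptions, not the approach of World Labs, DeepMind, or any specific system.

```python
# Toy world-model sketch (assumptions throughout): learn to predict the
# consequences of an action by modeling how an internal "state of the world"
# evolves, rather than predicting raw pixels frame by frame.
import torch
import torch.nn as nn

OBS_DIM, ACTION_DIM, LATENT_DIM = 64, 4, 16  # placeholder sizes

encoder = nn.Linear(OBS_DIM, LATENT_DIM)                    # observation -> internal state
dynamics = nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)   # (state, action) -> next state
decoder = nn.Linear(LATENT_DIM, OBS_DIM)                    # internal state -> predicted observation

params = list(encoder.parameters()) + list(dynamics.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# Stand-in training batch: (observation, action taken, observation that followed).
obs = torch.randn(32, OBS_DIM)
action = torch.randn(32, ACTION_DIM)
next_obs = torch.randn(32, OBS_DIM)

for step in range(100):
    state = encoder(obs)
    predicted_next_state = dynamics(torch.cat([state, action], dim=-1))
    predicted_next_obs = decoder(predicted_next_state)

    # The model is judged on whether its imagined "next frame" matches reality.
    loss = nn.functional.mse_loss(predicted_next_obs, next_obs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In real systems the encoder, dynamics model, and decoder are far larger and trained on enormous multimodal datasets, but the principle is the same: the model is rewarded for correctly anticipating what the world will do next.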
High hurdles
While the concept is enticing, many technical challenges stand in the way.
Training and running world models requires massive compute power, even compared to the amount currently used by generative models. While some of the latest language models can run on a modern smartphone, Sora (arguably an early world model) would require thousands of GPUs to train and run, especially if their use becomes commonplace.
World models, like all AI models, also hallucinate, and internalize biases in their training data. A world model trained largely on videos of sunny weather in European cities might struggle to comprehend or depict Korean cities in snowy conditions, for example, or simply do so incorrectly.
A general lack of training data threatens to exacerbate these issues, says Mashrabov.
“We have seen models being really limited with generations of people of a certain type or race,” he said. “Training data for a world model must be broad enough to cover a diverse set of scenarios, but also highly specific to where the AI can deeply understand the nuances of those scenarios.”
In a recent post, AI startup Runway’s CEO, Cristobal Valenzuela, says that data and engineering issues prevent today’s models from accurately capturing the behavior of a world’s inhabitants (e.g. humans and animals). “Models will need to generate consistent maps of the environment,” he said, “and the ability to navigate and interact in those environments.”