Though many developers want to build models that see the world for what it is, these systems are still in their very early stages.
Progress has been made on visual understanding, but humans have more than one sense. For a world model to be truly complete, it must also develop a strong understanding of audio, touch and physical interaction. The ideal world model not only understands all of those modalities but can also create simulations in any of them. “If a modality is missing, the simulation will always be incomplete,” said Liu.
Creating an AI that understands all of those modalities would mean creating a model that senses and interprets the world much as a human does. But doing so comes with significant technical barriers, including the need for substantial amounts of complex training data and, potentially, entirely new model architectures.
But overcoming those barriers could have far-reaching implications, said Liu.
In robotics, these models could eliminate the need for intensive monitoring and training, limiting “real-world trial and error,” Liu said. Instead, the models that operate robots could be trained in simulations, perfecting actions and uncovering mistakes before the robots ever reach factory floors or homes. In self-driving cars, meanwhile, a world model could allow an autonomous vehicle to rehearse thousands of traffic scenarios before the rubber hits the road.
And the possibilities extend beyond the self-piloted machines available today, with research underway in domains such as sports strategy, to simulate player outcomes, and animation and digital art, to design and create worlds, said Liu. More discoveries could emerge once these models are actually in people’s hands.
“In the end, it’s about creating AI that doesn’t just react to the world but can think ahead,” Liu said.