When AI learns to think in three dimensions: world models, robotics, and the physics problem
World models could revolutionise robotics by teaching AI physics. But data bottlenecks and infrastructure limits delay real-world deployment.
When Yann LeCun left Meta in November 2025 to launch his own AI world model startup, reportedly seeking a $3.5 billion valuation, it signalled something fundamental: one of AI’s architects believes the next breakthrough won’t come from bigger language models. Instead, world models for robotics and physical AI represent the new frontier – teaching machines to think in three dimensions.
Large language models transformed how we work with abstract knowledge. They’re brilliant wordsmiths – eloquent, knowledgeable, capable of remarkable reasoning. But as Fei-Fei Li puts it, they remain “wordsmiths in the dark” – ungrounded in physical reality. LLMs can explain quantum physics, but can’t tell you how far apart two objects are in an image. They can write screenplays, but can’t mentally rotate a cube.
The next wave is world models: AI systems that understand how the physical world works, predict what happens when objects interact, and reason about space and time. This architectural shift – from predicting the next word to simulating physics – is what makes physical AI possible. It’s also why robotics remains so stubbornly difficult to scale.
As LeCun told the Big Technology podcast earlier in 2025, “we are not going to get to human-level AI just by scaling LLMs” – because they simply predict text rather than truly understanding the world.
From words to worlds
World models learn by creating internal simulations of how reality operates. Rather than generating every pixel or predicting every detail, they build abstract representations of physical dynamics. Think of it as the difference between describing how a ball bounces versus understanding the physics well enough to predict where it will land.
The technical foundation is built on several converging approaches. Yann LeCun’s Joint Embedding Predictive Architecture (JEPA) proposes systems that predict abstract representations rather than raw outputs. Meta’s V-JEPA demonstrates this by predicting masked portions of video not as pixels, but as higher-level concepts – what’s happening, not what it looks like.
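The idea of predicting in representation space rather than pixel space can be sketched in a few lines. This is an illustrative toy, not Meta's implementation: the "encoder" and "predictor" here are random frozen matrices standing in for learned networks, and the 16-pixel patches stand in for video regions.

```python
import random

random.seed(0)

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Toy "video": a visible context patch and a masked target patch (16 pixels each).
context = [random.gauss(0, 1) for _ in range(16)]
target = [random.gauss(0, 1) for _ in range(16)]

# Shared encoder (frozen stand-in for a learned network): 16 pixels -> 4 latents.
encoder = [[random.gauss(0, 0.25) for _ in range(16)] for _ in range(4)]
# Predictor maps the context latent to a predicted target latent (4 -> 4).
predictor = [[random.gauss(0, 0.5) for _ in range(4)] for _ in range(4)]

ctx_latent = matvec(encoder, context)
tgt_latent = matvec(encoder, target)
pred_latent = matvec(predictor, ctx_latent)

# JEPA-style objective: match 4 abstract numbers, not 16 raw pixels.
latent_loss = mse(pred_latent, tgt_latent)
# What a pixel-reconstruction objective would compare instead:
pixel_loss = mse(context, target)

print(len(pred_latent), "latent dims vs", len(context), "pixels")
```

The point of the design is visible in the dimensions: the loss is computed over a small abstract representation, so the model is never asked to reproduce pixel-level detail it doesn't need in order to understand what is happening.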
Fei-Fei Li frames this as “spatial intelligence” – the ability to perceive, reason about, and interact with physical spaces in three dimensions. Her company, World Labs, released Marble, which generates explorable 3D environments from images or text. Nvidia’s Cosmos platform provides world models specifically for training physical AI systems. Google DeepMind’s Genie 3 generates interactive virtual worlds from text prompts that can be explored in real time.
The investment signals conviction. World Labs raised over $230 million. Nvidia, Google, and Meta all released world model platforms.
Why robots can’t just read the manual
Here’s the connection most coverage misses: robotics has always been bottlenecked by the inability to accurately simulate the real world. Robotics systems trained in pristine simulations consistently fail when deployed in messy reality. This is the “sim-to-real gap” – the performance degradation that occurs when you transfer learned behaviours from simulation to physical hardware.
The gap exists because simulation makes compromises.
- It simplifies physics, particularly friction, deformation, and contact dynamics.
- It models sensors imperfectly.
- It assumes predictable environments when reality is inherently noisy and variable.
A robot that learns to grasp objects in simulation might fail on a real factory floor because simulated friction coefficients don’t match those of actual materials, lighting varies, or objects aren’t positioned with millimetre precision.
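The friction example can be made concrete with a back-of-the-envelope model. All numbers here are illustrative assumptions: a grasp holds when friction force (coefficient × grip force) exceeds the object's weight, and a policy tuned against an optimistic simulated coefficient fails when the real one is lower.

```python
def min_grip_force(weight_n, mu):
    """Smallest grip force (N) that keeps an object of given weight from slipping."""
    return weight_n / mu

SIM_MU, REAL_MU = 0.6, 0.4   # assumed friction coefficients (illustrative only)
weight = 5.0                  # a 5 N object

# Policy "learned" in simulation: apply just enough force, plus a 5% margin.
grip = 1.05 * min_grip_force(weight, SIM_MU)

holds_in_sim = grip * SIM_MU >= weight
holds_in_real = grip * REAL_MU >= weight

print(holds_in_sim, holds_in_real)  # True False
```

A 5% safety margin that comfortably covers simulated conditions is wiped out by a 33% error in one physical constant – which is why hand-coded physics parameters are such a fragile foundation.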
World models promise to narrow this gap by learning more accurate internal simulators. Rather than hand-coding physics parameters, they learn them from data – specifically, video data showing how objects actually behave. 1X’s world model for its Neo humanoid can watch videos and learn new tasks it wasn’t explicitly programmed to perform. The robot learns by observation, like a human might, rather than requiring explicit programming for each new skill.
From vision to movement
But there’s a fundamental data problem. We have billions of images and hours of video for training vision models. For robots – particularly humanoid robots with dozens of joints – we lack corresponding “action data” that maps what movements produce what physical outcomes. As Vincent Sitzmann, an MIT assistant professor and World Labs researcher, explained, self-driving cars have limited inputs (steering, throttle, brakes), making it feasible to collect millions of hours of matched video and action data. Humanoid robots have exponentially more degrees of freedom and vastly less training data:
A humanoid robot has all these other joints and actions that they can take. And we don’t have data for that.
World models trained on passive video can help robots understand what should happen, but connecting that understanding to how the robot’s motors should move remains unsolved. You can teach a robot to recognise a cup; teaching it the precise sequence of joint movements to grasp that cup reliably across variations in cup size, weight, material, and position is a different order of difficulty.
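The degrees-of-freedom point can be made concrete with a rough count. The channel numbers below are ballpark illustrative figures, not specs for any particular vehicle or robot: discretise each control channel into five levels and count the distinct actions available at a single timestep.

```python
LEVELS = 5              # coarse discretisation of each control channel
car_channels = 3        # steering, throttle, brake
humanoid_channels = 28  # a common ballpark for humanoid joint count

car_actions = LEVELS ** car_channels            # 125 distinct actions per step
humanoid_actions = LEVELS ** humanoid_channels  # roughly 3.7e19

print(f"{car_actions:,} vs {humanoid_actions:.1e} actions per timestep")
```

The action space grows exponentially with the number of joints, so the matched video-and-action datasets that are feasible for driving become vastly harder to collect at useful coverage for a humanoid.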
The engineering reality
This is where physical AI hits the same wall as every other AI application: infrastructure constraints, data scarcity, and the unglamorous work of engineering at scale.
The most immediate applications of world models won’t be robotics. They’ll be gaming, where fidelity matters less than creative possibility. PitchBook predicts the market for world models in gaming could reach $276 billion by 2030, driven by the ability to generate interactive environments. When realism is aesthetic rather than functional, the sim-to-real gap doesn’t exist.
For world models in robotics applications, the timeline extends further. Creative tools are emerging now. Robotics represents what Li calls a “mid-term horizon” as systems refine the loop between perception and action. The most transformative scientific applications – drug discovery simulations, medical training environments – remain years away, requiring not just visual realism but physical accuracy.
As Li writes in her spatial intelligence manifesto:
Without spatial intelligence, AI is disconnected from the physical reality it seeks to understand. It cannot effectively drive our cars, guide robots in our homes and hospitals, or accelerate discovery in materials science and medicine.
The data advantage
The challenge mirrors the infrastructure constraints we’ve seen across AI deployment. Just as AI progress shifted from algorithmic innovation to compute availability, robotics now confronts similar physical limits. Better world models help, but they don’t eliminate the need for real-world data collection, safety testing, and iterative refinement. Domain randomisation – training on vast variations of simulated conditions – improves robustness but makes learning harder and more computationally expensive.
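Domain randomisation itself is simple to sketch. In this toy 1-D example (all physics and parameter ranges are illustrative assumptions), each training episode samples its own friction and mass, so a policy trained across the episodes can never overfit to one possibly-wrong set of constants:

```python
import random

random.seed(0)

def simulate_push(friction, mass, force=2.0, dt=0.1, steps=10):
    """Toy 1-D simulator: push a block along a surface, return distance slid (m)."""
    g = 9.81
    accel = max(force / mass - friction * g, 0.0)  # net acceleration after friction
    v = x = 0.0
    for _ in range(steps):
        v += accel * dt
        x += v * dt
    return x

# Domain randomisation: sample physics parameters fresh for every episode.
episodes = [
    simulate_push(friction=random.uniform(0.05, 0.3),
                  mass=random.uniform(0.5, 2.0))
    for _ in range(1000)
]

spread = max(episodes) - min(episodes)
print(f"outcome spread across randomised worlds: {spread:.2f} m")
```

The spread in outcomes is exactly the trade-off the paragraph describes: the policy must succeed across all of these worlds, which makes it more robust to reality but also makes the learning problem harder and more expensive.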
The robotics companies scaling fastest aren’t the ones with the cleverest algorithms. They’re the ones collecting proprietary action datasets: Tesla capturing billions of miles of driving data, Figure gathering humanoid manipulation sequences, Boston Dynamics logging decades of locomotion experience. The intellectual property isn’t just the model architecture; it’s the data pipeline connecting simulation to reality.
From demos to deployment
What does this mean for the physical AI wave being announced at every conference?
First, expect diverging timelines. World model applications requiring only visual plausibility will proliferate quickly. Architectural visualisation, game design, VR content creation – these deploy now. Applications requiring physical accuracy operate on longer cycles. An architect can work with a 3D model that’s 90% accurate; a surgical training simulator cannot.
Second, recognise that the bottleneck has shifted. Ten years ago, the challenge was algorithmic: we couldn’t build systems that learned from visual data. Five years ago, it was computational: we couldn’t train large enough models. Today, it’s data: world models for robotics need action data we don’t yet have at scale.
Third, understand that simulation remains essential but insufficient. World models don’t eliminate the need for real-world testing; they make simulation more useful by reducing the sim-to-real gap. But “reducing” isn’t “eliminating.” Every robotics deployment still requires extensive real-world validation, iterative refinement, and acknowledgement that edge cases will always emerge that no simulation anticipated.
The transformation from language models to world models represents genuine architectural progress. AI that understands spatial relationships, predicts physical interactions, and reasons about three-dimensional environments unlocks applications that pure language models cannot approach.
But as with every previous wave, the infrastructure to deploy these capabilities lags behind the algorithms to generate them. Training robots in simulated factories is possible. Generating photorealistic 3D environments from text is achievable. Building systems that predict physical dynamics from video works in the lab.
Scaling those capabilities to millions of deployed robots operating reliably in uncontrolled environments? That’s still an engineering problem – one measured in decades of data collection, billions in infrastructure investment, and thousands of edge cases discovered only through deployment. The physics of the real world, it turns out, remains harder to model than the physics of language.
Sources for deeper technical reading:
- Yann LeCun: A Path Towards Autonomous Machine Intelligence
- Meta: V-JEPA and the future of world models
- Fei-Fei Li: From Words to Worlds – Spatial Intelligence
- Scientific American: World Models Could Unlock the Next AI Revolution
- MIT: Sim-to-Real Transfer in Robotics
- ArXiv: Survey of Sim-to-Real Transfer in Deep RL for Robotics
Photo by Developer_Console on Pixabay