World Models and Real-World Intelligence · Yann LeCun

2026-06-11 · A faithful, transcript-grounded reading by PodLens

Original episode: https://youtu.be/72Xj8k5WQX4?si=eVD7EfrtsPRE4sOC

World Models · Self-Supervised Learning · Embodied Intelligence · Energy-Based Models · JEPA

What This Episode Is About

In this lecture, Yann LeCun argues that World Models are the key enabler of the next AI revolution. He contends that current Large Language Models (LLMs) built on auto-regressive architectures are fundamentally ill-suited to the high-dimensional, continuous, and noisy data of the real physical world, and therefore cannot acquire human- or animal-level physical common sense or autonomous planning capabilities. He explains how Joint Embedding Predictive Architecture (JEPA) and Energy-Based Models (EBM) enable non-generative self-supervised learning that avoids pixel-level generation in high-dimensional spaces, and how representation collapse can be prevented through information maximization (such as the SIGREG algorithm) and distillation methods. Finally, he introduces his newly founded company, Emmy Labs, which aims to apply Physical AI and control technologies to robotics and industrial processes.

Core Viewpoints List

  1. Current mainstream machine learning architectures face severe bottlenecks in sample efficiency and common sense acquisition
     * Evidence Anchor: [00:00 - 00:38]
     * Type: Fact
     * Note: Compared to humans and animals, machine learning models learn extremely slowly when facing new tasks and lack zero-shot adaptation capabilities.
  2. Text-only training cannot bring about true human-level intelligence
     * Evidence Anchor: [10:50 - 11:52]
     * Type: Opinion
     * Note: The 20-30 trillion tokens consumed in LLM training would take a human about 400,000 years to read, yet a four-year-old child takes in a comparable amount of information about the physical world through vision alone. Pure text lacks the grounding in the physical world that embodied common sense requires.
  3. The essence of intelligence is adaptability rather than the accumulation of declarative knowledge
     * Evidence Anchor: [04:12 - 05:24]
     * Type: Opinion
     * Note: Drawing on Jean Piaget's ideas, intelligence is not the accumulation of declarative knowledge or specific skills, but the adaptability to cope with unknown situations and the ability to quickly acquire new skills.
  4. Reasoning based on energy optimization search is more powerful than direct feedforward computation
     * Evidence Anchor: [11:58 - 13:32]
     * Type: Opinion
     * Note: Searching at inference time for the action outputs that minimize an energy function gives a higher computational and reasoning ceiling than a single pass through a feedforward neural network with a fixed number of layers.
  5. World models should not be generative pixel-level prediction systems
     * Evidence Anchor: [19:54 - 21:18]
     * Type: Prediction
     * Note: Redundant information and unpredictability in videos doom pixel-level prediction to failure (it produces blurry predictions or only predicts the mean). The correct path is to make predictions in an abstract Representation Space.
  6. Hierarchical planning is the most central unsolved problem in the current field of agents and robotics
     * Evidence Anchor: [17:26 - 19:54]
     * Type: Opinion
     * Note: Nobody has yet systematically solved how to decompose a long-term macro goal (such as traveling to Paris) into sub-goals at various levels (such as walking to the elevator or pushing the button) that do not need to be planned in fine detail ahead of time.
  7. Preventing representation collapse in self-supervised learning must rely on effective information maximization regularization methods
     * Evidence Anchor: [26:05 - 27:31]
     * Type: Fact
     * Note: In non-contrastive methods, a collapsed encoder that outputs a constant can be avoided by maximizing the information content of the representation vectors, for example by making each dimension independent of the others.
  8. Reinforcement learning is extremely sample-inefficient and should be used sparingly, only once good representations are in place
     * Evidence Anchor: [52:21 - 52:58]
     * Type: Opinion
     * Note: RL sample efficiency is extremely low, so RL should be a last resort when all else fails. Most learning should build a world model through observation, with RL applied at the top level only after excellent representations have been obtained.
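
Viewpoint 4 can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not anything shown in the lecture: the one-dimensional world model `predict_next_state`, the quadratic `energy`, and the finite-difference gradient search are all hypothetical stand-ins. It shows inference as a search over actions for the one with the lowest energy, where extra search steps buy a better answer.

```python
def predict_next_state(state: float, action: float) -> float:
    """Hypothetical toy world model: the action simply shifts the state."""
    return state + action


def energy(state: float, action: float, goal: float) -> float:
    """Low energy = predicted outcome close to the goal, with a small action cost."""
    return (predict_next_state(state, action) - goal) ** 2 + 0.01 * action ** 2


def infer_action(state: float, goal: float, steps: int = 200,
                 lr: float = 0.1, eps: float = 1e-5) -> float:
    """Inference by optimization: descend the energy with respect to the action."""
    action = 0.0
    for _ in range(steps):
        # Central finite difference as a stand-in for automatic differentiation.
        grad = (energy(state, action + eps, goal)
                - energy(state, action - eps, goal)) / (2 * eps)
        action -= lr * grad
    return action


one_step_action = infer_action(state=0.0, goal=1.0, steps=1)
best_action = infer_action(state=0.0, goal=1.0, steps=200)
print("energy after 1 search step:   ", energy(0.0, one_step_action, 1.0))
print("energy after 200 search steps:", energy(0.0, best_action, 1.0))
```

A fixed feedforward pass corresponds to stopping after a preset amount of computation; the search version can keep refining the action for as long as the problem requires.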

Plain English Retelling

Imagine you hired an extremely smart assistant who has absolutely no common sense about everyday life. He has read every book in human history, can recite all the laws of physics, and can even write beautiful term papers. Yet when you hold an apple in front of him and let go, he doesn't know that it will fall unless someone has explicitly written that down in a book. This is the state of today's Large Language Models (LLMs): they possess a massive amount of "declarative knowledge" but know nothing about the physical world.

Yann LeCun issues a sharp warning: stop expecting to reach human-level AI by continuing to scale up LLMs (stacking computing power and feeding in more data). A four-year-old child doesn't need to read 400,000 years' worth of books to learn how to walk and avoid obstacles. Simply by looking at the world during their waking hours, they take in an enormous amount of visual data (about 10^14 bytes), which is comparable to all the text data on the internet. Infants build a "world model" in their brains through passive observation.
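
The orders of magnitude above can be checked with a few lines of arithmetic. The token count, the 400,000 years, and the 10^14 bytes come from the text; everything else in this sketch (3 bytes per token, 16,000 waking hours in four years, about 2 MB/s reaching the visual cortex, and the reading-speed figures) is an assumption added here for illustration and is not stated in this summary.

```python
# Figures taken from the summary.
TOKENS_LOW, TOKENS_HIGH = 20e12, 30e12   # "20-30 trillion tokens"

# Assumptions (not stated in the summary).
BYTES_PER_TOKEN = 3          # rough average size of one token
WAKING_HOURS = 16_000        # roughly four years of waking time
OPTIC_BYTES_PER_SEC = 2e6    # rough visual input rate, about 2 MB/s

text_bytes = TOKENS_HIGH * BYTES_PER_TOKEN
visual_bytes = WAKING_HOURS * 3600 * OPTIC_BYTES_PER_SEC


def reading_years(tokens: float, words_per_token: float = 0.75,
                  words_per_minute: float = 250, hours_per_day: float = 8) -> float:
    """Years needed to read `tokens` tokens at an assumed reading pace."""
    minutes = tokens * words_per_token / words_per_minute
    return minutes / (60 * hours_per_day * 365)


print(f"text:   {text_bytes:.2e} bytes")
print(f"vision: {visual_bytes:.2e} bytes")
print(f"reading time: {reading_years(TOKENS_LOW):,.0f} to "
      f"{reading_years(TOKENS_HIGH):,.0f} years")
```

Under these assumptions both data volumes land near 10^14 bytes and the reading time falls in the hundreds of thousands of years, which is consistent with the figures quoted in the episode.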

What is this world model useful for? It can make "broad-direction predictions." Suppose you are in your office at NYU and want to plan a trip to Paris tomorrow. You don't plan in your head how your muscles should move every microsecond or how many centimeters your left foot should step forward. Instead, your world model gives rough steps: go to the airport, take a plane. The concrete details (going downstairs, waiting for a taxi, pressing the elevator button) are planned dynamically at lower levels as you go. This is hierarchical planning. In contrast, existing generative models, like Sora or various pixel prediction tools, try to precisely predict every single pixel in a video, which is as absurd as simulating the trajectory of every air molecule when designing a space shuttle. What we need is "abstraction": filtering out unimportant pixel noise and keeping only the core structure.
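
The Paris example can be written down as a toy data structure. This is only a sketch: the `SUBGOALS` table is hand-written and hypothetical, whereas the open problem the lecture points to is how a system could learn such a hierarchy by itself. The function refines only the first pending sub-goal at each level, so lower-level details are worked out when they are needed rather than all in advance.

```python
# Hypothetical, hand-written decomposition table (a real system would have to learn it).
SUBGOALS = {
    "be in Paris": ["get to the airport", "take a plane"],
    "get to the airport": ["go downstairs", "take a taxi"],
    "go downstairs": ["walk to the elevator", "press the elevator button"],
}


def next_primitive_step(goal: str) -> list[str]:
    """Return the chain from the top-level goal down to the next executable step.

    Only the first pending sub-goal at each level is expanded; later sub-goals
    stay abstract until their turn comes.
    """
    path = [goal]
    while path[-1] in SUBGOALS:
        path.append(SUBGOALS[path[-1]][0])
    return path


plan = next_primitive_step("be in Paris")
print(" -> ".join(plan))
```

Note that "take a plane" is never expanded here: at planning time it remains a single abstract step.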

The solution Yann LeCun offers is JEPA (Joint Embedding Predictive Architecture). Its brilliance lies in the fact that it does not make predictions in pixel space, but in "representation space." For example, if I show you half of a video and ask you to predict what happens next, JEPA won't try to draw everyone's face and the water cup on the table; instead, it predicts the abstract meaning that "this person will walk toward the podium." This not only saves an immense amount of computing power but also allows the model to capture true physical laws and causal chains.
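
The difference between predicting pixels and predicting representations can be seen in a toy setting. The sketch below is not a trained JEPA: the frame layout (one predictable value followed by random "pixel" noise), the hand-written `encode` function, and the constant-velocity predictor are all assumptions made for illustration, standing in for networks that would be learned in practice. It only shows where the prediction error is measured.

```python
import random

N_NOISE = 16  # number of unpredictable "pixels" per frame (assumption)


def render_frame(position: float, rng: random.Random) -> list[float]:
    """Observation = one predictable value plus unpredictable pixel noise."""
    return [position] + [rng.gauss(0.0, 1.0) for _ in range(N_NOISE)]


def encode(frame: list[float]) -> float:
    """Hypothetical encoder: keeps the predictable part and drops the noise."""
    return frame[0]


def predict_representation(rep: float, velocity: float) -> float:
    """Hypothetical predictor operating purely in representation space."""
    return rep + velocity


def mse(a: list[float], b: list[float]) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


rng = random.Random(0)
velocity = 0.5
pixel_errors, rep_errors = [], []
for _ in range(1000):
    p = rng.uniform(-1.0, 1.0)
    x = render_frame(p, rng)
    y = render_frame(p + velocity, rng)
    # Pixel-space prediction: the best guess for each noise pixel is its mean, 0.
    y_hat_pixels = [x[0] + velocity] + [0.0] * N_NOISE
    pixel_errors.append(mse(y_hat_pixels, y))
    # Representation-space prediction: compare predicted and actual encodings.
    rep_errors.append((predict_representation(encode(x), velocity) - encode(y)) ** 2)

mean_pixel_error = sum(pixel_errors) / len(pixel_errors)
mean_rep_error = sum(rep_errors) / len(rep_errors)
print("mean pixel-space error:         ", mean_pixel_error)
print("mean representation-space error:", mean_rep_error)
```

Even a perfect pixel predictor is left with a large error here because the noise cannot be predicted, while the representation-space error vanishes. With a learned encoder the same loss could also be driven to zero by outputting a constant, which is the collapse problem the summary turns to next.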

So, how do we train this model without letting it "slack off" (that is, suffer representation collapse, where it outputs only zeros or identical content for every input)? Yann LeCun introduces a new method called SIGREG (Sketched Isotropic Gaussian Regularization). It projects the representations onto many random directions and pushes each projection toward a standard Gaussian, so that the overall distribution resembles an isotropic (spherical) Gaussian. This forces every dimension of the representation to carry unique, non-redundant information.
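
The idea of testing random one-dimensional projections can be sketched in plain Python. This is a simplified stand-in, not the published algorithm: the exact test statistic is not given in this summary, so the sketch measures the gap between the empirical characteristic function of each projection and that of a standard Gaussian. The function names, the grid of `ts` values, and the number of directions are assumptions chosen for brevity.

```python
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list[float]:
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def gaussianity_gap(samples: list[float],
                    ts=(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)) -> float:
    """Squared gap between the empirical characteristic function of 1-D samples
    and that of N(0, 1), which is exp(-t^2 / 2), averaged over the grid `ts`."""
    n = len(samples)
    total = 0.0
    for t in ts:
        re = sum(math.cos(t * z) for z in samples) / n
        im = sum(math.sin(t * z) for z in samples) / n
        total += (re - math.exp(-0.5 * t * t)) ** 2 + im ** 2
    return total / len(ts)


def isotropy_penalty(embeddings: list[list[float]],
                     n_directions: int = 16, seed: int = 0) -> float:
    """Average Gaussianity gap over random 1-D projections of the embeddings."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_directions):
        u = random_unit_vector(dim, rng)
        projections = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        total += gaussianity_gap(projections)
    return total / n_directions


data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
collapsed = [[0.0] * 8 for _ in range(500)]  # encoder that outputs a constant

healthy_penalty = isotropy_penalty(healthy)
collapsed_penalty = isotropy_penalty(collapsed)
print("penalty, isotropic Gaussian embeddings:", healthy_penalty)
print("penalty, collapsed embeddings:         ", collapsed_penalty)
```

Adding such a penalty to the prediction loss makes the collapsed solution expensive: a constant output scores far worse than embeddings that are spread out like an isotropic Gaussian.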

Finally, Yann LeCun makes a highly controversial declaration: abandon pure generative models, minimize the use of reinforcement learning, and stop competing over LLMs in academia. He founded the new company Emmy Labs to work on exactly this problem: using world models to give Physical AI truly safe automatic planning in complex industrial and robotic scenarios for which no explicit formulas exist.

A faithful reconstruction and plain-language retelling of the episode, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.