Representation Learning and Predictive World Models · Saining Xie

2026-06-04 · A faithful, transcript-grounded reading by PodLens

Original episode: https://youtu.be/rIwgZWzUKm8?si=visnqbvS_b-eqcLF

Diffusion Transformer · Representation Learning · Predictive World Models · Self-Supervised Learning · Model Predictive Control

What This Episode Covers

This episode features an in-depth, marathon interview with young scientist and entrepreneur Xie Saining, alongside a brief guest segment with Tommy (Zhiyuan Zeng). The discussion centers on Xie Saining's academic and professional journey in artificial intelligence: his work on representation learning, the development of the Diffusion Transformer (DiT), and his move from academia to co-founding the startup AMI Labs with Turing Award winner Yann LeCun. The conversation explores the fundamental limitations of Large Language Models (LLMs) as world models, the definition of real intelligence, and the technical and philosophical roadmap toward building a predictive "world model" that can understand the physical world.

Timeline & Topic Map

[00:00:12 - 00:02:05] Introduction to the New York setting and Xie Saining's background as a young scientist co-founding AMI Labs with Yann LeCun.
[00:02:06 - 00:04:32] Xie Saining's personal reasons for choosing New York for his academic career and his self-described introverted nature.
[00:04:33 - 00:09:19] Childhood memories, family background, early exposure to computers, and his first experiences with online self-expression.
[00:09:20 - 00:13:49] Academic journey at SJTU's ACM Class, his self-perception as an ordinary student, and his relaxed approach to competition.
[00:13:50 - 00:20:41] The ACM Class interview, discovering computer vision, and the profound influence of senior student Hou Xiaodi.
[00:20:42 - 00:25:03] Choosing a research internship at NUS over MSRA to pursue his passion for computer vision, demonstrating early initiative.
[00:25:04 - 00:30:11] The biological and evolutionary significance of vision, referencing the Cambrian Explosion as a visual arms race.
[00:30:12 - 00:35:56] Experiences at NUS, publishing his first paper, and the unique, less competitive design of the ACM Class.
[00:35:57 - 00:43:31] PhD application struggles, being rescued by Tu Zhuowen, moving to UCSD, and learning from Tu's rigorous coding practices.
[00:43:32 - 00:52:20] PhD research on Deeply Supervised Nets (DSN) and Holistically-Nested Edge Detection (HED), and receiving a Marr Prize nomination.
[00:52:21 - 00:57:42] Doing five diverse internships during his PhD and learning to accept failures and non-linear progress.
[00:57:43 - 01:05:35] First internship at Meta/FAIR, collaborating with He Kaiming, and developing ResNeXt for the ImageNet challenge.
[01:05:36 - 01:09:10] Internship at DeepMind in London, researching reinforcement learning, and observing their unique organizational structure.
[01:09:11 - 01:12:16] Connecting his PhD research under the theme of representation learning with induced structural priors.
[01:12:17 - 01:17:56] Defining representation learning, the trap of Neural Architecture Search (NAS), and winning the Test of Time Award for DSN.
[01:17:57 - 01:25:18] Post-PhD career choices, rejecting OpenAI's offer from Ilya Sutskever, and joining FAIR.
[01:25:19 - 01:28:26] A second call from Ilya Sutskever in 2024, discussing AI's ability to love, and why he rejected SSI due to a fundamental disagreement on vision.
[01:28:27 - 01:36:33] The importance of people in research, the true purpose of publishing papers, and resisting the self-centered concept of "impact."
[01:36:34 - 01:41:34] Transitioning to NYU, the open design of the Center for Data Science, and the visionary leadership of Yann LeCun.
[01:41:35 - 01:47:53] Collaborating with Li Fei-Fei, her ability to define problems (ImageNet), and spatial intelligence.
[01:47:54 - 01:53:39] The core components of representation learning (architecture, data, objective) and the transition to self-supervised learning.
[01:53:40 - 02:01:13] The limitations of supervised learning, early creative pretext tasks, and developing MoCo (Momentum Contrast) with He Kaiming.
[02:01:14 - 02:05:08] He Kaiming's extreme focus, research taste, and how he extracts key points from literature.
[02:05:09 - 02:10:51] The 6-month research cycle, the importance of hands-on exploration, and finding "gradients" in failed experiments.
[02:10:52 - 02:17:01] The non-linear nature of research time and results, and optimizing for the maximum (signature work) rather than the average.
[02:17:02 - 02:24:07] The shift of game-setting power from academia to industry, working part-time at Google, and listing 20-25 foundational AI papers.
[02:24:08 - 02:32:04] Scaling up self-supervised learning, the development of Masked Autoencoders (MAE), and its limitations.
[02:32:05 - 02:38:39] He Kaiming's engineering of TPU infrastructure, the importance of a strong baseline, and using Excel spreadsheets to track experiments.
[02:38:40 - 02:44:34] Exploring generative models, the blur between discriminative and generative models, and the freedom of research at FAIR.
[02:44:35 - 02:55:42] Defining research taste, He Kaiming's writing habits, and the comparison between research and filmmaking (Robert McKee's Story).
[02:55:43 - 03:00:38] Transitioning from FAIR to NYU as a professor, the challenges of administrative work, and the New York AI community.
[03:00:39 - 03:08:47] The development of DiT (Diffusion Transformers) at FAIR, its rejection by CVPR, and its eventual adoption by Sora.
[03:08:48 - 03:22:55] A general discussion on starting a company (Oasis), open problems in AI (reasoning, efficiency, alignment), and advice for young researchers.
[03:22:56 - 03:26:40] The financial difficulties faced by junior faculty in the US, seeking sponsorship from Google, and starting the Cambrian project.
[03:26:41 - 03:34:45] The influence of film directors (Jia Zhangke, Bi Gan) on video understanding, and publishing Cambrian-S to define a multi-stage roadmap for multimodal AI.
[03:34:46 - 03:40:15] Defining computer vision as a fundamental perspective of intelligence rather than a specific task.
[03:40:16 - 03:47:43] The impact of LLMs on computer vision, the risk of language dependency, and the future of visual intelligence in Robotics.
[03:47:44 - 03:51:26] Focusing on the "robot brain" at the software level, and why vision does not need a traditional Scaling Law.
[03:51:27 - 03:56:03] Why language models are strongly supervised processes operating in semantic space, and the limitations of language as a communication tool.
[03:56:04 - 04:01:12] Developing V* at NYU for test-time scaling, inspiring OpenAI's "think with image," and the closing of industrial research labs.
[04:01:13 - 04:06:34] Developing REPA (representation alignment) and RAE (Representation Autoencoder), and why high-dimensional spaces are crucial for machine learning.
[04:06:35 - 04:11:15] The future of world models where language models degrade to simple communication interfaces, and why LLMs are anti-Bitter Lesson.
[04:11:16 - 04:17:23] Defining world models mathematically, their history (Kenneth Craik, Model Predictive Control), and Richard Sutton's Dyna paper (System 1 vs. System 2).
[04:17:24 - 04:25:49] The relationship between state, representation, and abstraction, and why LLMs are flawed, uncontrollable world models.
[04:25:50 - 04:32:46] Comparing different definitions of world models (Sora/Genie as world simulators, World Labs as 3D spatial intelligence, and AMI Labs as a predictive brain).
[04:32:47 - 04:41:42] Why the serialized token-based modeling of LLMs is fundamentally flawed for continuous spatial signals, and modeling P(x|y) vs. P(y).
[04:41:43 - 04:47:07] The unique Scaling Law of world models, the high-bandwidth filtering system of the human brain, and the difficulty of training world models.
[04:47:08 - 04:55:26] The data challenge of "downloading humanity," terms of service issues with YouTube, and potential product outlets (AI glasses, Robotics).
[04:55:27 - 05:01:18] The transition from representation learning to world models, and the decision to co-found AMI Labs with Yann LeCun.
[05:01:19 - 05:08:31] The suffocating environment of closed big tech labs in Silicon Valley, and how the current AI value chain misallocates resources.
[05:08:32 - 05:13:44] Yann LeCun's principled nature, his vision-driven leadership style, and why they complement each other.
[05:13:45 - 05:24:39] Why they chose New York over Silicon Valley, the team size (15 members in this segment), and advice to the Chinese AI community.
[05:24:40 - 05:31:35] Guest segment with Tommy (Zhiyuan Zeng), PhD student at NYU and co-founder/CTO of Simular AI, discussing desktop AI agent S2.
[05:31:36 - 05:39:52] Returning to Yann LeCun's personality, his defense of scientific integrity, and his management philosophy of "sailing a boat."
[05:39:53 - 06:44:24] AMI Labs' capital status (target around $1 billion, 25 initial members), attracting talent from OpenAI/Meta, Yann LeCun's diverse hobbies, and reflections on animal intelligence.

Key Claims

Computer vision is not just a specific task or field, but a fundamental perspective of intelligence that deals with continuous, high-dimensional, noisy signals and hierarchical representation. Evidence "vision in my definition it's a perspective it's not a specific task... it's the essence of intelligence" [03:35:19 - 03:39:51] Type Opinion
Large Language Models (LLMs) are fundamentally flawed as world models because they operate purely in discrete semantic/token space, which is highly redundant and lacks the capacity to model continuous spatial dynamics. Evidence "the modeling technique of language models cannot resolve the cognition of these continuous spatial signals this doesn't hold" [04:31:16 - 04:31:30] Type Opinion
Language is a highly condensed communication tool developed by humans, not a direct map of thinking or decision-making; therefore, relying solely on language models creates a "crutch" that limits the development of real intelligence. Evidence "language is a communication tool language is not a thinking map language is not even a decision-making tool... it's a crutch" [03:53:57 - 04:10:06] Type Opinion
The "Bitter Lesson" does not apply to LLMs because language itself is a highly structured, human-supervised product of civilization, whereas a true world model must spontaneously learn latent representations without human-designed linguistic constraints. Evidence "I absolutely don't think the Large Language Model is a demonstration of The Bitter Lesson... language is an extremely clever product of humans" [04:10:53 - 04:23:16] Type Opinion
A true world model is a predictive brain that characterizes environmental states to forecast the consequences of actions, enabling planning and reasoning (System 2 thinking) rather than just reactive policies (System 1). Evidence "the essence of a World Model is how to characterize a system and an environment such that you can make predictions... and this prediction can guide your action sequence" [04:17:24 - 04:17:41] Type Opinion
High-dimensional spaces are a cornerstone of machine learning because problems that cannot be separated linearly in a low-dimensional space can become linearly separable once the data is mapped into a higher-dimensional one. Evidence "you must not be afraid of high dimensions high dimensionality is in all machine learning an extremely important cornerstone" [04:04:01 - 04:05:02] Type Fact
Research is non-linear in both time and results; a researcher only needs to succeed once with a "signature work" (optimizing for the maximum, not the average) to define their career. Evidence "what you optimize for is not an average... but what you're optimizing is the maximum of your work... you only need to succeed just once in your lifetime" [02:15:08 - 02:15:37] Type Opinion
The current AI industry value chain is dominated by closed big tech labs competing on leaderboards, which misallocates resources, suffocates academic freedom, and forces researchers into short-term product cycles instead of fundamental problem-defining. Evidence "this has defined a series of benchmarks... these benchmarks define resource allocation... it sucks away the oxygen in that environment" [05:01:19 - 05:08:31] Type Opinion
"General intelligence" (AGI) is a false premise because human intelligence is highly specialized and limited by biological bandwidth; recreating the physical survival intelligence of a squirrel is a much harder problem than coding or math. Evidence "AGI is a false premise... human intelligence is a very specialized intelligence... building the intelligence of a squirrel is the hard problem" [06:07:44 - 06:13:49] Type Opinion
The future of AI lies in a multi-component cognitive architecture where the world model serves as the foundational base layer, and the language model degrades into a simple communication interface. Evidence "the future won't be like this... the language, LLM layer will gradually become... an interface of [the world model]" [04:07:11 - 04:08:05] Type Prediction. Note on Uncertainty: Xie Saining hedges this prediction by stating, "my current intuition is the model won't be that large... whether it's right or wrong we can look again in a few years."
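The claim above about high-dimensional spaces can be made concrete with a small, hedged sketch. It is not taken from the episode: the XOR dataset, the `lift` feature map, and the `train_perceptron` helper are illustrative assumptions. The sketch shows a linear classifier failing on XOR in two dimensions and succeeding once one extra coordinate is added.

```python
def train_perceptron(points, labels, epochs=100):
    """Return True if a perceptron (with bias) reaches zero training errors."""
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:               # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True                      # a separating hyperplane was found
    return False                             # no separator found within the budget


# XOR with +/-1 coding: the label is +1 when the two inputs differ.
xor_points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
xor_labels = [-1, 1, 1, -1]


def lift(p):
    """Hypothetical feature map: append the product x1 * x2 as a third coordinate."""
    x1, x2 = p
    return (x1, x2, x1 * x2)


separable_2d = train_perceptron(xor_points, xor_labels)
separable_3d = train_perceptron([lift(p) for p in xor_points], xor_labels)
print("2-D separable:", separable_2d)
print("3-D separable:", separable_3d)
```

In the lifted space the third coordinate alone carries the label, so a single hyperplane suffices. This is only a toy instance of the general point, not the argument made in the interview.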

In Plain Language

Imagine sitting down with a brilliant, incredibly humble friend who has spent years at the absolute frontier of artificial intelligence, working alongside the legends of the field. That is exactly what it feels like to listen to Xie Saining. He does not view himself as some "chosen one" or a flawless prodigy [00:01:02]; instead, he describes his trajectory as a series of non-linear, almost accidental steps guided by a stubborn insistence on doing exactly what he finds fascinating [00:09:52, 00:15:29].

His journey started in a relaxed family environment with a father who was a psychologist and media person carrying a camera everywhere [00:05:33, 00:08:20]. This early exposure to visual media and books shaped his open worldview [00:08:56]. Later, he was admitted to the prestigious ACM Class at SJTU [00:04:33]. During his entrance interview, a senior professor, Shen Enshao, asked him what books he liked [00:13:50]. Xie Saining mentioned What Is Mathematics? by Richard Courant [00:14:23]. In a beautiful twist of fate, Xie Saining is now a professor at NYU's Courant Institute of Mathematical Sciences—the very institute built by Richard Courant [00:14:55].

While at SJTU, Xie Saining discovered computer vision, influenced deeply by legendary senior student Hou Xiaodi and the books he read on consciousness and the brain [00:15:30, 00:16:24]. He explains vision not as a narrow task, but as a fundamental perspective of intelligence itself [03:35:19]. He points to the Cambrian Explosion 530 million years ago, when creatures suddenly evolved eyes, triggering a massive evolutionary arms race [00:26:30]. Vision is the only part of our brain directly exposed to the physical world [00:28:17]; therefore, solving vision is equivalent to solving intelligence itself [00:28:32].

When it came time for his third-year internship, the established path was to go to MSRA [00:20:56]. But because MSRA's vision group was reluctant to take undergrads who "didn't know anything" [00:21:42], Xie Saining took the initiative to cold-email NUS in Singapore and secured an internship on his own, demonstrating his early independent streak [00:22:57].

His PhD application process was equally rocky. He was nearly left with no offers in computer vision until he was rescued at the last minute by Tu Zhuowen [00:35:57]. When Tu Zhuowen decided to move from UCLA to UCSD, Xie Saining immediately chose to follow him, completely ignoring school rankings because he cared only about who he was working with [00:37:50, 00:39:36]. Tu Zhuowen was an incredibly rigorous mentor who would sit next to Xie Saining's monitor and go through code line-by-line [00:41:42]. Tu Zhuowen's generation had to build everything from scratch, writing 50,000 lines of C++ code just for image segmentation [00:42:23].

During his PhD, Xie Saining co-authored Deeply Supervised Nets (DSN), which addressed the vanishing gradient problem by adding intermediate supervision exits to neural networks [00:45:32, 00:47:11]. Although the paper was initially rejected by NeurIPS due to a simple typo (forgetting a squared term in a formula) [01:15:15, 01:15:49], it went on to win the Test of Time Award ten years later at AISTATS [01:16:35]. Xie Saining uses this to explain that research is not a "point estimate" where you evaluate your worth at every single moment; it is an "integral" of your lifetime accumulation [01:17:19]. He also published Holistically-Nested Edge Detection (HED), which earned a Marr Prize nomination [00:48:25, 00:50:33].
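The idea of intermediate supervision can be written compactly. The following is a hedged sketch in illustrative notation, not a formula quoted from the episode: assume a network with $M$ hidden layers, an auxiliary classifier (an "exit") attached to hidden layer $m$ with its own loss $\ell_m$, and assumed weighting hyperparameters $\alpha_m$.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{out}}(\theta) \;+\; \sum_{m=1}^{M} \alpha_m \, \ell_m\!\left(\theta_{1:m}\right)
```

Each $\ell_m$ depends only on the parameters of layers $1$ through $m$, so its gradient reaches the early layers without passing through the rest of the stack. That direct path is the intuition behind using intermediate exits against vanishing gradients.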

Xie Saining did five diverse internships during his PhD, half of which produced absolutely nothing [00:52:21, 00:54:50]. He tells his students this to show that failing to produce work during an internship is not the end of the world [00:57:04]. His turning point came during his internship at Meta's FAIR, when He Kaiming joined the lab [00:57:43]. Because He Kaiming had only programmed on Windows at Microsoft, Xie Saining had to drive him around, teach him how to use Linux, and show him how to run jobs on the cluster [00:58:17, 00:58:32]. Together, they built ResNeXt for the ImageNet challenge, a parallel network design that got second place but laid the conceptual groundwork for what we now call Mixture of Experts (MoE) [00:59:50, 01:01:57].

Xie Saining also interned at DeepMind in London during a freezing, painful winter, working on reinforcement learning (RL) and robotics [01:05:36, 01:06:11]. While he realized he disliked RL and robotics, he was fascinated by DeepMind's organizational structure, which transitioned seamlessly from bottom-up exploration to highly organized, top-down execution [01:06:42, 01:07:40]. He recalls Demis Hassabis telling interns that DeepMind's ultimate mission was to become a company that wins multiple Nobel Prizes—a claim that seemed far-fetched then but has now been realized [01:08:12, 01:08:38].

Throughout all these projects, the unifying thread is representation learning [01:09:55]. Xie Saining defines this as mapping raw data into a structured space with good properties that make downstream tasks easier [01:12:32]. He warns against chasing fleeting trends like Neural Architecture Search (NAS), which wasted two years of the entire field's time, and advocates for focusing on timeless, fundamental problems [01:13:48, 01:14:58].

His career choices highlight his commitment to this philosophy. In 2018, he interviewed at OpenAI, where John Schulman gave him interview questions handwritten in pencil on an A4 sheet of paper [01:19:30, 01:20:00]. Although he received an offer, he rejected OpenAI to join FAIR because it was the "holy temple" of computer vision, home to He Kaiming, Piotr Dollar, and Ross Girshick [01:20:13, 01:20:42]. Ilya Sutskever called him, very angry, asking if the money wasn't enough (at the time, top PhD offers were around $400k-$500k) [01:21:08, 01:21:35]. In 2024, Ilya Sutskever called him a second time after founding SSI [01:25:19]. They discussed how to give AI the ability to love (and the reality that love always brings hate) [01:25:43, 01:27:20]. When Xie Saining asked Ilya how he viewed vision and multimodality, Ilya replied that it was already "solved well enough" [01:25:54, 01:26:10]. Because Xie Saining fundamentally disagreed, he rejected SSI [01:26:19].

Xie Saining has a deep aversion to the aggressive, self-centered word "impact" [01:31:36]. Citing the political philosopher Hannah Arendt, he explains that the purpose of research is not to aggressively force change on the world, but to seek understanding and a "sense of family" by being understood by others [01:31:54, 01:32:36]. He also dislikes the phrase "Xie Saining's team" because it steals credit from the young students who actually did the hard work [01:35:26, 01:35:56].

After FAIR, Xie Saining joined NYU as a professor, drawn by the open, glass-doored Center for Data Science designed by Yann LeCun [01:36:34, 01:38:55]. He also collaborated with Li Fei-Fei, whom he admires as a master of "defining problems" [01:41:35, 01:43:18]. He notes that Li Fei-Fei's true achievement with ImageNet was not just gathering data, but clearly defining the problem of image classification when it was completely unstandardized [01:43:19, 01:43:54].

He explains the shift from supervised learning to self-supervised learning using a concrete metaphor [01:47:54]. In supervised learning, a neural network is forced to map infinite variations of a "chair" (including an avocado-shaped designer chair) to a single label, "chair" [01:54:00, 01:54:25]. To do this, the network often cheats by relying on "spurious correlations," like looking at the background or assuming a chair must be next to a table [01:54:55, 01:55:07]. Self-supervised learning aims to give AI human-like "common sense" and intuition directly from raw visual data [01:55:18, 01:55:30]. Early pretext tasks (like rotating images, colorization, or context encoders) were highly creative but performed 15-20% worse than supervised pre-training [01:56:04, 01:58:03]. This changed when he and He Kaiming developed MoCo (Momentum Contrast), which made contrastive learning work by measuring distances in representation space [01:58:31, 01:59:35].
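To make "measuring distances in representation space" concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss and a momentum update of a key encoder. It is an illustration under stated assumptions, not the MoCo implementation: the vectors stand in for encoder outputs, the helper names are hypothetical, the default values of `temperature` and `m` are illustrative, and the queue of negative keys that MoCo maintains is replaced by a plain list.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Contrastive loss: low when the query is close to its positive key
    and far from the negative keys in representation space."""
    q = normalize(query)
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negative_keys]
    top = max(logits)                                   # for numerical stability
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]                          # -log softmax of the positive


def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder slowly toward the query encoder (exponential moving average)."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Two views of the same image should score better than mismatched views.
query = [1.0, 0.0]
loss_matched = info_nce(query, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_mismatched = info_nce(query, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_matched < loss_mismatched)
```

The slow-moving key encoder keeps the keys consistent with each other across training steps, which is the design choice the name "Momentum Contrast" refers to.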

Xie Saining describes He Kaiming as the absolute best researcher he knows, possessing an extreme focus and "flow state" [02:01:04, 02:01:20]. He Kaiming taught him that research ideas cannot be dreamt up by sitting in a corner; they must be discovered through empirical exploration—a process of "stochastic gradient descent" [02:04:15, 02:07:31]. In a typical 6-month research cycle, the first 1-2 months are spent hacking and playing with code like a toy [02:05:19, 02:06:36]. By the 5th month, the researcher's mindset often collapses, only for a non-linear burst of inspiration to deliver the final result in the last month [02:10:52, 02:11:28]. The worst research ends exactly where it started because it was boring and encountered no obstacles; the best research takes a chaotic, winding path [02:09:58, 02:12:05]. Citing Bill Freeman's curve, Xie Saining notes that poor or decent work has zero career impact, but a "signature work" shoots straight to the top [02:13:47, 02:15:00]. You only need to succeed once in your life [02:15:34].

Today, the power to set the rules of the game has shifted from academia to closed industry giants like OpenAI, Google, and Meta, leaving academic researchers chasing industry with "peanuts of resources" [02:17:02, 02:18:14]. To navigate this, Xie Saining worked part-time at Google for two years to see what they were doing, so he knew exactly what not to do in academia [02:18:43, 02:19:17].

While at FAIR, he and intern Bill Peebles (now head of Sora) developed DiT (Diffusion Transformers) [03:00:39, 03:02:42]. CVPR originally rejected the paper because it was "too simple" and lacked complex math, but it eventually became the foundational backbone of Sora and almost every major video generation model today [03:06:13, 03:06:31, 03:08:24].

He also highlights the severe financial struggles of junior faculty in the US, where NSF grants average a tiny $100k/year per PI—barely enough for one student's tuition or a few GPUs [03:22:56, 03:24:21]. To secure resources, Xie Saining once had to go hiking on a trail next to Google's campus with a collaborator to pitch for sponsorship, a process he describes as "alms-seeking" [03:25:14, 03:26:00].

This resourcefulness led to the Cambrian project and Cambrian-S, a position paper defining a multi-stage roadmap for multimodal AI (from L0 language-only, to L1 show-and-tell, L2 streaming event cognition, L3 spatial cognition, and finally L4/L5 predictive world models) [03:26:33, 03:30:43]. His passion for video understanding is deeply influenced by film directors Jia Zhangke and Bi Gan [03:27:40]. Bi Gan's long takes in Kaili Blues represent how space extends time on a linear timeline [03:27:55, 03:29:04]. Life is a single long take, and video is the ultimate medium for physical world understanding [03:28:14, 03:28:30].

Xie Saining argues that Large Language Models (LLMs) are fundamentally flawed as world models because they operate purely in discrete token space and lack physical dynamics [04:24:00, 04:31:16]. Language is a highly condensed communication tool, not a direct map of thinking; relying solely on LLMs is like using a "crutch" that prevents you from training your leg muscles [03:53:57, 03:55:15]. Furthermore, LLMs are actually strongly supervised processes operating in human-curated semantic space (y-space), which violates the true spirit of the Bitter Lesson [03:51:18, 03:52:50, 04:10:53].

To illustrate the mathematical essence of a world model, he uses the transition function $S_{t+1} = F(S_t, a_t)$, where a system predicts its next state based on its current state and an action [04:11:56, 04:12:13]. This enables Model Predictive Control (MPC)—rolling out action sequences to plan and minimize cost [04:13:44, 04:14:35]. He references Richard Sutton's classic Dyna paper to contrast reactive policies (System 1) with model-based planning (System 2) [04:15:24, 04:15:47].
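The transition function and the planning loop can be sketched in a few lines. This is a toy illustration under loud assumptions, not AMI Labs' method: the environment is a hand-written 1-D point mass, `F` is given rather than learned, the state is a raw (position, velocity) pair rather than a learned representation, and the planner is simple random-shooting MPC. All names and constants are hypothetical.

```python
import random


def F(state, action):
    """Toy transition S_{t+1} = F(S_t, a_t) for a 1-D point mass."""
    pos, vel = state
    vel = vel + 0.1 * action      # the action acts as an acceleration
    pos = pos + 0.1 * vel
    return (pos, vel)


def cost(state, goal=1.0):
    """Assumed cost: squared distance to the goal plus a small velocity penalty."""
    pos, vel = state
    return (pos - goal) ** 2 + 0.01 * vel ** 2


def rollout_cost(state, actions):
    """Imagine the consequences of an action sequence using only the model F."""
    total = 0.0
    for a in actions:
        state = F(state, a)
        total += cost(state)
    return total


def mpc_action(state, rng, horizon=10, candidates=200):
    """Random-shooting MPC: sample action sequences, keep the cheapest one,
    and return only its first action."""
    best_seq, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        c = rollout_cost(state, seq)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq[0]


def run_episode(steps=60, seed=0):
    """Replan at every step, execute one action, and repeat."""
    rng = random.Random(seed)
    state = (0.0, 0.0)
    for _ in range(steps):
        state = F(state, mpc_action(state, rng))
    return state


if __name__ == "__main__":
    print(run_episode())          # final (position, velocity) after planning toward goal 1.0
```

A reactive policy (System 1 in the summary's terms) would map the state straight to an action. Here the action comes from simulating futures with `F` and comparing their costs, which is the model-based planning (System 2) side of the contrast.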

He clearly distinguishes different industry definitions of world models [04:25:50]:
1. Sora/Genie: World simulators focused on rendering visually compelling, consistent videos for humans [04:26:51, 04:27:22].
2. World Labs (Li Fei-Fei): Spatial intelligence utilizing explicit 3D representations [04:27:56, 04:28:36].
3. AMI Labs (Yann LeCun & Xie Saining): A predictive brain designed to enhance intelligence itself [04:29:12, 04:29:20].

Xie Saining notes that the human brain has an input bandwidth of 100M to 1B bits per second across all sensors, but our behavioral output bandwidth is only 10 to 100 bits per second [04:46:09, 04:46:40]. The brain is a massive, hierarchical filtering system operating on just 20 watts of power [04:46:39, 04:46:56]. To train a world model to replicate this, we must "download humanity" using massive video data [04:47:52, 04:48:45]. This presents a massive data challenge, as platforms like YouTube heavily guard their data, leading to a constant cat-and-mouse game with scraping [04:49:40, 04:50:11].
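The size of the filtering implied by these figures is easy to work out. The snippet below is simple arithmetic on the ranges quoted above. The constant and function names are illustrative, and the ratio is only an order-of-magnitude reading of the quoted numbers, not a measurement.

```python
# Figures quoted in the summary above, in bits per second.
SENSORY_INPUT_BPS = (100_000_000, 1_000_000_000)   # 100M to 1B
BEHAVIOR_OUTPUT_BPS = (10, 100)                    # 10 to 100


def reduction_range(inp, out):
    """Smallest and largest input/output ratio implied by the two ranges."""
    low = inp[0] // out[1]     # least input divided by most output
    high = inp[1] // out[0]    # most input divided by least output
    return low, high


low, high = reduction_range(SENSORY_INPUT_BPS, BEHAVIOR_OUTPUT_BPS)
print(f"implied reduction: {low:,}x to {high:,}x")
# prints "implied reduction: 1,000,000x to 100,000,000x"
```

On these numbers, between a million and a hundred million bits of input are discarded or compressed for every bit of behavioral output, which is the sense in which the brain is described as a filtering system.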

This pursuit of a true world model led to the co-founding of AMI Labs with Yann LeCun [04:55:27, 05:00:06]. Xie Saining explains that closed Silicon Valley labs have become suffocating, competitive pressure cookers that block academic freedom, hide author credits, and prevent researchers from open-sourcing their work [05:01:19, 05:02:30, 05:04:02]. Yann LeCun decided to build a research-driven startup outside this closed ecosystem [05:00:56, 05:01:42]. Yann LeCun is "very JEPA" as a person—principled, scientifically honest, and completely undisturbed by external hype [05:10:48, 05:35:07]. He manages the company like "sailing a boat," giving team members complete trust and autonomy until adjustment is needed [05:38:53, 05:39:11]. Yann LeCun is also a true multi-hyphenate with four major hobbies: building model airplanes, astrophotography, electronic/jazz music, and sailing [05:47:38, 05:48:51].

AMI Labs has raised capital targeting a $3 billion valuation and assembled an initial team of 25 world-class members [05:39:53, 05:41:37, 05:46:17]. Some members gave up tens of millions of dollars in unvested OpenAI stock to join, driven purely by the mission [05:42:32, 05:42:55].

Ultimately, Xie Saining believes that "AGI" is a false premise because human intelligence is highly specialized and limited by biological bandwidth [06:07:44, 06:08:30]. Citing the evolutionary biologist de Waal and reinforcement learning pioneer Richard S. Sutton, he notes that recreating the physical survival intelligence of a squirrel—which has its own goals, emotions, and social dynamics to survive in the real world—is a much harder problem than writing code or solving math equations [06:08:56, 06:13:16]. Once we can build the physical intelligence of a squirrel, the rest will be easy [06:13:23].

Worth a Second Listen

[01:25:19 - 01:28:26]: The second phone call from Ilya Sutskever.
Why listen: This segment captures a fascinating, high-density philosophical clash between two AI paradigms. You can hear the contrast between Ilya Sutskever's language-centric vision (and his poetic question about how to give AI the "ability to love") and Xie Saining's deep conviction that vision is far from solved. It highlights the exact moment their technical roadmaps diverged.
[02:01:14 - 02:05:08]: He Kaiming's daily research habits and focus.
Why listen: Xie Saining's tone here is filled with genuine, deep admiration. He demystifies the "genius" of He Kaiming, explaining his "flow state" and how he systematically extracts key points from literature. It is a rare, intimate look at the work ethic and "reality distortion field" of one of AI's greatest minds.
[03:51:27 - 03:56:03]: Why language is a "crutch" and LLMs are strongly supervised.
Why listen: This is a highly counterintuitive and sharp argument. Xie Saining explains why language models are actually strongly supervised processes operating in human-curated semantic space, rather than pure self-supervised systems. His use of the "crutch" metaphor is vivid and delivers a powerful critique of the current LLM hype.
[04:41:43 - 04:47:07]: The human brain's bandwidth and the filtering system.
Why listen: This segment is dense with information. Xie Saining breaks down the gap between human sensory input bandwidth (roughly 100 million to 1 billion bits/second) and our behavioral output bandwidth (roughly 10 to 100 bits/second). It provides the core biological justification for why AMI Labs is building a predictive, filtering world model rather than a generative pixel-reconstructor.
[06:07:44 - 06:13:49]: The "squirrel intelligence" argument and letting go of human arrogance.
Why listen: This is the philosophical climax of the interview. Xie Saining, citing Richard S. Sutton, argues that recreating a squirrel's physical survival intelligence is much harder than coding or solving math. The tone is deeply reflective, challenging the audience to let go of "human arrogance" when defining what real intelligence actually means.

Resonances with Past Episodes

Complements → The Reality of Frontier AI and the End of Individual Heroism · Yao Shunyu
Yao's explanation of why coding is an easy, fast-growing AI scenario (due to clear feedback and structured data) complements Xie's view that coding and math are actually much easier problems for AI to solve than the physical, continuous survival intelligence of a squirrel.
This [06:07:44 - 06:13:49] "General intelligence" (AGI) is a false premise because human intelligence is highly specialized and limited by biological bandwidth; recreating the physical survival intelligence of a squirrel is a much harder problem than coding or math.
Related [00:35:54 - 00:37:09] Programming is the fastest-growing AI scenario because it has highly explicit feedback signals and a natural data foundation in GitHub.
Parallel → The Reality of Frontier AI and the End of Individual Heroism · Yao Shunyu
Both agree that the era of relying solely on pure language models is ending, and that the future of AI development lies in moving beyond text into multimodal, physical, and world-modeling architectures.
This [04:07:11 - 04:08:05] The future of AI lies in a multi-component cognitive architecture where the world model serves as the foundational base layer, and the language model degrades into a simple communication interface.
Related [03:33:41 - 03:34:46] Purely working on language models is no longer a blue ocean; the "last train" has already left, and future opportunities lie in robotics, multi-modal generation, and applying AI to real scientific problems.
Corroborates→ The Core Algorithm of AlphaGo · Eric Jang
Jang's technical observation that language is an unsuitable medium for structured search and reasoning corroborates Xie's broader philosophical claim that language is merely a condensed communication tool rather than a direct map of thinking.
This [03:53:57 - 04:10:06] Language is a highly condensed communication tool developed by humans, not a direct map of thinking or decision-making; therefore, relying solely on language models creates a "crutch" that limits the development of real intelligence.
Related [01:47:45 - 01:50:32] Applying MCTS directly to LLM reasoning is difficult because language's action space is combinatorially larger and less discrete, making it hard to define a reliable intermediate value function for search.
Corroboration← Human Data and Robotics' GPT-3 Moment · Danfei Xu
Both point out that there is a fundamental chasm between discrete symbolic/language spaces and the continuous physical world, meaning that paths dominated by language models cannot truly solve spatial cognition and robot control problems in the physical world.
This [04:31:16 - 04:31:30] Language-modeling techniques cannot handle the cognition of these continuous spatial signals; the approach is untenable because language models operate entirely in discrete semantic/token spaces and lack the ability to model continuous spatial dynamics.
Related [55:54 - 56:16] A robotics path that treats language models (LLMs) as the foundational capability is wrong because the symbolic layer and the physical layer are too far apart to solve fine manipulation and physical common-sense problems.
Confirms← Action-Level Mental Model Dataset for Agent Collaboration · Jiaju Chen
The two are closely aligned on the mechanism of behavior prediction through internal models. Chen's finding that mental models provide incremental signals for predicting future human interactive behavior is a concrete instance, in the setting of human-agent collaboration, of the predictive-brain mechanism Xie advocates, in which a world model characterizes system and environment states to predict the consequences of actions and guide action sequences.
This [04:17:24 - 04:17:41] A true world model is a predictive brain that characterizes environmental states to predict the consequences of actions, thereby enabling planning and reasoning (System 2 thinking), rather than just reactive strategies (System 1 thinking).
Related 4.3.1. Next Action Prediction · "next action prediction is" Mental models can give agents incremental signals for predicting and simulating future human interactive behaviors, signals that traditional historical trajectories cannot provide.
Confirms← Action-Level Mental Model Dataset for Agent Collaboration · Jiaju Chen
The difficulty large models have in inferring humans' private self-reasoning confirms Xie's view that 'language is not a direct mapping of thought or decision-making,' indicating that models trained solely on linguistic text are fundamentally limited in reconstructing the deep mental logic that humans do not explicitly express.
This [03:53:57 - 04:10:06] Language is a highly compressed communication tool developed by humans, not a direct mapping of thought or decision-making; therefore, relying solely on language models creates a 'crutch' that limits the development of true intelligence.
Related 4.3.2. Mental Model Prediction · "hardest dimension to predict" Large models perform acceptably in predicting shared mental model dimensions, but face severe bottlenecks in inferring private self-reasoning.
Corroboration← The Era of Experience: Reinforcement Learning Beyond Human Data · David Silver
Both point out that human language should not be viewed as the ultimate vehicle for intelligence or thought, and that over-reliance on human language limits the development of artificial intelligence. Agents should explore non-linguistic, more efficient underlying computational and thinking mechanisms.
This [03:53:57 - 04:10:06] Language is a highly compressed communication tool developed by humans, rather than a direct mapping of thought or decision-making; therefore, relying solely on language models creates a "crutch" that limits the development of true intelligence.
Related Planning and Reasoning · "discover or improve such approaches by learning how to" Agents can use non-human languages (such as symbolic, distributed, or continuous computation) to discover or improve more efficient thinking mechanisms, without being limited to mimicking human chains of thought.
Complement← The Era of Experience: Reinforcement Learning Beyond Human Data · David Silver
Both advocate shifting the focus of artificial intelligence research from human-specific symbolic and textual interactions (such as writing code or chatting) to the more challenging embodied intelligence that autonomously interacts and survives in the real world.
This [06:07:44 - 06:13:49] AGI is a false premise... human intelligence is a highly specialized intelligence... building the intelligence of a squirrel is the real challenge
Related Actions and Observations · "act autonomously in the real world" Agents will have richer action and observation spaces to interact autonomously in the real or digital world, rather than being limited to human-privileged formats (such as pure text dialogue).

Tensions with past episodes

Contrast: Apparent tension→ The Core Algorithm of AlphaGo · Eric Jang
Jang highlights how AlphaGo successfully compresses sequential reasoning (System 2 search) into a single reactive forward pass (System 1 amortization), whereas Xie argues that a true world model must explicitly preserve predictive planning (System 2) to guide actions rather than relying on reactive policies.
This [04:17:24 - 04:17:41] The essence of a World Model is how to characterize a system and an environment such that you can make predictions... and this prediction can guide your action sequence, enabling planning and reasoning (System 2 thinking) rather than just reactive policies (System 1).
Related [01:17:31 - 01:18:46] A shallow neural network can learn to "amortize" a huge search, compressing complex, sequential reasoning into a single, parallel forward pass.
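
As a rough illustration of "prediction guiding an action sequence," the sketch below plans by imagining every short action sequence with a world model and executing only the first step of the best one, in the style of receding-horizon (model predictive) control. It is not from the episode: the one-dimensional `world_model`, the `ACTIONS` set, the distance cost, and the exhaustive search are all hypothetical stand-ins chosen to keep the example small, and a learned model in a real system would look nothing like this toy.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def world_model(state, action):
    """Toy stand-in for a learned predictor: next position = position + action."""
    return state + action


def rollout_cost(state, actions, goal):
    """Imagine the action sequence with the model and sum the distance to the goal."""
    total = 0.0
    for action in actions:
        state = world_model(state, action)
        total += abs(goal - state)
    return total


def plan(state, goal, horizon=5):
    """Return the lowest-cost sequence among all action sequences of length `horizon`."""
    return min(
        product(ACTIONS, repeat=horizon),
        key=lambda seq: rollout_cost(state, seq, goal),
    )


def run_episode(start, goal, steps=10):
    """Replan at every step and execute only the first planned action."""
    state = start
    trajectory = [state]
    for _ in range(steps):
        first_action = plan(state, goal)[0]
        # Toy assumption: the real environment behaves exactly like the model.
        state = world_model(state, first_action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    print(plan(0.0, 3.0))
    print(run_episode(0.0, 3.0, steps=5))
```

A reactive policy, by contrast, would map the current state straight to an action with no imagined rollouts; the amortization point above is that such a mapping can be trained to approximate the outcome of a search like `plan`.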

A faithful reconstruction and plain-language retelling of the episode, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.