Representation Learning and Predictive World Models · Saining Xie

2026-06-04 · A faithful, transcript-grounded reading by PodLens

Original episode: https://youtu.be/rIwgZWzUKm8?si=visnqbvS_b-eqcLF

Diffusion Transformer · Representation Learning · Predictive World Models · Self-Supervised Learning · Model Predictive Control

What This Episode Covers

This episode is an in-depth, marathon interview with the young scientist and entrepreneur Xie Saining, with a brief guest segment from Tommy (Zhiyuan Zeng). The discussion centers on Xie Saining's academic and professional journey in artificial intelligence: his work on representation learning, the development of the Diffusion Transformer (DiT), and his move from academia to co-founding the startup AMI Labs with Turing Award winner Yann LeCun. The conversation also explores the fundamental limitations of Large Language Models (LLMs) as world models, the definition of real intelligence, and the technical and philosophical roadmap toward a predictive "world model" that can understand the physical world.

Timeline & Topic Map

Key Claims

  1. Computer vision is not just a specific task or field, but a fundamental perspective on intelligence that deals with continuous, high-dimensional, noisy signals and hierarchical representation. Evidence: "vision in my definition it's a perspective it's not a specific task... it's the essence of intelligence" [03:35:19 - 03:39:51]. Type: Opinion.

  2. Large Language Models (LLMs) are fundamentally flawed as world models because they operate purely in discrete semantic/token space, which is highly redundant and lacks the capacity to model continuous spatial dynamics. Evidence: "the modeling technique of language models cannot resolve the cognition of these continuous spatial signals this doesn't hold" [04:31:16 - 04:31:30]. Type: Opinion.

  3. Language is a highly condensed communication tool developed by humans, not a direct map of thinking or decision-making; relying solely on language models therefore creates a "crutch" that limits the development of real intelligence. Evidence: "language is a communication tool language is not a thinking map language is not even a decision-making tool... it's a crutch" [03:53:57 - 04:10:06]. Type: Opinion.

  4. The "Bitter Lesson" does not apply to LLMs because language itself is a highly structured, human-supervised product of civilization, whereas a true world model must learn latent representations on its own, without human-designed linguistic constraints. Evidence: "I absolutely don't think the Large Language Model is a demonstration of The Bitter Lesson... language is an extremely clever product of humans" [04:10:53 - 04:23:16]. Type: Opinion.

  5. A true world model is a predictive brain that characterizes environmental states in order to forecast the consequences of actions, enabling planning and reasoning (System 2 thinking) rather than just reactive policies (System 1). Evidence: "the essence of a World Model is how to characterize a system and an environment such that you can make predictions... and this prediction can guide your action sequence" [04:17:24 - 04:17:41]. Type: Opinion.

  6. High-dimensional spaces are a cornerstone of machine learning because problems that cannot be linearly separated in a low-dimensional space can become separable in a higher-dimensional one. Evidence: "you must not be afraid of high dimensions high dimensionality is in all machine learning an extremely important cornerstone" [04:04:01 - 04:05:02]. Type: Fact.

  7. Research is non-linear in both time and results; a researcher only needs to succeed once with a "signature work" (optimizing for the maximum, not the average) to define their career. Evidence: "what you optimize for is not an average... but what you're optimizing is the maximum of your work... you only need to succeed just once in your lifetime" [02:15:08 - 02:15:37]. Type: Opinion.

  8. The current AI industry value chain is dominated by closed big tech labs competing on leaderboards, which misallocates resources, suffocates academic freedom, and forces researchers into short-term product cycles instead of the fundamental work of defining problems. Evidence: "this has defined a series of benchmarks... these benchmarks define resource allocation... it sucks away the oxygen in that environment" [05:01:19 - 05:08:31]. Type: Opinion.

  9. "General intelligence" (AGI) is a false premise because human intelligence is highly specialized and limited by biological bandwidth; recreating the physical survival intelligence of a squirrel is a much harder problem than coding or math. Evidence: "AGI is a false premise... human intelligence is a very specialized intelligence... building the intelligence of a squirrel is the hard problem" [06:07:44 - 06:13:49]. Type: Opinion.

  10. The future of AI lies in a multi-component cognitive architecture in which the world model serves as the foundational base layer and the language model is reduced to a simple communication interface. Evidence: "the future won't be like this... the language, LLM layer will gradually become... an interface of [the world model]" [04:07:11 - 04:08:05]. Type: Prediction. Note on uncertainty: Xie Saining hedges this prediction by stating, "my current intuition is the model won't be that large... whether it's right or wrong we can look again in a few years."

In Plain Language

Imagine sitting down with a brilliant, incredibly humble friend who has spent years at the absolute frontier of artificial intelligence, working alongside the legends of the field. That is exactly what it feels like to listen to Xie Saining. He does not view himself as some "chosen one" or a flawless prodigy [00:01:02]; instead, he describes his trajectory as a series of non-linear, almost accidental steps guided by a stubborn insistence on doing exactly what he finds fascinating [00:09:52, 00:15:29].

His journey started in a relaxed family environment. His father was a psychologist and media professional who carried a camera everywhere [00:05:33, 00:08:20], and this early exposure to visual media and books shaped his open worldview [00:08:56]. Later, he was admitted to the prestigious ACM Class at SJTU (Shanghai Jiao Tong University) [00:04:33]. During his entrance interview, a senior professor, Shen Enshao, asked him what books he liked [00:13:50]. Xie Saining mentioned What Is Mathematics? by Richard Courant [00:14:23]. In a beautiful twist of fate, Xie Saining is now a professor at NYU's Courant Institute of Mathematical Sciences, the very institute built by Richard Courant [00:14:55].

While at SJTU, Xie Saining discovered computer vision, influenced deeply by legendary senior student Hou Xiaodi and the books he read on consciousness and the brain [00:15:30, 00:16:24]. He explains vision not as a narrow task, but as a fundamental perspective of intelligence itself [03:35:19]. He points to the Cambrian Explosion 530 million years ago, when creatures suddenly evolved eyes, triggering a massive evolutionary arms race [00:26:30]. Vision is the only part of our brain directly exposed to the physical world [00:28:17]; therefore, solving vision is equivalent to solving intelligence itself [00:28:32].

When it came time for his third-year internship, the established path was to go to MSRA [00:20:56]. But because MSRA's vision group was reluctant to take undergrads who "didn't know anything" [00:21:42], Xie Saining took the initiative to cold-email NUS in Singapore and secured an internship on his own, demonstrating his early independent streak [00:22:57].

His PhD application process was equally rocky. He was nearly left with no offers in computer vision until he was rescued at the last minute by Tu Zhuowen [00:35:57]. When Tu Zhuowen decided to move from UCLA to UCSD, Xie Saining immediately chose to follow him, completely ignoring school rankings because he cared only about who he was working with [00:37:50, 00:39:36]. Tu Zhuowen was an incredibly rigorous mentor who would sit next to Xie Saining's monitor and go through code line-by-line [04:41:42]. Tu Zhuowen's generation had to build everything from scratch—writing 50,000 lines of C++ code just for image segmentation [04:42:23].

During his PhD, Xie Saining co-authored Deeply Supervised Nets (DSN), which addressed the vanishing gradient problem by attaching intermediate supervision exits to the hidden layers of a neural network [04:45:32, 04:47:11]. Although the paper was initially rejected by NeurIPS because of a simple typo (a missing squared term in a formula) [15:15, 15:49], it went on to win the Test of Time Award at AISTATS ten years later [16:35]. Xie Saining uses this to explain that research is not a "point estimate" where you evaluate your worth at every single moment; it is an "integral" of your lifetime accumulation [17:19]. He also published Holistically-Nested Edge Detection (HED), which earned a Marr Prize nomination [48:25, 50:33].
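As a rough illustration of the intermediate-supervision idea described above, the sketch below adds a weighted classification loss at each intermediate exit to the loss at the final output, so that earlier layers receive a direct training signal. The function names, the choice of cross-entropy, and the `aux_weight` value are assumptions made for this example; they are not taken from the episode or from the DSN paper.

```python
import math


def cross_entropy(probabilities, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[label])


def deeply_supervised_loss(final_probs, intermediate_probs, label, aux_weight=0.3):
    """Final-output loss plus a weighted loss for every intermediate exit.

    final_probs:        class probabilities from the last layer
    intermediate_probs: list of class-probability lists, one per intermediate exit
    aux_weight:         hypothetical weight applied to each intermediate loss
    """
    loss = cross_entropy(final_probs, label)
    for probs in intermediate_probs:
        # Each exit is penalized directly, giving earlier layers their own gradient.
        loss += aux_weight * cross_entropy(probs, label)
    return loss


if __name__ == "__main__":
    final = [0.7, 0.2, 0.1]
    exits = [[0.4, 0.3, 0.3], [0.5, 0.3, 0.2]]
    print(deeply_supervised_loss(final, exits, label=0))
```

With an empty list of exits the function reduces to ordinary single-output training, which makes the extra terms easy to see in isolation.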

Xie Saining did five diverse internships during his PhD, half of which produced absolutely nothing [52:21, 54:50]. He tells his students this to show that failing to produce work during an internship is not the end of the world [57:04]. His turning point came during his internship at Meta's FAIR, when He Kaiming joined the lab [57:43]. Because He Kaiming had only programmed on Windows at Microsoft, Xie Saining had to drive him around, teach him how to use Linux, and show him how to run jobs on the cluster [58:17, 58:32]. Together, they built ResNeXt for the ImageNet challenge—a parallel network design that got second place but laid the conceptual groundwork for what we now call Mixture of Experts (MoE) [59:50, 01:01:57].

Xie Saining also interned at DeepMind in London during a freezing, painful winter, working on reinforcement learning (RL) and robotics [01:05:36, 01:06:11]. While he realized he disliked RL and robotics, he was fascinated by DeepMind's organizational structure, which transitioned seamlessly from bottom-up exploration to highly organized, top-down execution [01:06:42, 01:07:40]. He recalls Demis Hassabis telling interns that DeepMind's ultimate mission was to become a company that wins multiple Nobel Prizes—a claim that seemed far-fetched then but has now been realized [01:08:12, 01:08:38].

Throughout all these projects, the unifying thread is representation learning [01:09:55]. Xie Saining defines this as mapping raw data into a structured space with good properties that make downstream tasks easier [01:12:32]. He warns against chasing fleeting trends like Neural Architecture Search (NAS), which wasted two years of the entire field's time, and advocates for focusing on timeless, fundamental problems [01:13:48, 01:14:58].

His career choices highlight his commitment to this philosophy. In 2018, he interviewed at OpenAI, where John Schulman gave him interview questions handwritten in pencil on an A4 sheet of paper [01:19:30, 01:20:00]. Although he received an offer, he rejected OpenAI to join FAIR because it was the "holy temple" of computer vision, home to He Kaiming, Piotr Dollar, and Ross Girshick [01:20:13, 01:20:42]. Ilya Sutskever called him, very angry, asking if the money wasn't enough (at the time, top PhD offers were around $400k-$500k) [01:21:08, 01:21:35]. In 2024, Ilya Sutskever called him a second time after founding SSI [01:25:19]. They discussed how to give AI the ability to love (and the reality that love always brings hate) [01:25:43, 01:27:20]. When Xie Saining asked Ilya how he viewed vision and multimodality, Ilya replied that it was already "solved well enough" [01:25:54, 01:26:10]. Because Xie Saining fundamentally disagreed, he rejected SSI [01:26:19].

Xie Saining has a deep aversion to the aggressive, self-centered word "impact" [01:31:36]. Citing the political philosopher Hannah Arendt, he explains that the purpose of research is not to aggressively force change on the world, but to seek understanding and a "sense of family" by being understood by others [01:31:54, 01:32:36]. He also dislikes the phrase "Xie Saining's team" because it steals credit from the young students who actually did the hard work [01:35:26, 01:35:56].

After FAIR, Xie Saining joined NYU as a professor, drawn by the open, glass-doored Center for Data Science designed by Yann LeCun [01:36:34, 01:38:55]. He also collaborated with Li Fei-Fei, whom he admires as a master of "defining problems" [01:41:35, 01:43:18]. He notes that Li Fei-Fei's true achievement with ImageNet was not just gathering data, but clearly defining the problem of image classification when it was completely unstandardized [01:43:19, 01:43:54].

He explains the shift from supervised learning to self-supervised learning using a concrete metaphor [01:47:54]. In supervised learning, a neural network is forced to map infinite variations of a "chair" (including an avocado-shaped designer chair) to a single label, "chair" [01:54:00, 01:54:25]. To do this, the network often cheats by relying on "spurious correlations," like looking at the background or assuming a chair must be next to a table [01:54:55, 01:55:07]. Self-supervised learning aims to give AI human-like "common sense" and intuition directly from raw visual data [01:55:18, 01:55:30]. Early pretext tasks (like predicting image rotations, colorization, or context encoders) were highly creative but performed 15-20% worse than supervised pre-training [01:56:04, 01:58:03]. This changed when Xie Saining and He Kaiming developed MoCo (Momentum Contrast), which made contrastive learning work by measuring distances in representation space [01:58:31, 01:59:35].
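To make "measuring distances in representation space" more concrete, here is a minimal sketch of a contrastive objective in the InfoNCE style, together with a momentum update for a slowly moving key encoder. It is only an illustration: the function names, the temperature, and the momentum value are assumptions chosen for this example, and real systems operate on batches of learned embeddings rather than hand-written vectors.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """Cross-entropy over similarities, with the positive key as the correct class."""
    q = normalize(query)
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp over all candidate keys.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_sum_exp - logits[0]


def momentum_update(key_params, query_params, momentum=0.999):
    """Move the key encoder's parameters slowly toward the query encoder's."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


if __name__ == "__main__":
    query = [1.0, 0.1]
    print(info_nce_loss(query, [0.9, 0.2], [[-1.0, 0.0], [0.0, 1.0]]))
```

The loss is small when the query sits close to its positive key and far from the negatives, which is the sense in which the objective shapes distances in the learned space.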

Xie Saining describes He Kaiming as the absolute best researcher he knows, possessing an extreme focus and "flow state" [02:01:04, 02:01:20]. He Kaiming taught him that research ideas cannot be dreamt up by sitting in a corner; they must be discovered through empirical exploration—a process of "stochastic gradient descent" [02:04:15, 02:07:31]. In a typical 6-month research cycle, the first 1-2 months are spent hacking and playing with code like a toy [02:05:19, 02:06:36]. By the 5th month, the researcher's mindset often collapses, only for a non-linear burst of inspiration to deliver the final result in the last month [02:10:52, 02:11:28]. The worst research ends exactly where it started because it was boring and encountered no obstacles; the best research takes a chaotic, winding path [02:09:58, 02:12:05]. Citing Bill Freeman's curve, Xie Saining notes that poor or decent work has zero career impact, but a "signature work" shoots straight to the top [02:13:47, 02:15:00]. You only need to succeed once in your life [02:15:34].

Today, the power to set the rules of the game has shifted from academia to closed industry giants like OpenAI, Google, and Meta, leaving academic researchers chasing industry with "peanuts of resources" [02:17:02, 02:18:14]. To navigate this, Xie Saining worked part-time at Google for two years to see what they were doing, so he knew exactly what not to do in academia [02:18:43, 02:19:17].

While at FAIR, he and intern Bill Peebles (now head of Sora) developed DiT (Diffusion Transformers) [03:00:39, 03:02:42]. CVPR originally rejected the paper because it was "too simple" and lacked complex math, but it eventually became the foundational backbone of Sora and almost every major video generation model today [03:06:13, 03:06:31, 03:08:24].

He also highlights the severe financial struggles of junior faculty in the US, where NSF grants average a tiny $100k/year per PI—barely enough for one student's tuition or a few GPUs [03:22:56, 03:24:21]. To secure resources, Xie Saining once had to go hiking on a trail next to Google's campus with a collaborator to pitch for sponsorship, a process he describes as "alms-seeking" [03:25:14, 03:26:00].

This resourcefulness led to the Cambrian project and Cambrian-S, a position paper defining a multi-stage roadmap for multimodal AI (from L0 language-only, to L1 show-and-tell, L2 streaming event cognition, L3 spatial cognition, and finally L4/L5 predictive world models) [03:26:33, 03:30:43]. His passion for video understanding is deeply influenced by film directors Jia Zhangke and Bi Gan [03:27:40]. Bi Gan's long takes in Kaili Blues represent how space extends time on a linear timeline [03:27:55, 03:29:04]. Life is a single long take, and video is the ultimate medium for physical world understanding [03:28:14, 03:28:30].

Xie Saining argues that Large Language Models (LLMs) are fundamentally flawed as world models because they operate purely in discrete token space and lack physical dynamics [04:24:00, 04:31:16]. Language is a highly condensed communication tool, not a direct map of thinking; relying solely on LLMs is like using a "crutch" that prevents you from training your leg muscles [03:53:57, 03:55:15]. Furthermore, LLMs are actually strongly supervised processes operating in human-curated semantic space (y-space), which violates the true spirit of the Bitter Lesson [03:51:18, 03:52:50, 04:10:53].

To illustrate the mathematical essence of a world model, he uses the transition function $S_{t+1} = F(S_t, a_t)$, where a system predicts its next state based on its current state and an action [04:11:56, 04:12:13]. This enables Model Predictive Control (MPC)—rolling out action sequences to plan and minimize cost [04:13:44, 04:14:35]. He references Richard Sutton's classic Dyna paper to contrast reactive policies (System 1) with model-based planning (System 2) [04:15:24, 04:15:47].
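The transition function and the planning loop described above can be sketched in a few lines. This is a toy example under stated assumptions, not AMI Labs' method: the one-dimensional point-mass dynamics, the quadratic cost, the random-shooting planner, and every name and constant below are hypothetical and were chosen only to show how $S_{t+1} = F(S_t, a_t)$ is rolled out to pick actions.

```python
import random


def transition(state, action):
    """Toy world model F: returns the next state S_{t+1} from (S_t, a_t).

    The state is (position, velocity); the action is an acceleration.
    """
    position, velocity = state
    velocity = velocity + 0.1 * action
    position = position + 0.1 * velocity
    return (position, velocity)


def cost(state, goal):
    """Penalize distance from the goal and, more lightly, residual speed."""
    position, velocity = state
    return (position - goal) ** 2 + 0.1 * velocity ** 2


def rollout_cost(state, actions, goal):
    """Imagine an action sequence with the model and sum the predicted costs."""
    total = 0.0
    for action in actions:
        state = transition(state, action)
        total += cost(state, goal)
    return total


def plan(state, goal, horizon=10, candidates=200, rng=None):
    """Random-shooting MPC: sample sequences, keep the cheapest, return its first action."""
    rng = rng or random.Random(0)
    best_actions, best_cost = None, float("inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        candidate_cost = rollout_cost(state, actions, goal)
        if candidate_cost < best_cost:
            best_actions, best_cost = actions, candidate_cost
    return best_actions[0]


def run_mpc(start, goal, steps=60, seed=0):
    """Replan at every step, executing only the first action of each plan."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        action = plan(state, goal, rng=rng)
        state = transition(state, action)
    return state


if __name__ == "__main__":
    print(run_mpc(start=(0.0, 0.0), goal=1.0))
```

The contrast with a reactive policy is visible in `plan`: no action is chosen until its consequences have been predicted with the model, which is the System 2 style of behavior the paragraph describes.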

He clearly distinguishes different industry definitions of world models [04:25:50]:

  1. Sora/Genie: World simulators focused on rendering visually compelling, consistent videos for humans [04:26:51, 04:27:22].

  2. World Labs (Li Fei-Fei): Spatial intelligence utilizing explicit 3D representations [04:27:56, 04:28:36].

  3. AMI Labs (Yann LeCun & Xie Saining): A predictive brain designed to enhance intelligence itself [04:29:12, 04:29:20].

Xie Saining notes that the human brain takes in roughly 100M to 1B bits per second across all its sensors, while its behavioral output bandwidth is only 10 to 100 bits per second [04:46:09, 04:46:40]. The brain is a huge hierarchical filtering system running on just 20 watts of power [04:46:39, 04:46:56]. To train a world model to replicate this, we must "download humanity" from video data at scale [04:47:52, 04:48:45]. That is a serious data challenge in its own right, because platforms like YouTube guard their data closely, which leads to a constant cat-and-mouse game with scraping [04:49:40, 04:50:11].

This pursuit of a true world model led to the co-founding of AMI Labs with Yann LeCun [04:55:27, 05:00:06]. Xie Saining explains that closed Silicon Valley labs have become suffocating, competitive pressure cookers that block academic freedom, hide author credits, and prevent researchers from open-sourcing their work [05:01:19, 05:02:30, 05:04:02]. Yann LeCun decided to build a research-driven startup outside this closed ecosystem [05:00:56, 05:01:42]. Yann LeCun is "very JEPA" as a person—principled, scientifically honest, and completely undisturbed by external hype [05:10:48, 05:35:07]. He manages the company like "sailing a boat," giving team members complete trust and autonomy until adjustment is needed [05:38:53, 05:39:11]. Yann LeCun is also a true multi-hyphenate with four major hobbies: building model airplanes, astrophotography, electronic/jazz music, and sailing [05:47:38, 05:48:51].

AMI Labs has raised capital targeting a $3 billion valuation and assembled an initial team of 25 world-class members [05:39:53, 05:41:37, 05:46:17]. Some members gave up tens of millions of dollars in unvested OpenAI stock to join, driven purely by the mission [05:42:32, 05:42:55].

Ultimately, Xie Saining believes that "AGI" is a false premise because human intelligence is highly specialized and limited by biological bandwidth [06:07:44, 06:08:30]. Citing the evolutionary biologist de Waal and reinforcement learning pioneer Richard S. Sutton, he notes that recreating the physical survival intelligence of a squirrel—which has its own goals, emotions, and social dynamics to survive in the real world—is a much harder problem than writing code or solving math equations [06:08:56, 06:13:16]. Once we can build the physical intelligence of a squirrel, the rest will be easy [06:13:23].


Worth a Second Listen

Resonances with past episodes

Tensions with past episodes
