Original episode: https://youtu.be/__P5yygfRRQ?si=BWi-1loCvJIIep2i
This episode explores the evolutionary path of robot learning, particularly how "human data" serves as the core fuel driving robotics toward its "GPT-3 moment." The guest is Danfei Xu, an Assistant Professor at Georgia Tech, who defines himself as a full-stack roboticist. The core question of the podcast is: how robot learning can break free from the blind worship of reinforcement learning and achieve generalization in low-level fine manipulation and physical common sense through large-scale, or even unintentionally collected, human data.
The episode unfolds in three parts. First, it reviews Danfei Xu's interest-driven path into the field, including how he landed serious research opportunities as an undergraduate by cold-calling companies and showing up in person, and his firm choice of robotics during his PhD at Stanford. Next, it tells how he discovered the immense potential of behavior cloning during his internship at DeepMind, then returned to school to build a single-arm teleoperation and behavior cloning system with his collaborator. Finally, it covers the EgoMimic project he is now advancing, the multimodal value of human data (video, SLAM, tactile, etc.), his full-stack view of robotic systems, and his outlook on where the field is ultimately headed.
Behavior cloning is extremely effective in robot learning, but was long undervalued due to academia's politically correct bias toward reinforcement learning.
The approach to robot planning dominated by Large Language Models (LLMs) has fundamental limitations: the symbolic layer is too far from the physical layer to solve the core problems of robotics, namely fine manipulation and physical common sense.
Human egocentric video data achieves an excellent balance between fidelity and scalability, serving as the most critical fuel for training low-level fine manipulation in robots.
Sub-centimeter-level SLAM and VIO technologies are the engineering moats for extracting action labels from human data, and the state-of-the-art level of this technology is currently monopolized by AR/VR giants like Meta and Apple.
It is currently extremely difficult to form a unified representation for tactile sensing, because the sensors differ widely in physical properties and lack standardized dimensions; in the future, wrist cameras might replace some tactile functions.
Human data and humanoid robots empower each other: human data gives meaning to the existence of humanoid hardware, while humanoid hardware lowers the difficulty of human-to-robot transfer.
For robotics to reach its "GPT-3 moment" (i.e., achieving a 40%-50% success rate in performing anything a human can do in any scenario), approximately 100 million hours of high-quality human data are needed.
Human data truly rich in physical intelligence is often data generated "unintentionally" in daily life, rather than data deliberately demonstrated in data collection centers to complete specific tasks.
Robotics is a full-stack system problem where algorithms, software, hardware, data collection, and evaluation loops are all indispensable; researchers must have a sufficiently deep understanding of every detail of the system.
Imagine a robotics expert friend sitting next to you who is both hardcore in tech and incredibly down-to-earth. He takes a sip of water and starts chatting with you, laying out cutting-edge insider knowledge and the future directions of robotics in the most practical terms.
This friend is Danfei Xu. He has taken the road less traveled since childhood. During middle and high school, he bought parts on Taobao to tinker with microcontroller cars [00:03:22]. Strongly resistant to exam-oriented education and tests, he was driven almost entirely by interest [00:03:39, 00:04:07]. In high school, he gathered information through QQ groups and old online forums and managed to put together his own application to Dickinson College in the US [00:06:42, 00:09:26]; he later transferred to Columbia University [00:13:35].
Danfei Xu's love for robotics is an obsession with "making things move and touching the hardware firsthand" [00:18:55, 00:19:09]. During his freshman and sophomore years, he was extremely eager to do research, so he did something incredibly crazy: he searched Google for over 20 companies with "robot" in their names and cold-called them one by one to ask if they needed interns [00:14:11, 00:14:24]. In the end, the head of the tactile sensor company SynTouch LLC was moved by this undergraduate who was passionate despite his somewhat stumbling English, and offered him an unpaid internship [00:15:21, 00:15:57]. There, he touched the expensive Shadow dexterous hand for the first time, and even broke a few [00:18:16, 00:18:29]. Later, hearing about a summer program at Carnegie Mellon University (CMU), he drove four hours to knock on a professor's door, secured an internship, and spent his days driving around the streets of Pittsburgh collecting localization data for autonomous driving [00:19:37, 00:20:21, 00:21:31].
In 2015, he went to Stanford to pursue his PhD [00:23:59]. At that time, Stanford was practically a "desert" for robotics, and "robot learning" was nowhere near the hot topic it is today [00:25:12, 00:25:22]. After he completed rotations in computer vision, virtual reality, and other directions, his advisor Fei-Fei Li asked if he wanted to continue working on Scene Graphs. He resolutely refused, saying he had to go back to robotics [00:27:23, 00:28:51].
At the time, academia was in a very peculiar mood: everyone was wildly worshipping reinforcement learning, believing that robots should explore on their own by trial and error (motor babbling) in the environment, just like AlphaGo playing Go [00:30:44, 00:31:02]. Meanwhile, "behavior cloning" (a strongly supervised approach that learns exactly how humans do it) was looked down upon as a "shameful" low-level method, with people instinctively feeling it wasn't advanced enough and lacked generalization ability [00:31:48, 00:38:13].
But during his internship at DeepMind in 2019, Danfei Xu witnessed a fact firsthand: behavior cloning is actually extremely effective and can solve the vast majority of problems [00:37:42, 00:40:39]. DeepMind forcibly suppressed behavior cloning research at the time simply because its flagship research direction was reinforcement learning, and behavior cloning was "not politically correct" there [00:37:48, 00:38:07].
To break this "behavior cloning shame" [00:41:57], Danfei Xu returned to school and hit it off with his close friend Ajay [00:41:19, 00:42:21]. They spent three months, working until 3 AM every day [00:49:12, 00:49:24], to build an extremely smooth teleoperation and behavior cloning system on a Franka Panda robotic arm [00:42:37, 00:42:43]. They made some off-the-cuff decisions, such as mounting a wrist camera, using ResNet-18 to extract features, and adding a Recurrent Neural Network (RNN) to bring in historical information [00:43:01 - 00:43:25]. As a result, the robot performed a continuous, complex 30-second action sequence that no one had achieved before (pulling a tray out of an oven, putting something in, and closing the oven) [00:43:42]. Although they had to dress the work up with some academic novelty to get the paper published [00:44:12], this attempt made them firmly believe that behavior cloning was the way forward.
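Stripped of cameras and neural networks, behavior cloning reduces to supervised regression from what the demonstrator observed to what the demonstrator did. The following is a minimal sketch under loud assumptions, not the system described above: the task is a one-dimensional toy, `expert_action` is a hypothetical stand-in for a human teleoperator, and a linear policy fit by least squares takes the place of the ResNet-18 and RNN.

```python
# Toy behavior cloning: fit a policy to (observation, action) pairs recorded
# from a demonstrator. Everything here is illustrative, not the real system.

def expert_action(obs: float) -> float:
    """Hypothetical demonstrator: step halfway toward a target at position 1.0."""
    return 0.5 * (1.0 - obs)

def fit_linear_policy(observations, actions):
    """Closed-form least squares for the model action = w * obs + b."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_a - w * mean_o
    return w, b

# "Teleoperation" data: record what the demonstrator saw and what it did.
demo_obs = [i / 10 for i in range(-10, 11)]
demo_act = [expert_action(o) for o in demo_obs]

# Supervised learning step: no reward signal and no trial and error.
w, b = fit_linear_policy(demo_obs, demo_act)

def cloned_policy(obs: float) -> float:
    return w * obs + b

# Roll out the cloned policy from a start position that is not in the demos.
position = -0.73
for _ in range(20):
    position += cloned_policy(position)

print(f"w={w:.3f}, b={b:.3f}, final position={position:.4f}")
```

Because the toy demonstrator is exactly linear, the fit recovers it and the rollout settles on the target. Real demonstrations are noisy and high-dimensional, which is where learned visual features and a memory of past observations come in.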
Now, Danfei Xu is an Assistant Professor at Georgia Tech [00:21:12, 02:06:06]. He points out that traditional robotics relies on humans writing out complex physical dynamics equations and then solving optimization problems [01:00:16, 01:00:38], whereas robot learning is data-driven, replacing those methods with machine learning models [01:00:01]. He believes that what is most overrated in the industry today is algorithms and models themselves, while what is most underrated is the "system": the engineering work of deep hardware-software integration [01:01:37, 01:01:49].
Following this line of thought, he launched the EgoMimic project [01:04:02]. Since behavior cloning requires massive amounts of data, and the most scalable data is "human data" [01:23:53], they collaborated deeply with Meta's Project Aria team. They had collectors wear Meta's smart glasses [01:06:40] to directly collect first-person manipulation videos of humans in daily life, utilizing the glasses' built-in cameras, hand pose estimation, and localization functions [01:06:52].
There is a very core theory here: Danfei Xu breaks down the interaction between humans and the physical world into three sub-problems [01:12:44]: 1. How the world should change (e.g., the cup needs to move forward by 5 cm) [01:13:05]. 2. How a certain body structure interacts with the world to cause this change (e.g., where the hand should pinch the cup) [01:13:29]. 3. How control signals are generated inside the body (e.g., how muscle electromyography signals propagate, or how much force the motor needs to exert) [01:13:16, 01:13:29].
For points 1 and 2, human data and robot data are highly compatible, and robots can directly imitate them [01:13:42, 01:14:00]. However, point 3 (specific joint forces and actuation) cannot be directly learned from video [01:15:01].
Many people might wonder: since massive data is needed, why not use the vast amount of third-person videos on YouTube? Isn't that more readily available? Danfei Xu explains a counterintuitive truth: although YouTube videos are abundant (highly scalable), their data distribution is extremely chaotic, with diverse backgrounds and perspectives, making it incredibly difficult for robots to normalize and align them with their own actions [01:16:22]. Egocentric video (ego video), by contrast, requires active collection but hits the sweet spot between "data fidelity" and "scalability" [01:15:58].
To convert human first-person videos into action labels that robots can understand, the collection system must know the precise positions of the human head (camera) and hands in 3D space. This requires sub-centimeter-level SLAM (Simultaneous Localization and Mapping) and VIO (Visual-Inertial Odometry) technologies [01:20:22, 01:21:56]. SLAM at this level of precision is currently monopolized by AR/VR giants like Meta and Apple, and the gap between them and the open-source community and academia is vast [01:24:33, 01:25:10].
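The geometry behind this requirement can be sketched in a few lines. The hand is observed relative to a moving camera, so each observation has to be mapped into a fixed world frame using the head pose that SLAM/VIO provides; only then do frame-to-frame differences describe what the hand actually did. The sketch below is illustrative only: the function names, the yaw-only head rotation, the simplified axis convention, and the sample numbers are all assumptions, and it does not use any real device API.

```python
import math

def yaw_rotation(yaw_rad):
    """3x3 rotation about the vertical (z) axis; a simplifying assumption,
    since a real head pose carries a full 3D orientation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def camera_to_world(head_yaw, head_position, point_cam):
    """Apply p_world = R * p_cam + t using the head pose for this frame."""
    rot = yaw_rotation(head_yaw)
    return [
        sum(rot[i][j] * point_cam[j] for j in range(3)) + head_position[i]
        for i in range(3)
    ]

def deltas(points):
    """Frame-to-frame displacement, used here as a simple action label."""
    return [[b[k] - a[k] for k in range(3)] for a, b in zip(points, points[1:])]

# Hypothetical frames: (head yaw, head position, hand position in camera frame).
frames = [
    (0.0,         [0.0, 0.0, 0.0], [0.3, 0.0, 0.0]),
    (0.0,         [0.1, 0.0, 0.0], [0.2, 0.0, 0.0]),    # head moves 10 cm, hand is still
    (math.pi / 2, [0.1, 0.0, 0.0], [0.0, -0.25, 0.0]),  # head turns 90 degrees, hand moves 5 cm
]

hand_world = [camera_to_world(yaw, pos, hand) for yaw, pos, hand in frames]
world_labels = deltas(hand_world)                       # uses the head pose
naive_labels = deltas([hand for _, _, hand in frames])  # ignores the head pose

for step, (good, bad) in enumerate(zip(world_labels, naive_labels)):
    print(step, [round(v, 3) for v in good], [round(v, 3) for v in bad])
```

In this toy data the camera-frame differences mostly reflect head motion, while the world-frame labels recover the hand's own motion. Any error in the head pose lands directly in the label, so a hand motion of a few centimeters leaves little room for centimeter-scale drift.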
Among the various modalities of human data, Danfei Xu gives his ranking of importance: egocentric video is the most important, hand pose is second, and language annotation is third [01:27:37, 01:32:18]. Although tactile and force sensing are fundamentally very important (since a robot is essentially a "force-exerting engine") [01:28:37], current tactile sensor solutions are too numerous and chaotic, lacking standardized representations [01:29:16, 01:29:40]. Therefore, at this stage, we might still have to rely on wrist cameras to replace part of the tactile functions [01:27:53, 01:30:05].
Currently, a transitional form between human data and robot data is also popular in the industry, such as UMI (Universal Manipulation Interface) [01:33:03]. It involves a human holding a mechanical gripper to manipulate objects [01:33:20]. Although this method sacrifices the dexterity of human fingers [01:35:50], it eliminates the "embodiment gap" on the end-effector [01:33:45], allowing the collected data to be directly deployed on robots [01:33:51].
In Danfei Xu's view, human data and humanoid robots empower each other: if a robot does not look like a human, it is very difficult to transfer human data to it; conversely, a humanoid robot that does not use human data has a human-like body for nothing, with no idea how to perform fine manipulation in the physical world [01:39:16].
He predicts that to achieve truly human-level behavior cloning (i.e., passing the physical world Turing test), approximately 100 million hours of high-quality human data are needed [01:43:23, 01:47:34]. Moreover, this data cannot consist entirely of artificial, staged demonstrations in a studio; it must contain a large amount of "unintentional" data from daily life (such as closing a drawer with an elbow or foot, or avoiding obstacles when grabbing something) [01:49:18, 01:49:30]. This is because only this unintentional data contains humans' richest "physical common sense" [01:49:42, 01:57:07].
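To get a feel for the scale of that figure, a back-of-the-envelope calculation helps. The 100 million hours comes from the episode; the fleet sizes and daily recording times below are purely hypothetical inputs chosen for illustration, not numbers from the source.

```python
TARGET_HOURS = 100_000_000  # figure quoted in the episode

def days_to_collect(total_hours, wearers, hours_per_wearer_per_day):
    """Calendar days needed if every wearer records the same amount each day."""
    return total_hours / (wearers * hours_per_wearer_per_day)

# Hypothetical scenarios (assumed for illustration, not from the source).
small_fleet_days = days_to_collect(TARGET_HOURS, wearers=10_000,
                                   hours_per_wearer_per_day=4)
large_fleet_days = days_to_collect(TARGET_HOURS, wearers=1_000_000,
                                   hours_per_wearer_per_day=1)

print(f"10,000 wearers at 4 h/day: {small_fleet_days:,.0f} days "
      f"(~{small_fleet_days / 365:.1f} years)")
print(f"1,000,000 wearers at 1 h/day: {large_fleet_days:,.0f} days")
```

Under these assumed inputs, the target takes years for a dedicated collection team of ten thousand people but only months if devices are worn in passing by a very large population, which fits the emphasis on unintentional everyday data.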
Finally, for young researchers or startups wanting to make a mark in this field, he offers very sincere advice: When facing the "buy or build" decision, hardware and data from suppliers can be bought, but the team must absolutely never treat the evaluation and training loop as a black box. It must be deeply integrated internally to understand exactly how each small piece of data changes the system's behavior [02:01:11, 02:01:31, 02:02:06].
In his mind, robotics' "GPT-3 moment" is when a robot can perform anything a human can do in any scenario with a 40% to 50% success rate [02:14:18]. To achieve this goal, people need to shake off the FOMO (fear of missing out) mentality [02:11:26], patiently cultivate their own "taste" [02:10:23], and be willing to become a "full-stack" researcher who understands everything, even to the point of soldering a broken motor themselves [02:09:04, 02:09:12].
A faithful reconstruction and plain-language retelling of the episode, generated by PodLens.
This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.