The Era of Experience: Reinforcement Learning Beyond Human Data · David Silver

2026-06-06 · A faithful, transcript-grounded reading by PodLens

Source paper: https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf

experience streams · grounded rewards · bi-level optimisation · self-play · world models

What This Paper Is About

This paper is written by David Silver and Richard S. Sutton (the text is a preprint of a chapter in the forthcoming book Designing an Intelligence to be published by MIT Press). The paper explores how artificial intelligence (AI) is at a critical turning point, transitioning from the "Era of Human Data" to the "Era of Experience". The authors point out that although current AI (such as Large Language Models, or LLMs) has achieved immense success by training on massive amounts of human-generated data, this approach of relying on imitating humans is approaching its limits in many important domains. To achieve superhuman intelligence, AI must shift to an entirely new paradigm—the "Era of Experience"—where agents learn primarily by continuously interacting with their environment, autonomously generating experience, and utilizing reinforcement learning (RL) algorithms. The paper elaborates on the four core characteristics of the Era of Experience (streams, actions and observations, rewards, planning and reasoning), argues that current technology already possesses the foundation to realize this transition, and explores the potential societal impacts and safety benefits brought by this paradigm.

Paper Skeleton

Problem Solved: The limitations of the Era of Human Data. Progress driven by supervised learning on human data is slowing down as high-quality data is on the verge of depletion, and it cannot generate new insights that go beyond current human understanding.
Anchor: The Era of Human Data · "The pace of progress driven solely by supervised learning"
Core Claim: AI is entering the Era of Experience, where agents will learn continuously primarily through experience generated by their own interactions with the environment, thereby breaking through the limitations of human-centric AI systems to achieve superhuman capabilities.
Anchor: The Era of Experience · "experience will become the dominant medium of improvement"
Argumentation Style: The paper adopts the argumentative chain of a position paper. It first points out the bottleneck of human data through logical reasoning, and then demonstrates the feasibility of experiential learning through recent case studies (such as AlphaProof's performance in the Mathematical Olympiad and DeepSeek-R1's reinforcement learning practices). Subsequently, the authors build a theoretical framework for the Era of Experience across four dimensions (streams, actions and observations, rewards, planning and reasoning), conduct feasibility arguments in conjunction with classic reinforcement learning (RL) concepts, and finally dialectically analyze the consequences of this paradigm regarding safety and societal impact.
Key Evidence & Examples:
AlphaProof: The first program to reach the silver medal standard in the International Mathematical Olympiad. Starting from roughly 100,000 human-generated formal proofs, its reinforcement learning algorithm went on to generate 100 million more by interacting with a formal proof system, thereby exploring mathematical possibilities beyond human preconceptions.
- Anchor: The Era of Experience · "AlphaProof’s reinforcement learning (RL) algorithm subsequently generated"
DeepSeek: Its recent work (DeepSeek-R1) demonstrates the power of reinforcement learning; without explicit teaching, and solely by providing the correct incentives, the model autonomously develops advanced problem-solving strategies.
- Anchor: The Era of Experience · "autonomously develops advanced problem-solving strategies"
Historical Reinforcement Learning Systems: Such as AlphaZero discovering fundamentally new strategies in Go and chess through self-play, proving the agent's ability to self-discover knowledge and its scalability.
- Anchor: Why Now? · "AlphaZero discovered fundamentally new strategies for chess"
Acknowledged Boundaries & Limitations:
Physical Time Constraints: Progress relying on physical experience is inherently constrained by the actual time it takes to execute actions and observe results in the real world, which cannot be achieved overnight.
- Anchor: Consequences · "inherently constrained by the time it takes to"
Reduced Interpretability: Moving away from human data and human thought patterns may make future AI systems harder for humans to understand and interpret.
- Anchor: Consequences · "may also make future AI systems harder to interpret"
No Absolute Guarantee of Alignment: Although reward functions can be adjusted through experience and bi-level optimization to correct biases, there is still no absolute guarantee of perfect alignment with human goals.
- Anchor: Consequences · "there is no guarantee of perfect alignment"

Core Arguments List

The human data dividend is facing its limits and cannot lead AI to superhuman intelligence. - Anchor: The Era of Human Data · "The pace of progress driven solely by supervised learning" - Type: Claim - Author's Reservations: None. The authors explicitly point out that the pace of progress driven solely by supervised learning is slowing down, and human data cannot capture new scientific breakthroughs that go beyond current human understanding.
Agents in the Era of Experience will exist in long-term, uninterrupted streams of experience, rather than brief, single-interaction episodes. - Anchor: Streams · "An experiential agent can continue to learn throughout a lifetime" - Type: Definition - Author's Reservations: None.
Agents will have richer action and observation spaces, interacting autonomously in the real or digital world, rather than being limited to human-privileged formats (such as pure text dialogue). - Anchor: Actions and Observations · "act autonomously in the real world" - Type: Claim - Author's Reservations: None.
Relying on rewards based on human prejudgement sets an insurmountable ceiling on agent performance, whereas grounded rewards from the environment allow agents to discover new strategies that go beyond existing human knowledge. - Anchor: Rewards · "Relying on human prejudgement in this manner usually leads" - Type: Claim - Author's Reservations: None.
A bi-level optimisation process can combine the user's macro-level goals with grounded signals from the environment, allowing the reward function to be flexibly adjusted and alignment biases corrected under user guidance. - Anchor: Rewards · "optimises user feedback as the top-level goal" - Type: Claim - Author's Reservations: The authors point out that this is merely "sketching one possible way in which these requirements might be met" and acknowledge that other viable approaches may exist.
Agents can utilize non-human languages (such as symbols, distributed, or continuous computation) to discover or improve more efficient thinking mechanisms, without being limited to imitating human chains of thought. - Anchor: Planning and Reasoning · "discover or improve such approaches by learning how to" - Type: Prediction - Author's Reservations: None.
Agents must interact with the real world (grounding) to test and overturn erroneous thinking assumptions inherited from human data, avoiding becoming an "echo chamber" of existing knowledge. - Anchor: Planning and Reasoning · "grounding provides a feedback loop, allowing the agent to" - Type: Claim - Author's Reservations: None.
The arrival of the Era of Experience provides an opportunity to revisit and improve classic reinforcement learning concepts (such as value functions, exploration, world models, and temporal abstraction), thereby paving the way to truly superhuman intelligence. - Anchor: Reinforcement Learning Methods · "pave the way to truly superhuman intelligence" - Type: Claim - Author's Reservations: None.
Although experiential learning will increase certain safety risks, it also brings unique safety benefits, such as the agent's ability to dynamically adapt to environmental changes and the incremental correction of its reward function through experience. - Anchor: Consequences · "experiential learning will increase certain safety risks" - Type: Claim - Author's Reservations: The authors acknowledge that the transition to the Era of Experience indeed requires further research to ensure safety.

Plain English Retelling

What is this paper wrestling with? Simply put, it's wrestling with the idea that "AI can only get smarter by imitating humans." Current AI (like various Large Language Models, LLMs) is indeed very impressive—it can write poetry and code—but it is all trained by "reading data written by humans." The authors point out that this trick of relying on imitating humans is almost at its end, because high-quality human data is about to be sucked dry. If we want AI to possess "superhuman intelligence" that surpasses humans, we must let AI live differently—transitioning from the "Era of Human Data" into the "Era of Experience." In other words, AI can no longer just be a "bookworm." Instead, it needs to be like a child, interacting with the world and exploring on its own, learning from its own firsthand "experience" through "reinforcement learning (RL)."

The paper's logic unfolds like this: First, the authors point out the dead end of the "Era of Human Data." Relying solely on imitating humans, the pace of AI's progress has noticeably slowed down (The Era of Human Data · "The pace of progress driven solely by supervised learning"). More importantly, those new scientific theorems and technologies that humans haven't discovered yet simply do not exist in human data, so how could AI learn them just by reading books?

Therefore, AI must shift to the "Era of Experience," generating its own data through interaction with the environment and using this as the engine for continuous progress (The Era of Experience · "experience will become the dominant medium of improvement"). The authors give two fresh examples: One is AlphaProof. When solving Olympiad math problems, humans only gave it 100,000 proofs. It played with the formal proof system on its own and managed to autonomously generate 100 million proofs, exploring mathematical solutions that humans had never even thought of (The Era of Experience · "AlphaProof’s reinforcement learning (RL) algorithm subsequently generated"). The other is the recently wildly popular DeepSeek-R1, which proves the magic of reinforcement learning: we don't need to teach it step-by-step how to think; as long as we give it the right incentives (like giving it a candy when it gets it right), it can figure out super powerful problem-solving strategies on its own (The Era of Experience · "autonomously develops advanced problem-solving strategies").

Next, the authors describe the four core characteristics of the "Era of Experience." These four features are highly counterintuitive and shatter our current understanding of AI:

First, "Streams". Current AI has a "goldfish brain": you ask a question, it answers, and once the conversation ends, it forgets everything. But an AI in the Era of Experience possesses a long-term, uninterrupted "stream of experience" just like a human (Streams · "An experiential agent can continue to learn throughout a lifetime"). It can keep learning over a lifetime, remember past lessons, and choose each step in service of long-term goals months or even years away (such as helping a user improve their health, learn a new language, or research new materials), even if that step brings no immediate benefit.

Second, "Actions and Observations". Current AI is trapped in a chat box, only able to engage in word battles with humans. Future AI will be like animals, possessing rich action and observation capabilities, able to act autonomously in the real or digital world (Actions and Observations · "act autonomously in the real world"). For example, calling APIs on its own, using human computer interfaces by itself, or even using robotic arms to conduct experiments in a lab. It will no longer just listen to humans, but will be able to explore the world on its own.

Third, "Rewards". This is the most counterintuitive part: the authors believe that relying on humans to score AI (judging good or bad) is actually harming AI. Because if a strategy is something even human experts cannot understand or comprehend, humans will give it a low score, which directly puts a ceiling on the AI's capabilities (Rewards · "Relying on human prejudgement in this manner usually leads"). To surpass humans, the AI's rewards must come from the environment itself (such as whether heart rate has improved, whether it scored points on an exam, or whether a material in a physics simulator is sturdy). So how do we ensure AI doesn't spiral out of control? The authors propose a concept of "bi-level optimisation": the top level is driven by feedback from human users to determine the general direction, while the bottom level is automatically optimized by the AI for specific signals in the environment (Rewards · "optimises user feedback as the top-level goal"). This way, it can both listen to humans and run wild autonomously.

Fourth, "Planning and Reasoning". Current AI thinks in human language (as in "chain of thought"). But is human language really the best medium for thought? Not necessarily. AI could instead use non-human languages, such as symbolic, distributed, or continuous computation, to discover more efficient thinking mechanisms on its own (Planning and Reasoning · "discover or improve such approaches by learning how to"). More importantly, AI must test its ideas against the real world. If it only circles within human data, AI will become a "bias parrot," inheriting all of humanity's biases and mistakes. Only by running into reality can AI overturn those incorrect assumptions (Planning and Reasoning · "grounding provides a feedback loop, allowing the agent to").
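The bi-level optimisation described under "Rewards" can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's method: the action names, signal values, candidate weights, and the `user_feedback` stand-in are all hypothetical, and the upper level is a plain exhaustive search chosen only for readability.

```python
# Toy bi-level optimisation. Every action, signal, and number here is
# invented for illustration; none of it comes from the paper.

# Grounded signals (scaled to [0, 1]) that each hypothetical action produces.
ACTIONS = {
    "late_night_workout": {"steps": 0.9, "sleep": 0.2},
    "evening_walk": {"steps": 0.6, "sleep": 0.7},
    "early_bedtime": {"steps": 0.1, "sleep": 0.9},
}


def reward(weights, signals):
    """Reward function: a weighted sum of grounded signals."""
    return sum(weights[name] * signals[name] for name in weights)


def inner_optimise(weights):
    """Lower level: the agent maximises the current reward function."""
    return max(ACTIONS, key=lambda action: reward(weights, ACTIONS[action]))


def user_feedback(action):
    """Top-level goal: a stand-in for a user's satisfaction rating.

    This imaginary user wants to be both active and well rested, so the
    rating is limited by whichever signal is weaker.
    """
    signals = ACTIONS[action]
    return min(signals["steps"], signals["sleep"])


def outer_optimise(candidates):
    """Upper level: keep the reward weights whose optimised behaviour
    the user rates highest."""
    best_weights, best_score = None, float("-inf")
    for weights in candidates:
        score = user_feedback(inner_optimise(weights))
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, inner_optimise(best_weights), best_score


# Candidate reward functions: different trade-offs between the two signals.
CANDIDATES = [{"steps": 1.0 - s, "sleep": s} for s in (0.0, 0.25, 0.5, 0.75, 1.0)]

if __name__ == "__main__":
    weights, action, score = outer_optimise(CANDIDATES)
    print(weights, action, score)
```

In this toy setup the user never specifies the reward weights directly. A single rating per candidate is enough to select the reward function, while the agent does the detailed optimisation against grounded signals.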

Why now? The authors look back at history. Earlier reinforcement learning systems (such as AlphaZero, which taught itself chess and Go through self-play) could discover fundamentally new strategies on their own (Why Now? · "AlphaZero discovered fundamentally new strategies for chess"), but they could only operate in closed simulators (like a game board) and couldn't step into the complex real world. Later, everyone found human data to be incredibly convenient, so they all went to work on large models, losing the soul of "autonomously discovering knowledge" in the process. Now, with large models as our foundation and tools that can interact with the real world, it's time to combine the two, pick back up the treasures of classic reinforcement learning (such as value functions, exploration, and world models), and pave the way to truly superhuman intelligence (Reinforcement Learning Methods · "pave the way to truly superhuman intelligence").

Of course, the authors also honestly discuss the costs. AI venturing out on its own will indeed bring safety risks, and once it invents non-human ways of thinking, humans will find it increasingly difficult to understand (Consequences · "may also make future AI systems harder to interpret"). However, experiential learning also brings unique safety benefits: First, it can adjust itself based on environmental changes just like a living person—if hardware breaks or society changes, it can navigate around it on its own. Second, its reward mechanism can be continuously fine-tuned through trial and error; if it senses humans are unhappy, it can stop in time, rather than foolishly turning the entire Earth into paperclips (Consequences · "there is no guarantee of perfect alignment"). Third, conducting experiments in the physical world takes time (such as clinical trials for new drug development), and this physical time constraint will act as a natural brake on AI's self-evolution (Consequences · "inherently constrained by the time it takes to").

Glossary

Supervised Learning: A common way for AI to learn, much like a student memorizing books while looking at standard answers. AI learns how to speak and act like a human by imitating data such as text and code written by humans. - Anchor: The Era of Human Data · "The pace of progress driven solely by supervised learning"
Experience: Rather than reading static books (static data), AI generates a sequence of dynamic data through its own interactions with the environment (such as taking actions, observing results, making mistakes, and adjusting). - Anchor: The Era of Experience · "experience will become the dominant medium of improvement"
Reinforcement Learning / RL: A "trial-and-error" learning mechanism. AI gropes around in the environment on its own, receiving rewards when it does something right and punishments when it does something wrong, thereby continuously optimizing its behavior to obtain more rewards. - Anchor: The Era of Experience · "AlphaProof’s reinforcement learning (RL) algorithm subsequently generated"
Grounded Rewards: Direct feedback signals from the objective environment (such as heart rate, exam scores, or physical sensor values), rather than subjective judgments pulled out of thin air by humans. - Anchor: Rewards · "Relying on human prejudgement in this manner usually leads"
Bi-level Optimisation: A two-level nested optimization method. The upper level is guided by human macro-level feedback (such as "make me healthier") to point the direction, while the lower level is automatically optimized by the AI for specific signals in the environment (such as sleep and heart rate) to achieve that direction. - Anchor: Rewards · "optimises user feedback as the top-level goal"
World Model: A simulator of the laws of the real world in the AI's mind. With it, before taking action, the AI can first deduce in its mind: "If I do this, how will the world change? What reward will I get?" - Anchor: Planning and Reasoning · "predicts the consequences of the agent’s actions upon the world"
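The World Model entry can be made concrete with a small planning sketch. It assumes a hypothetical one-dimensional world with a goal at position 4, and the model here is hand-written rather than learned, so it only illustrates the idea of imagining the consequences of actions (next state and reward) before acting.

```python
from itertools import product

# Hypothetical toy world: positions 0..4 on a line, with the goal at 4.
GOAL = 4
ACTIONS = (-1, 1)  # step left or step right


def world_model(state, action):
    """Predict (next_state, reward) for an action.

    A real agent would learn this from experience; here it is hand-written.
    The reward is 1.0 whenever the predicted next state is the goal.
    """
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def imagine(state, plan, discount=0.9):
    """Roll a candidate plan forward inside the model only, returning
    the predicted discounted return."""
    total, factor = 0.0, 1.0
    for action in plan:
        state, reward = world_model(state, action)
        total += factor * reward
        factor *= discount
    return total


def plan_action(state, horizon=3):
    """Try every action sequence of length `horizon` in imagination and
    return the first action of the best one."""
    best_plan = max(
        product(ACTIONS, repeat=horizon),
        key=lambda plan: imagine(state, plan),
    )
    return best_plan[0]


if __name__ == "__main__":
    print(plan_action(2))
```

Exhaustive search over action sequences only works for tiny problems like this one. It is used here because it keeps the "deduce first, then act" loop visible in a few lines.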

Before and After This Paper

Before this paper, the AI field had almost entirely shifted to a "human-centric" paradigm over the past few years (such as Large Language Models, LLMs). Everyone took it for granted that as long as we fed AI all the data written by humans on the entire internet, and then had human experts score and align the AI's answers (RLHF), we could build the smartest AI. In this process, classic reinforcement learning methods (such as world models for AI to deduce in its own mind, and mechanisms for autonomously exploring new behaviors) were marginalized because the shortcut of imitating humans was simply too effective. - Anchor: Reinforcement Learning Methods · "The rise of human-centric LLMs, however, shifted the focus"

This paper, however, makes a pointed claim: relying solely on imitating humans is almost at its end, because high-quality human data is on the verge of depletion, and imitating humans can never allow AI to generate new discoveries that surpass current human cognition. The authors advocate that AI must step into the "Era of Experience," reactivating and upgrading classic reinforcement learning methods, allowing AI to learn autonomously in long-term "experience streams" interacting with the real world. This directly challenges the current AI R&D roadmap that heavily relies on human data and subjective human judgment, pointing to an entirely new direction toward superhuman intelligence. - Anchor: Reinforcement Learning Methods · "pave the way to truly superhuman intelligence"

Most Worth-Reading Passages from the Original Text

The crisis discourse on "human data depletion" - Anchor: The Era of Human Data · "The majority of high-quality data sources - those that can actually" - Why it's worth reading: This passage hits the nail on the head regarding the hidden worry behind the current boom of large models—the "data wall." It shatters the illusion that "as long as we pile up data, AI can get infinitely smarter" in plain language, serving as the starting point for the entire paper's argument.
The counterintuitive argument that "subjective human judgment sets limits on AI" - Anchor: Rewards · "Relying on human prejudgement in this manner usually leads to an" - Why it's worth reading: This part of the argument is brilliant and highly counterintuitive. We usually think of human feedback (RLHF) as a magic potion to make AI smarter, but the authors point out that if AI can only please human evaluators, it will never discover those great strategies that transcend human understanding.
The argument that "AI must be grounded to break the echo chamber of thought" - Anchor: Planning and Reasoning · "Without this grounding, an agent, no matter how sophisticated, will" - Why it's worth reading: Here, the authors draw on the evolution of human scientific history (from animism to quantum mechanics) to explain why an AI that only spins around in human words will turn into a "bias parrot." To obtain truth, AI must collide with the physical world just like a scientist.
The safety perspective that "physical time is a natural brake on AI's self-evolution" - Anchor: Consequences · "inherently constrained by the time it takes to execute" - Why it's worth reading: When discussing the risk of AI spiraling out of control, this is a very pragmatic and reassuring perspective. It reminds us that even if AI's intelligence explodes, as long as it needs to interact with the physical world (such as conducting chemistry experiments or waiting for crops to grow), it is bound by how long real-world processes take and cannot complete unlimited self-evolution instantly inside a virtual world.

Resonances with past episodes

Corroboration → Representation Learning and Predictive World Models · Saining Xie
Both point out that human language should not be viewed as the ultimate vehicle for intelligence or thought, and that over-reliance on human language limits the development of artificial intelligence. Agents should explore non-linguistic, more efficient underlying computational and thinking mechanisms.
This: Planning and Reasoning · "discover or improve such approaches by learning how to" Agents can use non-human languages (such as symbolic, distributed, or continuous computation) to discover or improve more efficient thinking mechanisms, without being limited to mimicking human chains of thought.
Related: [03:53:57 - 04:10:06] Language is a highly compressed communication tool developed by humans, rather than a direct mapping of thought or decision-making; therefore, relying solely on language models creates a "crutch" that limits the development of true intelligence.
Complement → Representation Learning and Predictive World Models · Saining Xie
Both advocate shifting the focus of artificial intelligence research from human-specific symbolic and textual interactions (such as writing code or chatting) to the more challenging embodied intelligence that autonomously interacts and survives in the real world.
This: Actions and Observations · "act autonomously in the real world" Agents will have richer action and observation spaces to interact autonomously in the real or digital world, rather than being limited to human-privileged formats (such as pure text dialogue).
Related: [06:07:44 - 06:13:49] AGI is a false premise... human intelligence is a highly specialized intelligence... building the intelligence of a squirrel is the real challenge
Extension → The Core Algorithm of AlphaGo · Eric Jang
The former points out the limitations of searching and reasoning in a discrete and combinatorially massive language space, while the latter proposes that agents can use non-human languages (such as continuous or distributed computation) to discover more efficient thinking mechanisms, providing a potential path to overcome the search bottleneck in language space.
This: Planning and Reasoning · "discover or improve such approaches by learning how to" Agents can use non-human languages (such as symbolic, distributed, or continuous computation) to discover or improve more efficient thinking mechanisms, without being limited to mimicking human chains of thought.
Related: [01:47:45 - 01:50:32] Although MCTS is powerful in Go, applying it directly to open-ended domains like large language model reasoning is difficult. The action space of language is combinatorially larger and less discrete, making exploration heuristics like PUCT ineffective, and defining a reliable intermediate value function for search truncation is also much harder.
Corroboration → The Core Algorithm of AlphaGo · Eric Jang
The former emphasizes the importance of breaking free from human prejudgement and relying on embodied feedback from the environment, while the latter's explanation of the AlphaGo self-play mechanism is precisely about exploring Go strategies that surpass humans without human guidance, through deterministic win/loss feedback (grounded rewards) provided by the game rules.
This: Rewards · "Relying on human prejudgement in this manner usually leads" Relying on rewards from human prejudgement sets an insurmountable ceiling on agent performance, whereas embodied/grounded rewards from the environment allow agents to discover new strategies that transcend existing human knowledge.
Related: [01:02:46 - 01:03:17] The system improves through a self-play reinforcement learning loop, where MCTS acts as a "policy improvement operator". For any given board state, MCTS performs a deep search to generate a better, more confident move distribution than the policy network's initial guess. The policy network is then trained to directly predict this improved distribution.
← How Children Learn and What AI Actually Is · Alison Gopnik
Gopnik diagnoses the limitation (AI as text-only cultural technology, 'Derrida's revenge'); the Era of Experience paper proposes the engineering path out: agents generating their own experiential data through active environmental interaction.
← Surface Data vs. Deep Data · Alexei Efros
Both argue that agents must actively interact with environments to generate experiential data. Efros's deep data flywheel and Silver/Sutton's 'era of experience' are the same thesis from different traditions; Efros even cites Sutton's newest paper as 'very much agreeing with us'.
Corroboration ← World Models and Real-World Intelligence · Yann LeCun
Both believe that being limited to pure text dialogue cannot generate true general intelligence; AI must break free from a single text modality and enter the real physical world, which contains richer actions and observations, to interact.
This: Actions and Observations · "act autonomously in the real world" Agents will have richer action and observation spaces, interacting autonomously in the real or digital world, rather than being limited to human-privileged forms (such as pure text dialogue).
Related: [10:50 - 11:52] Text-only training cannot bring about true human-level intelligence, because pure text lacks the physical world mapping of embodied common sense.
Complement ← The Nature of Reality, Dreams, and Consciousness · Joscha Bach
DeepMind's experiential AI thesis (agents must model the world through long, continuous streams of experience) directly echoes Bach's claim that consciousness is the real-time simulation self-model built to minimize prediction error.
This: Experiential agents will exist in long, uninterrupted streams of experience (not brief single interactions), acting autonomously in real or digital worlds, which closely matches Bach's claim that consciousness requires continuous experiential modeling to exist.
Related: [20:00-25:32] Humans perceive not atoms but a 3D game engine generated by the brain to minimize surprise; consciousness lives inside this simulated story, not in physical reality at the atomic level.
Continuation ← Unified Intelligence and Physical World Simulators · Amit Jain
Both point out that the evolution of artificial intelligence is moving beyond simple text interaction and must integrate spatiotemporal perception of the physical world and autonomous interaction capabilities to achieve true end-to-end multimodal work.
This: Actions and Observations · "act autonomously in the real world" Agents will have richer action and observation spaces, interacting autonomously in the real or digital world, rather than being limited to human-privileged forms (such as pure text dialogue).
Related: [19:18-19:54] The AI competition in 2026 has moved beyond simple text or video generation toward "end-to-end multimodal work," which requires models to simultaneously possess language reasoning capabilities and spatiotemporal perception of the physical world.
Isomorphism ← Unified Intelligence and Physical World Simulators · Amit Jain
Both point out that the practical future of intelligence lies not in single, static outputs (such as single image generation or a single Q&A), but in understanding and acting within long-term, temporally consistent continuous streams of interaction.
This: Streams · "An experiential agent can continue to learn throughout a lifetime" Experiential agents will exist in long-term, uninterrupted streams of experience, rather than brief, single interaction segments.
Related: [54:35-56:15] The biggest bottleneck for visual models to become general-purpose and practical lies in "intelligence" (including multi-turn interaction capabilities, temporal consistency, and physical causal understanding), rather than pure pixel generation aesthetics.
Extension ← The Physical Foundations of Visual Intelligence and the Multimodal Flywheel · Andreas Blattmann
Both emphasize the core role of 'grounding' in breaking through the limitations of single modalities or pure human data, pointing out that feedback loops must be provided through cross-modal correlation or real physical interaction to deepen the true understanding of physical laws.
This: Planning and Reasoning · "grounding provides a feedback loop, allowing the agent to" Agents must test and overturn incorrect cognitive assumptions inherited from human data by interacting with the real world (grounding), avoiding becoming an 'echo chamber' of existing knowledge.
Related: [16:01-17:48] Cross-modal correlation can generate compounding effects for multimodal models and deepen their understanding of the physical world. For example, by training on images, video, and audio simultaneously through the Self-Flow framework, the model can observe strong correlations between object collisions (actions) and sounds (noise). This physical grounding is unattainable for unimodal models.
Supplements ← Frontier Systems Compute and the Context Loop War · Anjney Midha
The two complement each other mechanistically: the former points out that the success or failure of reinforcement learning depends on the verifiability of the domain itself, while the latter theoretically explains why—only by breaking free from human subjective prejudgment and relying on grounded feedback from the objective environment can agents break through bottlenecks and achieve true exponential self-improvement.
This: Rewards · "Relying on human prejudgement in this manner usually leads" Relying on rewards from human prejudgment sets an insurmountable ceiling on agent performance, whereas grounded rewards from the environment allow agents to discover new strategies that surpass existing human knowledge.
Related: [38:39-39:35] The pace of progress in reinforcement learning (RL) at the frontier is directly proportional to the verifiability of the domain. In domains with clear unit tests or physical metrics like code and materials science, AI can achieve exponential self-improvement; however, in hard-to-verify domains like aesthetics and creative writing, it easily falls into mediocrity and hallucination.
Builds on ← Action-Level Mental Model Dataset for Agent Collaboration · Jiaju Chen
The former points out that current agent designs are limited to independent, single-turn task execution, thus lacking process-level collaborative data; in the context of human-agent interaction, this builds on the latter's evolutionary direction that agents must break free from brief, single-turn interaction snippets and move toward a 'long-term and continuous stream of experience.'
ThisStreams · "An experiential agent can continue to learn throughout a lifetime" Agents in the era of experience will exist in long-term, continuous streams of experience, rather than brief, single-turn interaction snippets.
Related1. Introduction · "primarily optimized for task completion" Current LLM agent designs are primarily optimized for independent task completion, leading to a lack of process-level collaborative data in academia.
Complement ← Agent Memory: System Characterization and Implications for Long-Horizon Workloads · Yasmine Omri
Silver and Sutton propose that future agents will exist in long-term, uninterrupted streams of experience for continuous learning. This study adds quantitative support at the system level, pointing out that when agents digest such uninterrupted streams of experience, the energy consumed by memory construction dominates the agent's lifecycle.
This: Streams · "An experiential agent can continue to learn throughout a lifetime" Agents in the experiential era will exist in long-term, uninterrupted streams of experience, rather than brief, single interaction episodes.
Related: 4.2. Construction Dominates the Agent Lifecycle · "exceeds total query-phase energy across 300" For LLM-mediated memory systems, memory construction accounts for the vast majority of the energy consumed over the agent's lifecycle.

Tensions with past episodes

Contradiction (direct conflict) ← World Models and Real-World Intelligence · Yann LeCun
On how to achieve superhuman intelligence, LeCun believes that reinforcement learning should be marginalized because of its extremely low sample efficiency, and that the core should rely on self-supervised observation to build world models. Silver, by contrast, views reinforcement learning, driven by streams of experience and rewards grounded in the environment, as the fundamental path to reaching and surpassing human intelligence.
This: Reinforcement Learning Methods · "pave the way to truly superhuman intelligence" The arrival of the era of experience provides an opportunity to re-examine and improve classical reinforcement learning concepts (such as value functions, exploration, world models, and temporal abstraction), thereby paving the way to truly superhuman intelligence.
Related: [52:21 - 52:58] Reinforcement learning is extremely inefficient, and its use should be minimized. Most learning should go into building a world model through observation, with reinforcement learning applied only at the top level, once excellent representations have been obtained.
Contrast (apparent tension) ← Unified Intelligence and Physical World Simulators · Amit Jain
The former argues that passive observation of video data is sufficient to train AI to understand and simulate the physical world, whereas the latter emphasizes that relying solely on static observational data leads to a knowledge echo chamber, and that agents must correct their hypotheses through active interaction with, and feedback from, the real world.
This: Planning and Reasoning · "grounding provides a feedback loop, allowing the agent to" Agents must test and overturn incorrect cognitive assumptions inherited from human data through interaction with the real world (grounding), avoiding becoming an "echo chamber" of existing knowledge.
Related: [08:05-08:28] Video contains physical laws of space (2D) and time (1D), serving as an important medium for the human brain to understand 3D physical representations; therefore, learning through video can effectively train AI's understanding and simulation of the physical world.
Contradiction (direct conflict) ← Human Data and Robotics' GPT-3 Moment · Danfei Xu
The former believes that direct supervised imitation of human data (behavior cloning) is sufficient to solve robot control problems, whereas the latter argues that supervised learning that imitates human data has a hard ceiling and that true breakthroughs in intelligence must come from reinforcement learning and environmental interaction.
This: The Era of Human Data · "The pace of progress driven solely by supervised learning" Progress driven solely by supervised learning on human data is slowing down and the human data dividend is reaching its limits; it cannot lead agents to superhuman intelligence, which requires reinforcement learning in the environment.
Related: [40:40 - 40:47] Direct behavior cloning (BC) is enough to make the robot work, with no need for reinforcement learning (RL); a past academic bias toward reinforcement learning obscured the effectiveness of behavior cloning.

A faithful reading and plain-language retelling of the paper, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.