The Era of Experience: Reinforcement Learning Beyond Human Data · David Silver

2026-06-06 · A faithful, transcript-grounded reading by PodLens

Source paper: https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf

experience streams · grounded rewards · bi-level optimisation · self-play · world models

What This Paper Is About

This paper is written by David Silver and Richard S. Sutton (the text is a preprint of a chapter in the forthcoming book Designing an Intelligence to be published by MIT Press). The paper explores how artificial intelligence (AI) is at a critical turning point, transitioning from the "Era of Human Data" to the "Era of Experience". The authors point out that although current AI (such as Large Language Models, or LLMs) has achieved immense success by training on massive amounts of human-generated data, this approach of relying on imitating humans is approaching its limits in many important domains. To achieve superhuman intelligence, AI must shift to an entirely new paradigm—the "Era of Experience"—where agents learn primarily by continuously interacting with their environment, autonomously generating experience, and utilizing reinforcement learning (RL) algorithms. The paper elaborates on the four core characteristics of the Era of Experience (streams, actions and observations, rewards, planning and reasoning), argues that current technology already possesses the foundation to realize this transition, and explores the potential societal impacts and safety benefits brought by this paradigm.

Paper Skeleton

Problem Solved: The limitations of the Era of Human Data. Progress driven by supervised learning on human data is slowing down as high-quality data is on the verge of depletion, and it cannot generate new insights that go beyond current human understanding.
Anchor: The Era of Human Data · "The pace of progress driven solely by supervised learning"
Core Claim: AI is entering the Era of Experience, where agents will learn continuously primarily through experience generated by their own interactions with the environment, thereby breaking through the limitations of human-centric AI systems to achieve superhuman capabilities.
Anchor: The Era of Experience · "experience will become the dominant medium of improvement"
Argumentation Style: The paper follows the argumentative chain of a position paper. It first identifies the bottleneck of human data through logical reasoning, then demonstrates the feasibility of experiential learning through recent case studies (AlphaProof's performance at the International Mathematical Olympiad and DeepSeek-R1's use of reinforcement learning). The authors then build a framework for the Era of Experience across four dimensions (streams, actions and observations, rewards, planning and reasoning), argue for its feasibility by drawing on classic reinforcement learning (RL) concepts, and finally weigh the consequences of this paradigm for safety and society.
Key Evidence & Examples:
AlphaProof: The first program to reach the silver medal standard at the International Mathematical Olympiad. Starting from about 100,000 human-written formal proofs, its reinforcement learning algorithm generated roughly 100 million more by interacting with a formal proof system, exploring mathematical possibilities beyond human preconceptions.
- Anchor: The Era of Experience · "AlphaProof’s reinforcement learning (RL) algorithm subsequently generated"
DeepSeek: Its recent work (DeepSeek-R1) demonstrates the power of reinforcement learning; without explicit teaching, and solely by providing the correct incentives, the model autonomously develops advanced problem-solving strategies.
- Anchor: The Era of Experience · "autonomously develops advanced problem-solving strategies"
Historical Reinforcement Learning Systems: AlphaZero, for example, discovered fundamentally new strategies in chess and Go through self-play, demonstrating that agents can discover knowledge on their own and that the approach scales.
- Anchor: Why Now? · "AlphaZero discovered fundamentally new strategies for chess"
Acknowledged Boundaries & Limitations:
Physical Time Constraints: Progress relying on physical experience is inherently constrained by the actual time it takes to execute actions and observe results in the real world, which cannot be achieved overnight.
- Anchor: Consequences · "inherently constrained by the time it takes to"
Reduced Interpretability: Moving away from human data and human thought patterns may make future AI systems harder for humans to understand and interpret.
- Anchor: Consequences · "may also make future AI systems harder to interpret"
No Absolute Guarantee of Alignment: Although reward functions can be adjusted through experience and bi-level optimization to correct biases, there is still no absolute guarantee of perfect alignment with human goals.
- Anchor: Consequences · "there is no guarantee of perfect alignment"

Core Arguments List

The human data dividend is facing its limits and cannot lead AI to superhuman intelligence. - Anchor: The Era of Human Data · "The pace of progress driven solely by supervised learning" - Type: Claim - Author's Reservations: None. The authors explicitly point out that the pace of progress driven solely by supervised learning is slowing down, and human data cannot capture new scientific breakthroughs that go beyond current human understanding.
Agents in the Era of Experience will exist in long-term, uninterrupted streams of experience, rather than brief, single-interaction episodes. - Anchor: Streams · "An experiential agent can continue to learn throughout a lifetime" - Type: Definition - Author's Reservations: None.
Agents will have richer action and observation spaces, interacting autonomously in the real or digital world, rather than being limited to human-privileged formats (such as pure text dialogue). - Anchor: Actions and Observations · "act autonomously in the real world" - Type: Claim - Author's Reservations: None.
Relying on rewards based on human prejudgement sets an insurmountable ceiling on agent performance, whereas grounded rewards from the environment allow agents to discover new strategies that go beyond existing human knowledge. - Anchor: Rewards · "Relying on human prejudgement in this manner usually leads" - Type: Claim - Author's Reservations: None.
A bi-level optimisation process can combine the user's macro-level goals with grounded signals from the environment, allowing the reward function to be flexibly adjusted and alignment biases corrected under user guidance. - Anchor: Rewards · "optimises user feedback as the top-level goal" - Type: Claim - Author's Reservations: The authors point out that this is merely "sketching one possible way in which these requirements might be met" and acknowledge that other viable approaches may exist.
Agents can utilize non-human languages (such as symbolic, distributed, or continuous computation) to discover or improve more efficient thinking mechanisms, without being limited to imitating human chains of thought. - Anchor: Planning and Reasoning · "discover or improve such approaches by learning how to" - Type: Prediction - Author's Reservations: None.
Agents must interact with the real world (grounding) to test and overturn erroneous thinking assumptions inherited from human data, avoiding becoming an "echo chamber" of existing knowledge. - Anchor: Planning and Reasoning · "grounding provides a feedback loop, allowing the agent to" - Type: Claim - Author's Reservations: None.
The arrival of the Era of Experience provides an opportunity to revisit and improve classic reinforcement learning concepts (such as value functions, exploration, world models, and temporal abstraction), thereby paving the way to truly superhuman intelligence. - Anchor: Reinforcement Learning Methods · "pave the way to truly superhuman intelligence" - Type: Claim - Author's Reservations: None.
Although experiential learning will increase certain safety risks, it also brings unique safety benefits, such as the agent's ability to dynamically adapt to environmental changes and the incremental correction of its reward function through experience. - Anchor: Consequences · "experiential learning will increase certain safety risks" - Type: Claim - Author's Reservations: The authors acknowledge that the transition to the Era of Experience indeed requires further research to ensure safety.

Plain English Retelling

What is this paper wrestling with? Simply put, it's wrestling with the idea that "AI can only get smarter by imitating humans." Current AI (like various Large Language Models, LLMs) is indeed very impressive—it can write poetry and code—but it is all trained by "reading data written by humans." The authors point out that this trick of relying on imitating humans is almost at its end, because high-quality human data is about to be sucked dry. If we want AI to possess "superhuman intelligence" that surpasses humans, we must let AI live differently—transitioning from the "Era of Human Data" into the "Era of Experience." In other words, AI can no longer just be a "bookworm." Instead, it needs to be like a child, interacting with the world and exploring on its own, learning from its own firsthand "experience" through "reinforcement learning (RL)."

The paper's logic unfolds like this: First, the authors point out the dead end of the "Era of Human Data." Relying solely on imitating humans, the pace of AI's progress has noticeably slowed down (The Era of Human Data · "The pace of progress driven solely by supervised learning"). More importantly, those new scientific theorems and technologies that humans haven't discovered yet simply do not exist in human data, so how could AI learn them just by reading books?

Therefore, AI must shift to the "Era of Experience," generating its own data through interaction with the environment and using this as the engine for continuous progress (The Era of Experience · "experience will become the dominant medium of improvement"). The authors give two recent examples. One is AlphaProof, which tackles Olympiad math problems. It started from only about 100,000 human-written formal proofs, then interacted with a formal proof system on its own and generated roughly 100 million more, exploring mathematical solutions that humans had never even thought of (The Era of Experience · "AlphaProof’s reinforcement learning (RL) algorithm subsequently generated"). The other is DeepSeek-R1, which shows the power of reinforcement learning: we don't need to teach it step-by-step how to think; as long as we give it the right incentives (like giving it a candy when it gets it right), it can work out advanced problem-solving strategies on its own (The Era of Experience · "autonomously develops advanced problem-solving strategies").

Next, the authors describe the four core characteristics of the "Era of Experience." Each one departs from how today's AI systems are typically built and used:

First, "Streams". Current AI is like a "goldfish brain"—you ask a question, it answers, and once the conversation ends, it forgets everything. But an AI in the Era of Experience possesses a long-term, uninterrupted "stream of experience" just like a human (Streams · "An experiential agent can continue to learn throughout a lifetime"). It can live a lifetime, remember past lessons, and plan every current step for long-term goals months or even years down the road (such as helping a user condition their body, learning a new language, or researching new materials), even if that step seems to have no immediate benefit right now.

Second, "Actions and Observations". Current AI is trapped in a chat box, only able to engage in word battles with humans. Future AI will be like animals, possessing rich action and observation capabilities, able to act autonomously in the real or digital world (Actions and Observations · "act autonomously in the real world"). For example, calling APIs on its own, using human computer interfaces by itself, or even using robotic arms to conduct experiments in a lab. It will no longer just listen to humans, but will be able to explore the world on its own.

Third, "Rewards". This is the most counterintuitive part: the authors argue that relying on humans to score AI (judging good or bad in advance) actually holds AI back. If a strategy is something even human experts cannot understand, humans will give it a low score, which puts a ceiling on the AI's capabilities (Rewards · "Relying on human prejudgement in this manner usually leads"). To surpass humans, the AI's rewards must come from the environment itself (such as whether heart rate has improved, whether it scored points on an exam, or whether a material in a physics simulator is sturdy). So how do we ensure AI doesn't spiral out of control? The authors propose a concept of "bi-level optimisation": the top level is driven by feedback from human users to determine the general direction, while at the bottom level the AI automatically optimises specific signals from the environment (Rewards · "optimises user feedback as the top-level goal"). This way, it can follow the user's direction while still exploring autonomously.
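The two-level idea in the "Rewards" paragraph can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' method: the plan names, the signal values, the weighted-sum reward, and the `user_feedback` stand-in are all hypothetical. The lower level picks the behaviour that maximises the current reward function over grounded signals; the upper level picks the reward function whose resulting behaviour the user rates highest.

```python
from typing import Dict, List

Signals = Dict[str, float]
Weights = Dict[str, float]

# Hypothetical grounded signals (normalised to 0..1) that each plan would
# produce in the environment. These numbers are invented for illustration.
PLANS: Dict[str, Signals] = {
    "late_night_gym": {"steps": 0.9, "sleep": 0.2},
    "evening_walk": {"steps": 0.6, "sleep": 0.8},
    "rest_day": {"steps": 0.1, "sleep": 0.9},
}


def reward(weights: Weights, signals: Signals) -> float:
    """Reward function: a weighted sum of grounded signals."""
    return sum(weights[name] * signals[name] for name in weights)


def lower_level(weights: Weights) -> str:
    """Lower level: the agent picks the plan that maximises the current reward."""
    return max(PLANS, key=lambda plan: reward(weights, PLANS[plan]))


def user_feedback(plan: str) -> float:
    """Stand-in for the user, who (by assumption) wants both activity and sleep."""
    signals = PLANS[plan]
    return min(signals["steps"], signals["sleep"])


def upper_level(candidates: List[Weights]) -> Weights:
    """Upper level: keep the reward weights whose resulting plan the user rates highest."""
    return max(candidates, key=lambda w: user_feedback(lower_level(w)))


# Candidate reward functions: trade step count off against sleep.
candidates = [{"steps": 1.0 - s, "sleep": s} for s in (0.0, 0.25, 0.5, 0.75, 1.0)]

naive_plan = lower_level(candidates[0])  # reward that counts steps only
best_weights = upper_level(candidates)   # reward corrected via user feedback
aligned_plan = lower_level(best_weights)
print(naive_plan, "->", aligned_plan)
```

With these made-up numbers, a reward that counts only steps selects `late_night_gym`, while the reward chosen through user feedback selects `evening_walk`. The user never scores individual actions; they only steer which grounded signals the agent optimises.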

Fourth, "Planning and Reasoning". Current AI uses human language (like "chain of thought") when thinking. But is human language really the smartest way to think? Not necessarily. AI could instead use non-human languages, such as symbolic, distributed, or continuous computation, to invent more efficient thinking mechanisms on its own (Planning and Reasoning · "discover or improve such approaches by learning how to"). More importantly, AI must test its ideas against the real world. If it only spins around in human data, AI will become a "bias parrot," inheriting all of humanity's biases and mistakes. Only by running into reality and seeing what actually happens can AI overturn those incorrect assumptions (Planning and Reasoning · "grounding provides a feedback loop, allowing the agent to").

Why now? The authors look back at history. Previous reinforcement learning systems (such as AlphaZero, which mastered chess and Go through self-play) could discover fundamentally new strategies on their own (Why Now? · "AlphaZero discovered fundamentally new strategies for chess"), but they could only play in closed simulators (like a chessboard) and couldn't step into the complex real world. Later, everyone found human data to be incredibly convenient, so they all went to work on large models, losing the soul of "autonomously discovering knowledge" in the process. Now, with large models as our foundation and tools that can interact with the real world, it's time to combine the two, pick back up the treasures of classic reinforcement learning (such as value functions, exploration, and world models), and pave the way to truly superhuman intelligence (Reinforcement Learning Methods · "pave the way to truly superhuman intelligence").

Of course, the authors also honestly discuss the costs. AI venturing out on its own will indeed bring safety risks, and once it invents non-human ways of thinking, humans will find it increasingly difficult to understand (Consequences · "may also make future AI systems harder to interpret"). However, experiential learning also brings unique safety benefits: First, it can adjust itself based on environmental changes just like a living person—if hardware breaks or society changes, it can navigate around it on its own. Second, its reward mechanism can be continuously fine-tuned through trial and error; if it senses humans are unhappy, it can stop in time, rather than foolishly turning the entire Earth into paperclips (Consequences · "there is no guarantee of perfect alignment"). Third, conducting experiments in the physical world takes time (such as clinical trials for new drug development), and this physical time constraint will act as a natural brake on AI's self-evolution (Consequences · "inherently constrained by the time it takes to").

Glossary

Supervised Learning: A common way for AI to learn, much like a student memorizing books while looking at standard answers. AI learns how to speak and act like a human by imitating data such as text and code written by humans. - Anchor: The Era of Human Data · "The pace of progress driven solely by supervised learning"
Experience: Rather than reading static books (static data), AI generates a sequence of dynamic data through its own interactions with the environment (such as taking actions, observing results, making mistakes, and adjusting). - Anchor: The Era of Experience · "experience will become the dominant medium of improvement"
Reinforcement Learning / RL: A "trial-and-error" learning mechanism. AI explores the environment on its own, receiving rewards when it does something right and punishments when it does something wrong, thereby continuously optimizing its behavior to obtain more rewards. - Anchor: The Era of Experience · "AlphaProof’s reinforcement learning (RL) algorithm subsequently generated"
Grounded Rewards: Direct feedback signals from the objective environment (such as heart rate, exam scores, or physical sensor values), rather than judgments made by humans in advance without seeing the consequences of an action. - Anchor: Rewards · "Relying on human prejudgement in this manner usually leads"
Bi-level Optimisation: A two-level nested optimization method. The upper level is guided by human macro-level feedback (such as "make me healthier") to point the direction, while the lower level is automatically optimized by the AI for specific signals in the environment (such as sleep and heart rate) to achieve that direction. - Anchor: Rewards · "optimises user feedback as the top-level goal"
World Model: A simulator of the laws of the real world in the AI's mind. With it, before taking action, the AI can first deduce in its mind: "If I do this, how will the world change? What reward will I get?" - Anchor: Planning and Reasoning · "predicts the consequences of the agent’s actions upon the world"
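The "World Model" entry above can be made concrete with a small sketch. It is a toy under stated assumptions, not anything specified in the paper: the five-cell corridor, the reward rule, and every function name are hypothetical. The agent first builds a tabular model from its own experience, then plans purely inside that model, asking "if I do this, what happens and what reward do I get?" before acting.

```python
from typing import Dict, Tuple

State = int
Action = int
Model = Dict[Tuple[State, Action], Tuple[State, float]]

ACTIONS: Tuple[Action, ...] = (-1, +1)  # step left or step right
GOAL: State = 4                          # hypothetical corridor with cells 0..4


def real_step(state: State, action: Action) -> Tuple[State, float]:
    """The real environment: reward 1.0 whenever a step ends on the goal cell."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def learn_model() -> Model:
    """Build a tabular world model by trying every action in every cell once."""
    return {(s, a): real_step(s, a) for s in range(GOAL + 1) for a in ACTIONS}


def imagined_return(model: Model, state: State, depth: int) -> float:
    """Best total reward reachable within `depth` imagined steps."""
    if depth == 0:
        return 0.0
    outcomes = (model[(state, a)] for a in ACTIONS)
    return max(r + imagined_return(model, nxt, depth - 1) for nxt, r in outcomes)


def plan(model: Model, state: State, depth: int = 4) -> Action:
    """Choose the action whose imagined future looks best; never calls real_step."""
    def score(action: Action) -> float:
        nxt, r = model[(state, action)]
        return r + imagined_return(model, nxt, depth - 1)
    return max(ACTIONS, key=score)


model = learn_model()
print(plan(model, state=1))  # the action chosen from cell 1
```

The design point is the separation of roles: `real_step` is touched only while gathering experience, and `plan` consults nothing but the learned model. In a realistic setting the model would be learned approximately and could be wrong, which is why the paper stresses checking predictions against real outcomes.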

Before and After This Paper

Before this paper, the AI field had almost entirely shifted to a "human-centric" paradigm over the past few years (such as Large Language Models, LLMs). Everyone took it for granted that as long as we fed AI all the data written by humans on the entire internet, and then had human experts score and align the AI's answers (RLHF), we could build the smartest AI. In this process, classic reinforcement learning methods (such as world models that let an AI imagine the consequences of its actions, and mechanisms for autonomously exploring new behaviors) were marginalized because the shortcut of imitating humans was simply too effective. - Anchor: Reinforcement Learning Methods · "The rise of human-centric LLMs, however, shifted the focus"

This paper, however, makes a strong claim: relying solely on imitating humans is nearing its limits, because high-quality human data is on the verge of depletion, and imitating humans can never allow AI to generate new discoveries that surpass current human cognition. The authors advocate that AI must step into the "Era of Experience," reactivating and upgrading classic reinforcement learning methods, allowing AI to learn autonomously in long-term "experience streams" interacting with the real world. This directly challenges the current AI R&D roadmap that heavily relies on human data and subjective human judgment, pointing to an entirely new direction toward superhuman intelligence. - Anchor: Reinforcement Learning Methods · "pave the way to truly superhuman intelligence"

Most Worth-Reading Passages from the Original Text

The crisis discourse on "human data depletion" - Anchor: The Era of Human Data · "The majority of high-quality data sources - those that can actually" - Why it's worth reading: This passage hits the nail on the head regarding the hidden worry behind the current boom of large models—the "data wall." It shatters the illusion that "as long as we pile up data, AI can get infinitely smarter" in plain language, serving as the starting point for the entire paper's argument.
The counterintuitive argument that "subjective human judgment sets limits on AI" - Anchor: Rewards · "Relying on human prejudgement in this manner usually leads to an" - Why it's worth reading: This part of the argument is brilliant and highly counterintuitive. We usually think of human feedback (RLHF) as a magic potion to make AI smarter, but the authors point out that if AI can only please human evaluators, it will never discover those great strategies that transcend human understanding.
The argument that "AI must be grounded to break the echo chamber of thought" - Anchor: Planning and Reasoning · "Without this grounding, an agent, no matter how sophisticated, will" - Why it's worth reading: Here, the authors draw on the evolution of human scientific history (from animism to quantum mechanics) to explain why an AI that only spins around in human words will turn into a "bias parrot." To obtain truth, AI must collide with the physical world just like a scientist.
The safety perspective that "physical time is a natural brake on AI's self-evolution" - Anchor: Consequences · "inherently constrained by the time it takes to execute" - Why it's worth reading: When discussing the risk of AI spiraling out of control, this is a very pragmatic and reassuring perspective. It reminds us that even if AI's intelligence explodes, as long as it needs to interact with the physical world (such as conducting chemistry experiments or waiting for crops to grow), it must obey the physical laws of time in the universe and cannot instantly complete infinite self-evolution in a virtual world.

Resonances with past episodes

Corroboration → Representation Learning and Predictive World Models · Saining Xie
Both point out that human language should not be viewed as the ultimate vehicle for intelligence or thought, and that over-reliance on human language limits the development of artificial intelligence. Agents should explore non-linguistic, more efficient underlying computational and thinking mechanisms.
This paper: Planning and Reasoning · "discover or improve such approaches by learning how to" Agents can use non-human languages (such as symbolic, distributed, or continuous computation) to discover or improve more efficient thinking mechanisms, without being limited to mimicking human chains of thought.
Related episode: [03:53:57 - 04:10:06] Language is a highly compressed communication tool developed by humans, rather than a direct mapping of thought or decision-making; therefore, relying solely on language models creates a "crutch" that limits the development of true intelligence.
Complement → Representation Learning and Predictive World Models · Saining Xie
Both advocate shifting the focus of artificial intelligence research from human-specific symbolic and textual interactions (such as writing code or chatting) to the more challenging embodied intelligence that autonomously interacts and survives in the real world.
This paper: Actions and Observations · "act autonomously in the real world" Agents will have richer action and observation spaces to interact autonomously in the real or digital world, rather than being limited to human-privileged formats (such as pure text dialogue).
Related episode: [06:07:44 - 06:13:49] AGI is a false premise... human intelligence is a highly specialized intelligence... building the intelligence of a squirrel is the real challenge
Extension → The Core Algorithm of AlphaGo · Eric Jang
The episode points out the limitations of searching and reasoning in a combinatorially massive language space, while this paper proposes that agents can use non-human languages (such as continuous or distributed computation) to discover more efficient thinking mechanisms, offering a potential path around the search bottleneck in language space.
This paper: Planning and Reasoning · "discover or improve such approaches by learning how to" Agents can use non-human languages (such as symbolic, distributed, or continuous computation) to discover or improve more efficient thinking mechanisms, without being limited to mimicking human chains of thought.
Related episode: [01:47:45 - 01:50:32] Although MCTS is powerful in Go, applying it directly to open-ended domains like large language model reasoning is difficult. The action space of language is combinatorially larger and less discrete, making exploration heuristics like PUCT ineffective, and defining a reliable intermediate value function for search truncation is also much harder.
Corroboration → The Core Algorithm of AlphaGo · Eric Jang
This paper emphasizes breaking free from human prejudgement and relying on grounded feedback from the environment. The episode's explanation of AlphaGo's self-play mechanism shows exactly this in practice: the system explores Go strategies that surpass humans without human guidance, using the deterministic win/loss feedback (a grounded reward) provided by the game rules.
This paper: Rewards · "Relying on human prejudgement in this manner usually leads" Relying on rewards from human prejudgement sets an insurmountable ceiling on agent performance, whereas grounded rewards from the environment allow agents to discover new strategies that transcend existing human knowledge.
Related episode: [01:02:46 - 01:03:17] The system improves through a self-play reinforcement learning loop, where MCTS acts as a "policy improvement operator". For any given board state, MCTS performs a deep search to generate a better, more confident move distribution than the policy network's initial guess. The policy network is then trained to directly predict this improved distribution.
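
The two search ideas mentioned in the episode excerpts above, the PUCT exploration heuristic and MCTS as a "policy improvement operator", can be sketched as follows. Neither the paper nor the excerpts spell out a formula, so this uses the commonly published AlphaZero-style form as an assumption; the function names, the constant `c_puct`, and the example numbers are illustrative only.

```python
import math
from typing import List


def puct_scores(q: List[float], prior: List[float], visits: List[int],
                c_puct: float = 1.0) -> List[float]:
    """Per-move score: value estimate plus a prior-weighted exploration bonus.

    The bonus grows with total visits to the position and shrinks as the
    individual move is visited more often.
    """
    total_visits = sum(visits)
    return [
        q_a + c_puct * p_a * math.sqrt(total_visits) / (1 + n_a)
        for q_a, p_a, n_a in zip(q, prior, visits)
    ]


def improved_policy(visits: List[int]) -> List[float]:
    """Normalised visit counts: the search's move distribution, used as the training target."""
    total = sum(visits)
    return [n / total for n in visits]


# Hypothetical statistics for a position with two legal moves.
q = [0.1, 0.5]        # average value found so far for each move
prior = [0.5, 0.5]    # policy network's initial guess
visits = [3, 1]       # how often the search has tried each move

scores = puct_scores(q, prior, visits)
next_move = scores.index(max(scores))  # move the search explores next
target = improved_policy(visits)       # distribution the policy network is trained to predict
print(next_move, target)
```

In this sketch the second move is explored next because it has both a higher value estimate and fewer visits. The excerpt's point about language is visible here too: the formula assumes a small, enumerable list of moves with usable priors, which an open-ended text action space does not provide.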

A faithful reading and plain-language retelling of the paper, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.