The Core Algorithm of AlphaGo · Eric Jang

2026-06-04 · A faithful, transcript-grounded reading by PodLens

Original episode:https://www.youtube.com/watch?v=X_ZVSPcZhtw&t=1082s　·　Timestamps are clickable — they seek the player in place

Monte Carlo Tree SearchPolicy and Value NetworksSelf-Play Reinforcement LearningAmortized SearchPolicy Improvement Operator

What This Episode Covers

Guest Eric Jang, a researcher who recently rebuilt a Go-playing AI, explains the core principles of AlphaGo. The episode walks through the rules of Go, the intractability of its game tree for classical search, and how AlphaGo solves this using a combination of Monte Carlo Tree Search (MCTS) and two neural networks (a policy network and a value network). The central mechanism discussed is the self-play reinforcement learning loop, where the MCTS process acts as a "policy improvement operator," generating better move distributions that are then used as training targets to distill the power of search into the neural network. The episode contrasts this sample-efficient approach with the high-variance methods used for LLMs and explores the broader implications for AI research, including the nature of computational complexity and the potential for automated science.

Timeline & Topic Map

[00:00:00 - 00:02:13] Introduction to guest Eric Jang and his sabbatical project of rebuilding and improving AlphaGo.
[00:02:14 - 00:08:03] Explanation of the game of Go, including its rules, objective, and the computer-friendly Tromp-Taylor scoring system.
[00:08:04 - 00:12:22] The intractability of Go's massive game tree and why it was long considered an unsolvable problem for computers.
[00:12:23 - 00:24:23] A breakdown of Monte Carlo Tree Search (MCTS), the core search algorithm, including the UCB1 and PUCT action-selection criteria for balancing exploration and exploitation.
[00:24:24 - 00:31:53] The role of neural networks in making the search tractable: a value network to estimate win probability (pruning depth) and a policy network to suggest good moves (pruning breadth).
[00:31:54 - 00:43:09] Details on the neural network architecture (ResNets vs. Transformers) and the importance of initializing with expert data to bootstrap the learning process.
[00:43:10 - 00:58:51] The four-step MCTS process (selection, expansion, evaluation, backup) as implemented in AlphaGo, integrating the neural networks.
[01:00:23 - 01:07:29] The reinforcement learning (RL) self-play loop: how MCTS is used to generate improved policy targets, which the network then learns to predict, effectively amortizing search into a single forward pass.
[01:07:30 - 01:14:46] Practical training considerations, such as the necessity of an accurate value function for MCTS to be effective and strategies for good initialization.
[01:14:47 - 01:24:24] The philosophical significance of AlphaGo: how a shallow neural network can approximate an intractable search problem, touching on P vs. NP and chaotic systems.
[01:24:25 - 01:37:19] A comparison between AlphaGo's RL method and the policy gradient methods used for LLMs, highlighting the former's lower variance and higher sample efficiency.
[01:37:20 - 01:51:02] Discussion of alternative RL approaches for when MCTS is not feasible, such as Neural Fictitious Self-Play, and the relationship between MCTS and Q-learning.
[01:51:03 - 02:00:07] Reflections on rebuilding AlphaGo, including the dramatic reduction in compute cost over time and the applicability of the "Bitter Lesson."
[02:00:08 - 02:22:01] An analysis of off-policy learning in AlphaGo's replay buffer and a summary of why its RL algorithm is so elegant and efficient.
[02:22:02 - 02:35:58] Using the project as a case study for automated AI research, detailing what LLM assistants are currently good at (e.g., hyperparameter tuning) and where they fall short (e.g., high-level strategy).
[02:35:59 - 02:37:12] Conclusion and links to Eric Jang's project repository and related blog posts.

Key Claims

The game of Go was long considered intractable for AI due to its massive search space (~361³⁰⁰ possible games), but it was solved by using deep learning to intelligently prune the search tree rather than exhaustively exploring it. Evidence [00:58:58 - 01:00:06] Type Fact
AlphaGo's core is a Monte Carlo Tree Search (MCTS) algorithm guided by two neural networks. A value network estimates the probability of winning from a given board state, which allows the search to be truncated early. A policy network suggests promising moves, narrowing the search breadth from all legal moves to a handful of good ones. Evidence [00:27:30 - 00:28:15] Type Fact
The system improves via a self-play reinforcement learning loop where MCTS acts as a "policy improvement operator." For any given board state, the MCTS performs a deep search to generate a better, more confident move distribution than the policy network's initial guess. The policy network is then trained to directly predict this improved distribution. Evidence [01:02:46 - 01:03:17] Type Fact
This RL training process is exceptionally stable and sample-efficient because it generates a low-variance supervision signal for every single move in every game, regardless of the final outcome. It relabels actions with a "better" action distribution from the search, a process analogous to the DAgger algorithm in robotics. Evidence [01:05:49 - 01:07:18] Type Opinion
This method contrasts sharply with the policy gradient RL commonly used for LLMs, which suffers from high variance. LLM RL often relies on a single, sparse reward signal at the end of a long trajectory (e.g., win/loss), making it difficult to assign credit and learn efficiently, a problem described as "sucking supervision through a straw". Evidence [01:28:29 - 01:28:50] Type Opinion
A profound insight from AlphaGo is that a shallow neural network can learn to "amortize" the computation of a vast, nearly intractable search. This ability to compress a complex, sequential reasoning process into a single, parallelized forward pass challenges intuitions about the practical hardness of problems that are NP-hard in the worst case. Evidence [01:17:31 - 01:18:46] Type Opinion
The compute required to build a world-class Go AI has fallen dramatically. What originally required a large DeepMind team and massive compute can now be replicated by an individual for a few thousand dollars, due to algorithmic refinements (e.g., in KataGo) and hardware improvements. Evidence [00:01:49 - 00:02:13] Type Fact
Successful self-play training is critically dependent on having an accurate value function. If the value network gives poor estimates of win probability at the leaves of the search tree, the entire MCTS process can be corrupted, leading to worse-than-initial policy recommendations. This makes good initialization (e.g., from expert data) essential. Evidence [01:08:54 - 01:09:19] Type Opinion
While MCTS is powerful for Go, its direct application to open-ended domains like LLM reasoning is difficult. The action space of language is combinatorially larger and less discrete, making exploration heuristics like PUCT ineffective, and it is much harder to define a reliable, intermediate value function to truncate the search. Evidence [01:47:45 - 01:50:32] Type Opinion
Using LLMs for automated research on this project revealed that they excel at well-defined, local optimization tasks like hyperparameter tuning and executing described experiments. However, they currently lack the high-level strategic and lateral thinking needed to identify flawed research directions, debug complex systems, or propose fundamentally new approaches. Evidence [02:23:13 - 02:25:40] Type Example

In Plain Language

This episode is a deep dive into how AlphaGo, the AI that mastered the game of Go, actually works. The guest, Eric Jang, recently took on the project of rebuilding it himself, and he walks us through the core concepts from the ground up.

First, a quick primer on Go. It's a board game where two players, Black and White, place stones on a grid to surround and capture territory. The rules are simple, but the strategy is incredibly deep. For a computer, the main challenge is the sheer number of possible games. On a standard 19x19 board, the "game tree" of all possible move sequences is astronomically large—something like 361 to the power of 300, a number far greater than the number of atoms in the universe [10:48]. This is why for decades, experts believed a computer could never beat a top human player; a simple brute-force search was out of the question.

AlphaGo's solution wasn't to search the entire tree, but to search it smarter. The core algorithm it uses is called Monte Carlo Tree Search, or MCTS. Instead of building out the whole tree, for each move, the AI runs thousands of mini-simulations, exploring different paths into the future of the game. A key challenge in this process is balancing "exploitation" (following paths that have seemed promising in past simulations) with "exploration" (trying out new, less-traveled paths that might be surprisingly good). A formula called PUCT helps the AI make this trade-off at every step of its search [15:55].

But even MCTS on its own is too slow for a game this complex. This is where the deep learning breakthrough comes in. AlphaGo uses two neural networks to mimic human intuition and radically speed up the search:

The Value Network: This network looks at any given board position and estimates the probability of winning from that state [25:16]. This is a massive shortcut. Instead of simulating a game all the way to the end to see who wins, the AI can just ask the value network for a quick guess. This effectively "prunes the depth" of the search, allowing it to stop early.
The Policy Network: This network looks at a board and suggests a handful of the most promising moves [32:17]. Instead of having to consider all 300+ legal moves, the search can focus on the few that the policy network's "intuition" says are good. This "prunes the breadth" of the search.

So, for every single move it has to make, the AI performs this MCTS process, which is a four-step loop repeated thousands of times [45:13]: 1. Selection: It starts at the current board and travels down the tree of moves it has already explored, using the PUCT formula to guide its path. 2. Expansion: When it reaches a state it hasn't seen before in its search, it "expands" the tree by considering the possible next moves. 3. Evaluation: It uses the value network to get a quick score for this new, unexplored state. 4. Backup: It takes that score and propagates it all the way back up the path it came from, updating the average win-rate for all the moves on that path.

After thousands of these simulations, the AI has a very good idea of which opening move is best, and it plays that move. Then, for the next turn, it throws away that entire search tree and starts the whole process over from the new board state [29:17].

Here's the most elegant part: how the system learns and improves by playing against itself. This is the reinforcement learning (RL) loop. For any given board state, the policy network makes an initial, "instinctive" guess about the best moves. But then, the MCTS process runs its deep search and comes up with a better, more confident distribution of good moves [01:00:43]. The key insight of AlphaGo is to use this search-improved result as the new "correct answer." The policy network is then trained to directly predict this more refined outcome [01:02:53].

Essentially, the slow, computationally expensive work of the search is "distilled" or "amortized" into the fast, single-pass intuition of the neural network. The network learns to have the wisdom that the search provides. This process is incredibly efficient. Unlike the RL methods often used for large language models (LLMs), which might only get a single "you won" or "you lost" signal after a very long sequence of actions—a problem described as "sucking supervision through a straw" [01:28:36]—AlphaGo's method generates a high-quality training signal for every single move in every game, win or lose [01:05:49]. This makes the learning process extremely stable.

The most profound philosophical takeaway from AlphaGo is that a relatively simple, shallow neural network can learn to approximate the result of a mind-bogglingly vast search [01:17:31]. This challenges our ideas about what makes a problem computationally "hard." It suggests that many problems that are technically intractable in the worst case, like Go or protein folding, may have enough underlying structure that a neural network can find excellent solutions quickly.

Finally, Eric Jang reflects on using LLM assistants for this research project. He found them to be excellent at well-defined, local tasks like tuning hyperparameters or running a clearly described experiment [02:23:13]. However, they currently lack the high-level, strategic ability to realize a whole line of research is a dead end, debug complex system-wide issues, or propose fundamentally new approaches [02:25:22].

Worth a Second Listen

[01:02:46 - 01:04:25] The Self-Play Learning Loop. This is the absolute core of how AlphaGo improves. Eric Jang clearly explains how the slow, deliberate MCTS search generates a "better" set of move probabilities, which then becomes the training target for the fast, intuitive policy network. Hearing this is key to understanding how search is "amortized" into the network.
[01:17:31 - 01:18:46] The Philosophical Insight. Here, the conversation zooms out to the broader significance of AlphaGo. Eric explains why it's so profound that a shallow neural network can approximate a nearly intractable search problem, connecting it to fundamental questions in computer science like P vs. NP. It's a moment that elevates the discussion from technical details to deep implications.
[01:28:29 - 01:28:50] The Contrast with LLM Reinforcement Learning. This segment provides a sharp, intuitive contrast between AlphaGo's highly efficient learning method and the high-variance, "sucking supervision through a straw" approach common in LLM training. It clearly articulates why AlphaGo's RL algorithm is so elegant and stable.
[02:25:00 - 02:25:40] Limits of AI in Automated Research. Using his own project as a case study, Eric gives a grounded and nuanced take on what LLM assistants are good at (local, well-defined tasks) and where they currently fall short (high-level strategy, lateral thinking, and complex debugging). It's a valuable, real-world perspective on the current state of automated science.

Resonances with past episodes

Parallel→ The Unification of Physics · Don Lincoln
Both claims highlight a problem domain (Go, fundamental physics) whose state space is too vast for exhaustive exploration (combinatorial explosion of moves, immense energy gap to the Planck scale), necessitating alternative, non-brute-force approaches.
This[00:58:58 - 01:00:06] The game of Go was long considered intractable for AI due to its massive search space (~361³⁰⁰ possible games), but it was solved by using deep learning to intelligently prune the search tree rather than exhaustively exploring it.
Related[1:15:32], [1:25:51] A "Theory of Everything" is likely centuries away because the energy scales where it is predicted to apply are a quadrillion times higher than what is currently testable, and it is arrogant to assume no new, unexpected physics will be discovered in that vast gap.
Parallel→ The Three Realms of the Universe · Hakeem Oluseyi
Both describe a compact, non-intuitive mathematical abstraction (a neural network's forward pass, the wave function) that successfully predicts the outcome of a vastly more complex, underlying process (an intractable game-tree search, the behavior of a quantum entity).
This[01:17:31 - 01:18:46] A profound insight from AlphaGo is that a shallow neural network can learn to "amortize" the computation of a vast, nearly intractable search. This ability to compress a complex, sequential reasoning process into a single, parallelized forward pass challenges intuitions about the practical hardness of problems that are NP-hard in the worst case.
Related[06:30-07:47] The true physical nature of a quantum entity is unknown; physicists use different models (particle, wave, etc.) depending on the calculation. The wave function is a mathematical abstraction that, despite lacking direct physical intuition, predicts measurement outcomes with incredible precision.
Extension← The Era of Experience: Reinforcement Learning Beyond Human Data · David Silver
The former points out the limitations of searching and reasoning in a discrete and combinatorially massive language space, while the latter proposes that agents can use non-human languages (such as continuous or distributed computation) to discover more efficient thinking mechanisms, providing a potential path to overcome the search bottleneck in language space.
This[01:47:45 - 01:50:32] Although MCTS is powerful in Go, applying it directly to open-ended domains like large language model reasoning is difficult. The action space of language is combinatorially larger and less discrete, making exploration heuristics like PUCT ineffective, and defining a reliable intermediate value function for search truncation is also much harder.
RelatedPlanning and Reasoning · "discover or improve such approaches by learning how to" Agents can use non-human languages (such as symbolic, distributed, or continuous computation) to discover or improve more efficient thinking mechanisms, without being limited to mimicking human chains of thought.
Corroboration← The Era of Experience: Reinforcement Learning Beyond Human Data · David Silver
The former emphasizes the importance of breaking free from human prejudgement and relying on embodied feedback from the environment, while the latter's explanation of the AlphaGo self-play mechanism is precisely about exploring Go strategies that surpass humans without human guidance, through deterministic win/loss feedback (grounded rewards) provided by the game rules.
This[01:02:46 - 01:03:17] The system improves through a self-play reinforcement learning loop, where MCTS acts as a "policy improvement operator". For any given board state, MCTS performs a deep search to generate a better, more confident move distribution than the policy network's initial guess. The policy network is then trained to directly predict this improved distribution.
RelatedRewards · "Relying on human prejudgement in this manner usually leads" Relying on rewards from human prejudgement sets an insurmountable ceiling on agent performance, whereas embodied/grounded rewards from the environment allow agents to discover new strategies that transcend existing human knowledge.
Corroborates← Representation Learning and Predictive World Models · Saining Xie
Jang's technical observation that language is an unsuitable medium for structured search and reasoning corroborates Xie's broader philosophical claim that language is merely a condensed communication tool rather than a direct map of thinking.
This[01:47:45 - 01:50:32] Applying MCTS directly to LLM reasoning is difficult because language's action space is combinatorially larger and less discrete, making it hard to define a reliable intermediate value function for search.
Related[03:53:57 - 04:10:06] Language is a highly condensed communication tool developed by humans, not a direct map of thinking or decision-making; therefore, relying solely on language models creates a "crutch" that limits the development of real intelligence.
Complements← The Reality of Frontier AI and the End of Individual Heroism · Yao Shunyu
Yao's explanation of why coding is the fastest-developing AI scenario complements Jang's point by showing how well-defined compilation and test feedback directly solves the core bottleneck of sparse, high-variance reward signals in LLM reinforcement learning.
This[01:28:29] Reinforcement learning for LLMs typically suffers from high variance and sparse reward signals ('sipping supervision through a straw'), making credit assignment and efficient learning difficult.
Related[00:35:54] Coding is the fastest-developing AI scenario because it has highly well-defined feedback signals and a massive, high-quality data foundation in GitHub.
Extends← How an AI Chip Works from the Bottom Up · Reiner Pope
This episode explains a specific mechanism—the super-linear performance gain from low-precision arithmetic—that underpins the general hardware improvements cited as a key reason for AI's falling computational cost.
This[00:01:49 - 00:02:13] The computational power required to build a world-class Go AI has decreased dramatically, thanks in part to significant hardware improvements.
Related[00:14:54 - 00:15:41] The quadratic scaling of circuit area with bit width is the main reason low-precision arithmetic is so effective for AI, as halving precision can more than double performance.
Complements← How an AI Chip Works from the Bottom Up · Reiner Pope
The TPU's hardware architecture, with its large specialized units, is an ideal physical implementation for the algorithmic principle of amortizing search into a single, massive parallel computation, as described in the AlphaGo episode.
This[01:17:31 - 01:18:46] A profound insight from AlphaGo is that a neural network can 'amortize' the computation of a massive search by compressing the sequential reasoning process into a single, parallel forward pass.
Related[01:16:03 - 01:18:20] A TPU is organized into a few large, coarse-grained units, a design that is highly efficient for large, structured computations by amortizing the cost of control logic.

Tensions with past episodes

ContrastApparent tension← Representation Learning and Predictive World Models · Saining Xie
Jang highlights how AlphaGo successfully compresses sequential reasoning (System 2 search) into a single reactive forward pass (System 1 amortization), whereas Xie argues that a true world model must explicitly preserve predictive planning (System 2) to guide actions rather than relying on reactive policies.
This[01:17:31 - 01:18:46] A shallow neural network can learn to "amortize" a huge search, compressing complex, sequential reasoning into a single, parallel forward pass.
Related[04:17:24 - 04:17:41] The essence of a World Model is how to characterize a system and an environment such that you can make predictions... and this prediction can guide your action sequence, enabling planning and reasoning (System 2 thinking) rather than just reactive policies (System 1).

A faithful reconstruction and plain-language retelling of the episode, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.