Exploration and Reflection on Large Model Post-Training Reinforcement Learning Infrastructure · Weng Jiayi
2026-06-09 · A faithful, transcript-grounded reading by PodLens
Original episode: https://youtu.be/I0DrcsDf3Os?si=RbqE6pkIgHFJh5mq
Reinforcement Learning · Post-Training · Infrastructure · Tianshou · Determinism
What This Episode Is About
This episode features a conversation with Weng Jiayi, a core contributor to OpenAI's post-training reinforcement learning (RL) infrastructure. The interview traces his academic journey at Tsinghua University and Carnegie Mellon University (CMU), shares the behind-the-scenes stories of developing the open-source projects Tianshou and tuixue.online, and dives deep into his engineering practice building the reinforcement learning infrastructure for large model post-training (including RLHF) at OpenAI. He argues that the competitive barrier in large model R&D lies in the correctness of the infrastructure and the iteration speed per unit of time, rather than in pure algorithmic ideas. He also shares his thoughts on the definition of Artificial General Intelligence (AGI), team talent density, the flow of information in organizational structures, and philosophical reflections on determinism and predicting the future.
Timeline Topic Map
- [00:00 - 20:39] Guest Background and Upbringing: Weng Jiayi introduces how he entered Zhu Jun's lab during his sophomore year and stumbled into choosing reinforcement learning (RL) as his research direction.
- [20:40 - 27:56] Early Academic Hacking Experiences and Research Struggles: He recounts optimizing the Tsinghua University campus network by finding system vulnerabilities and winning a game AI competition (VizDoom), and describes from first-hand experience the drawbacks of RL research at the time, which overfitted to benchmarks and relied entirely on heuristic parameter tuning.
- [27:57 - 41:15] Summer Research Setbacks and Reflection on Evaluation Systems: His summer research on Mixture of Experts (MoE) at Mila under Yoshua Bengio failed to yield ideal results. Upon returning to Tsinghua University, he faced the pressure of failing PhD applications and began to reflect on and attempt to break free from the university's singular GPA evaluation system.
- [41:16 - 48:10] The Birth and Design Philosophy of the Open-Source Project Tianshou: Out of dissatisfaction with the over-abstraction of the existing RLlib library, he built Tianshou from scratch, emphasizing code consistency and minimalist abstraction for researcher usability.
- [48:11 - 56:25] The Public Welfare Nature and Personal Impact Metrics of tuixue.online: He describes the origin and reach of the visa date query system tuixue.online, explains that his internal metric for personal impact is "the number of people who remember my name after I die," and discusses the positive feedback he seeks from non-profit charitable projects.
- [56:26 - 1:04:10] Choosing Between Industry and Academia: During his Master's studies at CMU, he decided to target the industry, comparing and analyzing the value of PhDs versus Master's degrees in the AI era, pointing out that engineering capability is paramount at the current stage.
- [1:04:11 - 1:08:44] The Core Value of Engineering Capability and Infrastructure: He proposes the view that "teaching researchers to do engineering is harder than teaching engineers to do research," revealing that the current competition in large models is essentially about the correctness of the infrastructure (infra) and the ability to eliminate bugs.
- [1:08:45 - 1:14:12] The Opportunity to Join OpenAI and the Interview with John Schulman's Team: He shares the process of being recruited by John Schulman and his experience of independently writing end-to-end code in two hours to complete the final round of interviews, clarifying his preference for "selling shovels" (doing infrastructure) over directly doing research parameter tuning.
- [1:14:13 - 1:23:39] Developing ChatGPT and RLHF Infrastructure Inside OpenAI: He reveals the team size and R&D atmosphere before the release of ChatGPT, explains the difficulty of using the PPO algorithm pipeline in early RLHF, and discusses how to measure reinforcement learning performance and prevent reward hacking.
- [1:23:40 - 1:32:00] Differences Between Large Model RLHF and Traditional Reinforcement Learning: He analyzes the fundamental differences between traditional toy tasks and large model reinforcement learning in terms of model scale, sampling throughput, and computational efficiency, and shares his experience of entering the emergency room (ER) due to high-intensity overtime.
- [1:32:01 - 1:39:47] Refactoring the Next-Generation OpenAI Infrastructure: He explains the necessity of refactoring the old infrastructure that had run for three years, emphasizing the need to clean up technical debt and improve the experimental iteration speed for researchers.
- [1:39:48 - 1:44:16] The Business Logic of Closed Source and the AGI Mission: He argues that OpenAI's closed-source strategy rests on game-theoretic considerations of commercial survival and financing, and discusses how "benefiting all of humanity" can be broken down into giving ordinary people free or low-cost access to technological products.
- [1:44:17 - 1:47:33] Board Infighting and the Departure of Talents Like John Schulman: He explains the truth behind Sam Altman's dismissal from an internal perspective (due to a trust crisis rather than the discovery of dangerous technology) and discusses the organization's "talent generation capability" and personnel substitutability.
- [1:47:34 - 1:52:53] Communication Costs Brought by Organizational Scaling and the Alertness Caused by DeepSeek: He analyzes the problems of bloated code and structures, and the loss of context sharing when an organization grows larger, mentioning internal vigilance regarding DeepSeek's extremely fast iteration speed.
- [1:52:54 - 2:02:42] Deterministic Worldview and Predictions for the Future: He discusses the determinism of the universe, whether humans possess free will, the modification of world lines in quantum mechanics, and how to live in the present and invest in the future with a Sisyphus-like sense of happiness.
Core Viewpoints List
- Core Viewpoint: The competition in large models and the frontier exploration of artificial intelligence is essentially a battle over the correctness of infrastructure and the iteration speed per unit of time.
- Evidence: [01:04:49 - 01:05:19]
- Type: Opinion
- Core Viewpoint: Teaching a researcher to do engineering is much more difficult than teaching an engineer to do research.
- Evidence: [01:04:26 - 01:04:48]
- Type: Opinion
- Core Viewpoint: The decay of codebases and projects mostly stems from the inconsistency and failure of assumption propagation caused when multiple developers contribute code; maintaining consistency from start to finish is the key to high-quality code.
- Evidence: [01:04:41 - 01:05:23]
- Type: Opinion
- Core Viewpoint: Traditional reinforcement learning (RL) research relies excessively on overfitting and heuristic parameter tuning on toy tasks (Atari, MuJoCo, etc.), whereas what the industry truly cares about is using RL to solve real-world complex environment problems.
- Evidence: [01:11:51 - 01:12:31]
- Type: Fact
- Core Viewpoint: The pain point in measuring the performance of reinforcement learning models lies in the difficulty of distinguishing the true quality of checkpoints, because a single reward value is prone to reward hacking, leading to excessive evaluation variance and noise, which ultimately still requires reliance on Human Feedback.
- Evidence: [01:25:31 - 01:26:58]
- Type: Fact
- Core Viewpoint: AGI R&D teams need to maintain an extremely high talent density. The core value of high talent density lies in the spontaneous emergence of innovation, while ensuring lossless information transmission between management and ground-level executors by flattening and simplifying the organizational structure.
- Evidence: [01:21:11 - 01:22:49]
- Type: Opinion
- Core Viewpoint: OpenAI's closed-source strategy is a realistic consideration based on game theory: if the weights of the most advanced models are open-sourced, other commercial competitors will quickly replicate them and implement closed-source strategies, causing the pioneer to lose the capital to continue financing and sustain survival.
- Evidence: [01:42:01 - 01:42:56]
- Type: Prediction
- Core Viewpoint: The expansion of organizational scale inevitably leads to slower iteration speeds. The fundamental reason is that the context stored in the human brain is limited, making it difficult to achieve complete and consistent context sharing in a large organization.
- Evidence: [01:52:01 - 01:52:48]
- Type: Opinion
- Core Viewpoint: The underlying universe is a deterministic system, and free will does not exist; every person's thoughts, decisions, and future world trajectories were already determined at the moment of the Big Bang.
- Evidence: [01:53:40 - 01:54:08]
- Type: Conjecture
Internal Tensions and Self-Corrections
- [01:35:31] vs [01:50:58]: Weng Jiayi strongly advocates breaking free from the shackles of university GPA and established evaluation systems, yet the ultimate metric of achievement he sets for himself ("the number of people who remember my name after I die") is essentially still external social recognition (such as GitHub stars, tuixue.online clicks, etc.). This constitutes an internal tension between resisting external evaluation and relying on external recognition.
- [01:53:40] vs [02:00:33]: He firmly believes that the physical world is completely deterministic and that humans have no free will, yet at the same time, he emphasizes the need to "invest in the future" through hard work in the present to gain the right to choose. After the host pointed out that this contradicts determinism, he could only attribute "the act of investing in the future itself" to a pre-determined outcome.
Plain English Retelling
What is the core competitiveness of large model research? In Weng Jiayi's view, the answer is by no means the exquisite algorithms or paper ideas that academia is passionate about, but rather extremely simple engineering practices—the correctness of the infrastructure and the iteration speed per unit of time. He cites a colleague's view that teaching a researcher how to do engineering well is far more difficult than teaching an engineer how to do research well. In the current frontier exploration of large models, ideas are very cheap; what truly sets competitors apart is who can validate these ideas more safely and quickly. Every research lab's model architecture has varying degrees of bugs; whoever can fix more bugs trains their model better.
This strong preference for "engineering consistency" and for building infrastructure tools ("selling shovels") runs through Weng Jiayi's academic and professional career. During his senior year, dissatisfied with the bloated and complex abstractions of the mainstream reinforcement learning library RLlib, he spent two weeks building the first version of Tianshou from scratch. He believes that the vitality of a project lies in consistency: when many people add code without coordination, the project only decays faster. tuixue.online was similarly born from a personal pain point, the need to query visa appointment dates. This entirely non-profit charitable project drew millions of clicks and brought him a satisfaction that money could not. He uses "the number of people who remember my name after I die" as the final measure of his life, and he values heartfelt approval from the outside world more than official evaluation systems.
After joining OpenAI, Weng Jiayi was responsible for building the entire reinforcement learning infrastructure for the post-training phase. He points out that reinforcement learning for large models is fundamentally different from traditional toy benchmarks (such as playing games or physical simulations): the bottleneck of toy tests lies in the environment, while the model itself is very small; large models, on the other hand, have extremely simple environments (inputting prompts) but extremely massive model parameters, making how to improve sampling throughput and distributed training efficiency the core issues. During the development of ChatGPT, the team experienced immense uncertainty: releasing ChatGPT was initially just to collect real-world user data, and they were even prepared to shut it down after five days if it met with a cold reception; no one expected the user curve to explode exponentially. In addition, he points out that the biggest headache when evaluating reinforcement learning models is reward hacking—the model's reward score appears perfectly saturated, but its actual performance declines due to overfitting, forcing them to fall back on human evaluation to filter model checkpoints.
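The reward hacking problem described above can be illustrated with a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the quality curve, the exploit term, the noise level, the rater count, and all function names are hypothetical, and nothing here reflects OpenAI's actual tooling or data. It only shows the shape of the problem, in which a reward score keeps climbing while real quality falls, so a noisy human evaluation is needed to choose a checkpoint.

```python
import random
import statistics


def true_quality(step: int) -> float:
    # Hypothetical curve: real quality improves early, then degrades as the
    # policy overfits to the reward model.
    return step - 0.15 * step ** 2


def proxy_reward(step: int) -> float:
    # Hypothetical reward-model score: true quality plus a growing bonus from
    # exploiting flaws in the reward model, so the score never stops rising.
    return true_quality(step) + 0.3 * step ** 2


def human_eval(step: int, num_raters: int, rng: random.Random) -> float:
    # Mean of noisy human ratings centred on the true quality (assumed noise model).
    ratings = [true_quality(step) + rng.gauss(0.0, 1.0) for _ in range(num_raters)]
    return statistics.mean(ratings)


def pick_checkpoints(num_checkpoints: int = 10, num_raters: int = 200, seed: int = 0):
    # Returns (checkpoint chosen by reward score, checkpoint chosen by human eval).
    rng = random.Random(seed)
    steps = range(num_checkpoints)
    best_by_reward = max(steps, key=proxy_reward)
    human_scores = {s: human_eval(s, num_raters, rng) for s in steps}
    best_by_humans = max(steps, key=human_scores.get)
    return best_by_reward, best_by_humans


if __name__ == "__main__":
    by_reward, by_humans = pick_checkpoints()
    print("picked by reward score:", by_reward, "true quality:", true_quality(by_reward))
    print("picked by human eval:  ", by_humans, "true quality:", true_quality(by_humans))
```

In this toy setup the reward score always selects the latest checkpoint, even though its true quality is the worst of the run, while the averaged human ratings land near the real peak. Lowering `num_raters` makes the human estimate noisier, which mirrors the evaluation variance mentioned in the episode.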
Regarding the organizational changes inside OpenAI, Weng Jiayi provides a unique internal perspective. The CEO ouster turmoil at the end of 2023 was not, as rumored outside, because "scientists saw some kind of devastating technological breakthrough," but purely a crisis of trust between the board and Sam Altman. As the company expanded rapidly from over two hundred people to over three thousand, communication costs multiplied, and the limited context a human brain can hold led to a severe lack of shared context. This also explains why the ultra-fast iteration speed that DeepSeek claimed on Twitter put OpenAI on alert internally: once an organization grows to a certain size, refactoring infrastructure that has accumulated three years of technical debt and regaining the iteration speed of the small-team era becomes a matter of life and death.
At the end of the conversation, Weng Jiayi reveals his deterministic worldview: the macroscopic universe is a deterministic Markov process, and humans do not have true free will; what you will say and do in the next second was already written in stone at the moment of the Big Bang. He once tried to falsify this view but failed. However, he believes that in the face of this cold fate, the most rational thing for a person to do is to forget all of it and experience life in the present, just like firmly believing that Sisyphus pushing the boulder is happy, finding inner peace within the endless deterministic trajectory.
Segments Worth Listening Closely To
- [41:16 - 45:00] The Two-Week Development Story of Tianshou's Birth. Weng Jiayi recounts how he decided to tear everything down and start over because he disliked RLlib's bloat, explaining the design philosophy that "project consistency is the only antidote to code decay." The segment shows a strong sense of engineering aesthetics and states a principle every system architect should appreciate.
- [01:04:26 - 01:06:00] The Inversion of Difficulty Between Engineering and Research Capabilities. He shares his colleague's argument that "teaching researchers to do engineering is harder," directly deconstructing the ecological positioning of academia and industry in frontier large model R&D. For listeners confused about "doing a PhD vs. getting a job," this is a highly weighty, sober reflection.
- [01:25:31 - 01:27:00] The Metaphysics of Reward Hacking and Checkpoint Evaluation. Listen to him describe how, in large model RLHF training, the team faced high-variance evaluation noise and saturated reward curves and had to blind-test checkpoints. This segment reconstructs the real struggles of top labs when solving practical engineering problems.
- [01:53:40 - 01:56:00] Determinism and the Script Written in Stone by the Big Bang. The moment where Weng Jiayi calmly and confidently argues that humans have no free will and that fate is long predetermined. This geek-style philosophical coldness contrasts sharply with the podcast's otherwise relaxed technical narrative, making it a highly striking moment.
Resonances with past episodes
- Isomorphic→ The Rise of AI-Native Companies and Personal Software Factories · Garry Tan & Diana Hu
Both reached a high consensus on organizational structure design, pointing out that middle management leads to loss in information transmission, and therefore a highly flattened architecture must be used to achieve lossless information transmission and efficient decision-making.
This [01:21:11 - 01:22:49] AGI R&D teams need to maintain an extremely high talent density. The core value of high talent density lies in the spontaneous emergence of innovation, while ensuring lossless information transmission between management and ground-level executors by flattening and simplifying the organizational structure.
Related [35:03-36:32] In AI-native organizations, traditional layer-by-layer reporting and information relaying will be flattened, leaving only three core roles. Middle management is a product of lossy routing. In AI-native organizations, personnel will be extremely compressed and flattened into: Builders, DRIs (Directly Responsible Individuals), and AI Founders who personally explore tools on the front lines.
- Corroboration→ The Rise of AI-Native Companies and Personal Software Factories · Garry Tan & Diana Hu
Both point out the limitations of quantitative metrics or general benchmarks in evaluating AI model performance, emphasizing that when facing complex evaluation noise and reward hacking behaviors, human subjective judgment (human feedback or human 'taste') must ultimately be introduced as the final screening standard.
This [01:25:31 - 01:26:58] The pain point in measuring the performance of reinforcement learning models lies in the difficulty of distinguishing the true quality of checkpoints, because a single reward value is prone to reward hacking, leading to excessive evaluation variance and noise, which ultimately still requires reliance on Human Feedback.
Related [37:18-38:29] When the cost of writing and implementing code drops to zero, the only human asset that cannot be delegated or replaced is 'taste'. General benchmarks cannot determine whether an AI in a specific vertical domain is useful. Human Taste (the grasp of subtle product experiences and the ability to discern right from wrong) is the ultimate defense line determining commercial value capture, which requires embedding Taste into the system by building unique evals.
- Complement→ The Rise of AI-Native Companies and Personal Software Factories · Garry Tan & Diana Hu
Both point out the core pain point when traditional organizations expand: information (context) is only stored in individual employees' brains, making it difficult to achieve efficient and consistent sharing within the organization, thereby leading to a decline in decision-making and iteration efficiency.
This [01:52:01 - 01:52:48] The expansion of organizational scale inevitably leads to slower iteration speeds. The fundamental reason is that the context stored in the human brain is limited, making it difficult to achieve complete and consistent context sharing in a large organization.
Related [31:39-33:32] Traditional corporate organizations operate in a highly 'open loop' manner full of information loss, whereas AI can transform them into 'closed loop control systems'. Diana Hu believes that traditional companies store information in employees' brains, routing it through chaotic Slack DMs and meetings, which is extremely inefficient. Introducing embedded agents to read all company artifacts in real-time can build a self-healing, PID-controller-like closed-loop information and decision-making circuit.
- Corroboration→ Product Building and Career Evolution in the AI Era · Nikhyl Singhal
Both jointly emphasize the decisive role of 'iteration speed' in competition, extending this product-level golden rule to the infrastructure construction of large model frontier exploration.
This [01:04:49 - 01:05:19] The competition in large models and the frontier exploration of artificial intelligence is essentially a battle over the correctness of infrastructure and the iteration speed per unit of time.
Related [13:54] Product iteration speed determines the success or failure of a product more than its initial state, constituting the core advantage of startups against large companies.
Tensions with past episodes
- Apparent tension← Computational Design and Synthetic Biology · Neri Oxman
The Oxman episode uses a mathematical model of high-to-low entropy transition to argue that an agent possesses "empowerment" and the agency to make deterministic choices among infinite options and control the system's trajectory; this episode starts from physical determinism, arguing that free will does not exist and that all choices and trajectories have already been completely determined by the underlying laws of the universe.
This [01:53:40 - 01:54:08] The universe is fundamentally a deterministic system, and free will does not exist; every person's thoughts, decisions, and future world trajectories were already determined at the moment of the Big Bang.
Related [29:26-30:24] An agent's state of empowerment can be defined by a high entropy value in the distribution of its possible states, yet when a specific action and choice occur, the entropy of its single concrete state must be extremely low. That is, possessing the ability to make a deterministic choice among infinite options and control the system's trajectory.
This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.