
Computing Infrastructure and the Continuous Operation of Intelligence · Jensen Huang

2026-06-09 · A faithful, transcript-grounded reading by PodLens

Original episode: https://youtu.be/tsQB0n0YV3k?si=pC6HMVlFXJZKNqtO

Computing Infrastructure · Codesign · NVIDIA · Inference Infrastructure · Continuous Generative Computing

What This Episode Is About

NVIDIA founder and CEO Jensen Huang sat down for an in-depth conversation with host Anj in Stanford's CS153 classroom. Huang described what he sees as the most fundamental reshaping of computer science in 60 years: computing is moving from an on-demand model built on retrieving pre-recorded content to real-time, generative, continuously running agentic systems. He laid out the logic of extreme codesign across the chip, compiler, software, and network layers, and explained how a 1-million-fold leap in computing power over 10 years supports the data explosion of generative AI. The conversation also covered the commercial and security foundations of open-source versus closed-source models, how the Vera Rubin hardware architecture is optimized for low-latency agent tool calls, and the coordination failure that links compute shortages to fragmented university research funding. Finally, Huang recounted the company's strategic reset toward robotics (Thor) after abandoning mobile (Tegra), and shared his personal view of resilience, that "90% of it is suffering," offering clear system-level guidance for engineers and decision-makers in an era of ubiquitous intelligence.

Core Viewpoints List

  1. Computer science is undergoing a fundamental reshaping from "pre-recorded retrieval" to "real-time generation." The traditional computing paradigm essentially pulls and presents pre-recorded images, videos, or program binaries based on instructions; whereas in the Agentic era, computers perform real-time generation and reasoning based on a contextual understanding of intent. [01:17-03:13] | Type: Viewpoint
  2. With Dennard scaling having broken down, chip design must shift toward extreme codesign of hardware, compilers, and software stacks. The era of general-purpose CPUs riding on semiconductor scaling is over. By coordinating CPUs, GPUs, high-speed interconnects, switches, and libraries as one system, NVIDIA achieved a 1-million-fold leap in computing performance within 10 years, whereas hardware-only upgrades would have yielded only a 10-fold improvement. [10:02-12:20] | Type: Fact
  3. Model Flops Utilization (MFU) is a narrow metric that easily skews design, and good system design calls for overprovisioning compute. To keep Amdahl's Law from capping the whole system when bottlenecks shift dynamically among network latency, storage throughput, and memory bandwidth, the system needs spare computing power: treat Flops as the cheap resource and accept lower local utilization in exchange for high burst throughput on the overall task. [27:11-28:57] | Type: Viewpoint
  4. The decode phase of large language model inference is memory-bandwidth-bound, and it takes a high-density interconnect (such as NVLink 72) to reach very high energy efficiency. Blackwell achieves a 50-fold tokens-per-watt improvement despite extremely low MFU in decode because it pools the memory of 72 chips over a high-speed backplane, eliminating the crippling latency of reading and writing memory across network nodes. [29:33-31:30] | Type: Fact
  5. In critical system design choices, the art of strategy lies in finding the balance between "a narrow market caused by high specialization" and "the mediocrity that comes with generalization." Overfitting to a single task can deliver peak performance but cannot sustain high R&D costs; an overly general-purpose design is inefficient everywhere. Architects must rely on intuition about where the industry is heading to make those strategic trade-offs. [32:05-33:03] | Type: Viewpoint
  6. The agentic computing paradigm has given rise to a processor architecture (Vera Rubin) quite different from that of the cloud-services era. While an agent executes a tool call, the GPU sits waiting; the core bottleneck is not multi-core throughput but the latency of the CPU running complex single-threaded logic. The Rubin architecture therefore strengthens single-core, low-latency CPU performance and attaches storage directly to the ultra-high-speed fabric. [36:04-37:52] | Type: Fact
  7. The root cause of the "computing power famine" in academia and university research lies in the coordination failure of research funding, rather than the chip supply itself. Universities follow a fragmented model where individual labs independently compete for small grants, leaving them unable to afford the construction or leasing costs of centralized million-card clusters. The solution lies in budget restructuring, with the university level centrally allocating special funds on the order of $1 billion to build a campus-wide supercomputing cloud service shared by the entire school. [53:27-55:19] | Type: Viewpoint
  8. Resilience against setbacks cannot be learned in a greenhouse; it must be forged at the muscle level by enduring failure and facing desperate situations. 90% of a real career is about pain, challenges, and groping in the dark. The key to success is not pursuing endless happiness, but learning to maintain form during low points and allowing strategic mistakes to crystallize into long-term optionality for the enterprise. [42:00-45:04] | Type: Viewpoint
  9. Attempting to deny other countries general-purpose computing power not only conflates GPUs with atomic bombs as a matter of technical logic, but will also do long-term, self-inflicted damage to the US semiconductor ecosystem. GPUs widely serve general-purpose civilian uses such as medical scanning and image rendering. If US semiconductor policy forces the industry to abandon two-thirds of the global market, the domestic industry will shrink for lack of R&D funding, repeating the earlier decline of the US telecom industry. [47:29-50:34] | Type: Viewpoint
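
To put the two figures in item 2 on a per-year footing, here is a minimal arithmetic sketch. It assumes the gain compounds at a constant yearly rate, which the episode does not claim, and `annual_factor` is a hypothetical helper introduced only for this illustration.

```python
def annual_factor(total_gain: float, years: float) -> float:
    """Constant per-year multiplier that compounds to total_gain over `years`.

    Assumes steady compounding; real progress is likely lumpier.
    """
    if total_gain <= 0 or years <= 0:
        raise ValueError("total_gain and years must be positive")
    return total_gain ** (1.0 / years)


# Both totals are the figures quoted in item 2 above.
codesign_per_year = annual_factor(1_000_000, 10)
hardware_only_per_year = annual_factor(10, 10)

print(f"codesign: about {codesign_per_year:.2f}x per year")
print(f"hardware only: about {hardware_only_per_year:.2f}x per year")
```

Under that assumption, a 1-million-fold gain over 10 years works out to roughly a 4x improvement every year, against roughly 1.26x per year for the 10-fold case.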

Plain English Retelling

So let's dig into Jensen Huang's session in the Stanford classroom. While most people are marveling at NVIDIA's soaring market value, this conversation actually lays out his philosophical take on the underlying mechanics of the entire computing world, and it is seriously hardcore.

First, we have to understand that the discipline of computer science is going through a complete reshuffle for the first time in 60 years. In the classical era established by the IBM System/360, we used computers in a "retrieval-based" way: software, pictures, and videos were written or recorded ahead of time and stored on the hard drive, and when you clicked, the computer fetched them for you. Today's AI computing, by contrast, is "generated in real time." More interestingly, we are saying goodbye to "on-demand computing." Previously, we only opened a webpage or sent a command when we needed something; in the agentic era, AI agents sit in the background "continuously running." It is like going from hauling water from a well every day to having pipes installed at home, where the water flows continuously.

This leads to a major strategic divergence in hardware and software. Many people are currently hyping Model Flops Utilization (MFU), which measures whether the computing power of the graphics cards you bought is fully used; if utilization is low, they call it waste. But Jensen Huang poured cold water on this. He believes that excellent system design should "pursue low MFU and overprovision computing power." Why? Because in a massive supercomputing cluster, computing power (Flops) is actually the cheapest resource; the real bottlenecks lie in network transmission, storage reads, and memory bandwidth. If you insist on squeezing computing power to 100%, then the moment the system hits a sudden burst of data congestion it gets stuck on those other bottlenecks (this is Amdahl's Law). It is like a highway: you can't cram every car onto it just to "maximize highway utilization," because all you get is a massive traffic jam.
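
As a rough illustration of the Amdahl's Law point above, the sketch below computes the overall speedup when only the compute portion of a step gets faster. The 60% compute share is a hypothetical number chosen for the example, not a figure from the episode, and `amdahl_speedup` is a helper name introduced here.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Amdahl's Law: overall speedup when a fraction of the runtime gets `factor` times faster.

    The remaining (1 - accelerated_fraction) of the runtime is left unchanged.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Hypothetical split: 60% of a step is compute, 40% is network, storage, and memory.
COMPUTE_SHARE = 0.6

for factor in (2, 10, 1000):
    overall = amdahl_speedup(COMPUTE_SHARE, factor)
    print(f"compute {factor}x faster -> whole step {overall:.2f}x faster")

# With compute infinitely fast, the untouched 40% still sets the ceiling.
print(f"ceiling: {1.0 / (1.0 - COMPUTE_SHARE):.2f}x")
```

With these assumed numbers, even a 1000x faster compute stage leaves the whole step just under 2.5x faster, which is why the non-compute bottlenecks, not Flops utilization, end up deciding throughput.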

This "utilization isn't everything" stance directly guided the development of the Blackwell and Rubin chips. Blackwell with NVLink 72, for example, was designed to solve the memory bandwidth problem of decode in AI inference. Even though its MFU looks very low, the tokens it produces per unit of power have jumped 50-fold. And for the Rubin architecture, they went as far as designing a CPU with very fast single-core performance. That is because when an agent is executing a tool (querying a database or calling an API, say), the GPU sits idle and must wait for the CPU to finish. If the CPU is slow, the expensive GPU cluster just spins its wheels. All of this is derived from the first principles of whole-system codesign.
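
The GPU-waits-on-CPU argument above can be sketched with a toy timing model. It assumes the GPU stalls completely while the CPU runs a tool call, with no overlap or batching of other requests; a real serving stack may hide some of this latency. All durations are hypothetical, and `gpu_busy_fraction` is a helper name introduced for the example.

```python
def gpu_busy_fraction(gpu_seconds: float, cpu_tool_seconds: float) -> float:
    """Share of one agent step in which the GPU is doing work.

    Assumes a strictly serial step: the GPU generates, then idles while the
    CPU executes the tool call.
    """
    if gpu_seconds < 0 or cpu_tool_seconds < 0:
        raise ValueError("durations must be non-negative")
    total = gpu_seconds + cpu_tool_seconds
    if total == 0:
        raise ValueError("step must take some time")
    return gpu_seconds / total


# Hypothetical step: 0.2 s of GPU generation followed by one tool call.
GPU_SECONDS = 0.2

for cpu_seconds in (0.8, 0.2, 0.05):
    busy = gpu_busy_fraction(GPU_SECONDS, cpu_seconds)
    print(f"tool call {cpu_seconds:.2f}s -> GPU busy {busy:.0%} of the step")
```

In this simplified model, cutting the single-threaded tool-call time is the only lever that raises how much of each step the GPU spends working, which matches the reasoning given for a latency-focused CPU.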

Finally, he also laid bare the truth behind American universities "not being able to afford cards." He said chips are actually in plentiful supply; Stanford can't afford them not because NVIDIA is withholding sales, but because the university's incentive mechanism is broken. Professors each hold their own turf and apply for small grants individually, so no one can pool enough money to buy a large cluster. This is what he calls "coordination failure." If Stanford really wants its students and professors at the forefront of AI, it should carve out $1 billion from its $40 billion endowment and lease a supercomputing cloud for the entire school to share. Blunt words, but they hit the nail on the head about the underlying tension between technological change and outdated organizational structures.

A faithful reconstruction and plain-language retelling of the episode, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.