The Discipline of Value Delivery per Gigawatt · Amin Vahdat

2026-06-09 · A faithful, transcript-grounded reading by PodLens

Original episode:https://youtu.be/VeTqsCpcDgg?si=Hv0WYODnPWCCT-qP　·　Timestamps are clickable — they seek the player in place

Compute InfrastructureValue per GigawattData CentersEnergy EfficiencySystem Co-design

What This Episode Is About

This episode features a guest lecture from Stanford CS153 (Frontier Systems), delivered by Amin Vahdat, Vice President of Systems Infrastructure at Google. Centered around the theme of "Value per Gigawatt," the lecture explores how, in the era of AI scaling, the focus of compute infrastructure construction must shift from mere "capacity and scale (Megawatt/Gigawatt/FLOPs)" to "actual delivered user value and effective output (Goodput/Daily Active Users)." Drawing on Google's nearly 30 years of infrastructure practice, Amin Vahdat analyzes system balance, Amdahl's Law, the trade-offs between high availability and reliability, and the constraints of power supply and supply chains in hyperscale systems. He also shares Google's insights on TPU interconnection technologies (such as Optical Circuit Switching, OCS), hardware-software co-design, and environmental and community responsibility.

Timeline Topic Map

[00:09 - 02:24] The host introduces Amin Vahdat's central role in Google's infrastructure (such as Borg and TPU) and introduces the background of a 1-gigawatt compute build costing billions of dollars.
[02:25 - 05:25] Exploring the measurement flaws of gigawatt-level capacity, Amin Vahdat argues that the focus should be on the actual value delivered to users per dollar or per gigawatt, rather than blindly pursuing compute scale.
[05:26 - 08:39] Discussing how to measure "intelligent output," pointing out that compared to low-level FLOPs, ultimate user retention and business metrics (such as Daily Active Users) are the fundamental measures of compute efficiency.
[08:40 - 11:25] Analyzing in detail the relationship between high reliability and redundant design in hyperscale clusters, pointing out that thousands of components in system design can fail, and the cost of improving reliability is immense.
[11:26 - 12:28] Pointing out a shift in attitude among frontier labs in model training, where they "prioritize compute capacity over absolute reliability," accepting short-term downtime in exchange for double the capacity.
[12:29 - 15:15] Explaining the sensitivity of synchronous training (such as All-reduce collectives) to single points of failure; compared to the loosely coupled architectures of the internet era, modern accelerator training requires that not a single node fails.
[15:16 - 17:34] Introducing how Amdahl's Law manifests in system balance, emphasizing that FLOPs must be supported by matching HBM (High Bandwidth Memory) bandwidth, SRAM cache, and network I/O, otherwise it leads to extremely low MFU (Model FLOPs Utilization).
[17:35 - 20:06] Explaining the memory bandwidth bottlenecks brought by sparse computation in mixture of experts (MoE) models, and how to extend the perspective of system balance to CPUs, storage, and data center networks.
[20:07 - 22:42] Discussing the physical limits of procurement and supply chains, and how to forecast compute demand and perform dynamic replanning under long lead times of two to three years.
[22:43 - 25:34] Exploring the shortage of power grid connection capacity, and how hyperscale cloud providers' preference for expandable large sites leaves small and medium sites under 100 megawatts idle, predicting that inference serving demands will drive compute deployment toward fragmentation and flexible scheduling.
[25:35 - 28:25] Q&A session: Advising Stanford students to find the fields they are most passionate about, as algorithms, hard engineering, operating systems, and model architectures are all equally important, and the future is hard to predict.
[28:26 - 31:30] Recounting the debate during the design of Google TPU v2, where Amin Vahdat admits his traditional view of using ethernet in TPU supercomputers was proven wrong, and the point-to-point distributed shared memory design by Norm Jouppi and others ultimately won.
[31:31 - 32:48] Briefly describing how Sundar Pichai merged Brain and DeepMind during the ChatGPT Code Red period and unified the infrastructure teams, accelerating the reshaping of company culture.
[32:49 - 36:51] Analyzing in detail the role of Optical Circuit Switches (OCS) in Google data centers, utilizing mirrors for microsecond/second-level topology reconfiguration to isolate faulty racks and directly connect to remote storage.
[36:52 - 38:03] Exploring interconnect topologies for machine learning training, explaining that torus topologies are suitable for All-reduce collectives, while switch topologies are suitable for All-to-all communication, and model designers optimize software around these topologies.
[38:04 - 40:21] Discussing hardware depreciation and planning, pointing out that Google's compute hardware depreciates over six years, with old chips still seeing extremely high utilization, requiring the planning department to replan daily under immense uncertainty.
[40:22 - 42:04] Discussing the development of robotics, using Waymo as an example to illustrate the importance of local real-time performance and safety, pointing out its extremely high requirements for edge compute and latency.
[42:05 - 43:13] Exploring recently announced industry compute collaborations (such as SpaceX/xAI with Anthropic, and Cursor collaborations), pointing out that this is driven by the shortage of inference compute caused by the explosion of coding agents.
[43:14 - 47:19] Sharing personal experiences, from being inspired by a magazine at age 6 to become a programmer, to joining Google during an academic sabbatical and staying because of the ability to solve real-world technical problems.
[47:20 - 49:36] Introducing the competition and evolution between TPUs and GPUs, pointing out that Google is specializing its TPU product line (such as the 8i inference chip and 8t training chip), and hardware specialization is an inevitable trend to improve energy efficiency ratios.
[49:37 - 52:56] Exploring non-zero-sum games and supply chain concentration risks, pointing out that component manufacturers also need Google as a customer to avoid the risk of a single customer monopolizing capacity, encouraging students to discard a winner-take-all mindset.
[52:57 - 54:42] Emphasizing the responsibility of technologists in societal transitions, calling for the integration of guardrails and safety mechanisms in technology deployment.
[54:43 - 56:41] Discussing that the biggest bottleneck in infrastructure innovation lies in energy, and achieving global-scale energy abundance and affordability requires systemic investment.
[56:42 - 58:39] Analyzing the shortcomings of solar, wind, and batteries as mature solutions, and exploring fringe solutions like data center space and space-based solar power.
[58:40 - 01:00:05] Pointing out that hardware will continue to be a bottleneck for the next 5 to 10 years; even with algorithmic breakthroughs like transformers (which improved energy efficiency by 5x over LSTM), the saved compute will immediately be consumed by new demands.
[01:00:06 - 01:03:08] Discussing Google's technical innovations for harmonious coexistence between data centers and local communities, such as choosing waterless cooling designs that sacrifice 10% energy efficiency in water-scarce regions, and utilizing gigawatt-scale demand response to shave peaks and fill valleys for the power grid during peak electricity usage.
[01:03:09 - 01:04:16] Concluding remarks, calling on cloud infrastructure builders to think end-to-end about optimal system scale, energy-efficient delivery, and coexistence with communities.

Key Takeaways List

The true measure of compute capacity is the actual value delivered per dollar (Value per Dollar) or user activity (Daily Active Users), rather than simply gigawatts (Gigawatts) or hardware FLOPs. - Evidence Anchor: [04:55-05:08] - Type: Opinion - Speaker's Reservations: Admits to spending a vast amount of effort on procuring data center power capacity, but still insists that the value metric must come first.
Modern accelerators exhibit extremely high synchrony (Synchronous Computation) in large model training, which causes clusters to regress from loosely coupled fault-tolerant architectures back to tightly coupled supercomputer states where a single point of failure halts the entire system. - Evidence Anchor: [12:29-13:57] - Type: Fact
Internal and external customers of frontier labs are demonstrating a new attitude: to obtain more compute capacity, they are willing to sacrifice a portion of service reliability (accepting 99.9% or even lower availability). - Evidence Anchor: [11:26-12:28] - Type: Opinion
System Balance is key to unleashing compute power. If FLOPs increase but HBM throughput, SRAM cache, and network bandwidth do not scale proportionally, compute power will be wasted waiting for data, resulting in extremely low MFU. - Evidence Anchor: [14:06-15:15] - Type: Opinion
The popularity of sparse computation algorithms like Mixture of Experts (MoE) leaves most current hardware systems that lack matched designs facing severe memory bandwidth shortages. - Evidence Anchor: [16:53-17:34] - Type: Fact
As AI shifts from training-dominated to inference (Serving)-dominated, compute deployment will gradually divert from massive contiguous clusters to small, dispersed, and highly flexibly schedulable small-to-medium sites under 100 megawatts. - Evidence Anchor: [25:05-25:24] - Type: Prediction
Hardware Specialization is the inevitable path to solving the bottleneck of CPU performance scaling. Google's bifurcation into 8i (inference) and 8t (training) TPUs is determined by the differences in memory, network, and compute ratios required by different workloads. - Evidence Anchor: [47:20-48:50] - Type: Opinion
Compute hardware will remain the primary bottleneck for the next 5 to 10 years. Any algorithmic breakthrough in energy efficiency will be rapidly consumed by new, more valuable compute demands due to Jevons paradox. - Evidence Anchor: [58:40-01:00:05] - Type: Prediction
Zero-sum games and "winner-take-all" are limited technical perspectives. A healthy supply chain requires diversity to hedge against concentration risks like geopolitics and earthquakes; component manufacturers do not want a single customer to monopolize their capacity. - Evidence Anchor: [49:37-52:56] - Type: Opinion
Data centers should serve as active assets to communities rather than burdens. This requires data center builders to make trade-off decisions, such as using dry cooling that sacrifices 10% energy efficiency in water-scarce regions, or using Demand Response technology to help public power grids shave peaks and fill valleys.
- Evidence Anchor: [01:00:06-01:03:08]
- Type: Fact

Plain English Retelling

These days, we are often blown away by massive numbers—like "some company just secured 1 gigawatt of power" or "this data center campus cost tens of billions of dollars." But Amin Vahdat pours cold water on this: if your system design is unbalanced, or if it crashes every day, all those gigawatts are just expensive decorations wasting electricity. It's like buying a supercar capable of hundreds of miles per hour, only to get stuck on a narrow, potholed, muddy road where the engine's horsepower (equivalent to FLOPs) can't translate into speed. What you really need to focus on is how much cargo you actually delivered and how much money you made by driving the car out—that is the "actual value delivered per gigawatt."

In the past, internet services (like Google Search) pursued "five nines" of availability (99.999%), which translates to only 30 seconds of downtime a year. To achieve this, we had to build double backups for all power and networking, meaning half of the power supply and equipment sat idle most of the time. But now, when frontier labs train large models, their attitude has completely flipped: give them twice the compute power, and they will gladly sign off on it even if it completely breaks down for a few days a year. Because training models is a "compute-devouring beast," they care far more about training the model as fast as possible than about never going offline.

But this brings an incredibly difficult hard engineering challenge: past internet services were loosely coupled—if one server broke, others filled in, and users never noticed. Today's AI model training, however, is synchronous (all TPUs/GPUs need to exchange parameters frequently and synchronously). This means that if just one of tens of thousands of accelerators drops offline due to a network or cooling issue, the entire training job has to halt and roll back. This demands extremely precise interconnect technology, such as the Optical Circuit Switching (OCS) technology used by Google. Simply put, OCS is like an automated "fiber-optic plug-and-unplug machine." It contains hundreds of micro-mirrors that can rotate in three dimensions. Once a device in a certain rack is found to be faulty, the software can manipulate motors to deflect the mirrors, instantly bypassing the broken node and swapping in a spare, restoring the entire system in seconds.

Another pain point ignored by most is "system balance." Many people buy accelerators looking only at how many TFLOPs they have, but with the rise of sparse computation like Mixture of Experts (MoE), compute itself is actually not the hardest part; the hardest part is how to feed data to that compute. If the HBM (High Bandwidth Memory) isn't fast enough and the network isn't wide enough, the compute will just sit there waiting. This is why current hardware utilization (MFU) is generally miserably low. In the future, hardware "specialization" will become increasingly prominent. This is exactly why Google introduced the 8i TPU specifically for inference and the 8t TPU for training—because inference requires frequently fetching different data, while training requires massive synchronous computation, and the ratios of memory, network, and compute for the two are completely different.

Finally, Amin Vahdat reminds us not to fall into a "do-or-die" zero-sum game mindset. Even with algorithmic breakthroughs (like the transition from LSTM to Transformer, which boosted efficiency by 5x), hardware will never be in surplus. Because humanity's thirst for intelligence is infinite, any saved compute will immediately be filled by newer, more valuable applications (such as more complex agent collaboration). In this process, the real bottleneck is shifting from chip manufacturing to deeper physical constraints—namely, energy. How to efficiently acquire and schedule green energy, and make data centers act as "batteries" for local power grids (actively curtailing power and disconnecting during peak residential usage, and absorbing excess power during off-peak hours), will be the most central infrastructure challenge of the next decade.

Clips Worth Listening Closely

[11:26 - 12:28] Exploring the trade-off between availability and compute capacity. Amin Vahdat mentions that frontier lab customers are willing to accept 3.65 days of annual downtime in exchange for double the capacity. This reveals a major shift in customer value orientation under the AI wave, with a tone of awe toward this new paradigm.
[28:26 - 31:30] Recounting the debate between ethernet and custom networking during the design of Google TPU v2. Amin Vahdat candidly reflects on how he, as a networking expert, determined at the time that ethernet was the only solution, only to be proven wrong by Norm Jouppi and others. This self-correcting line of thought is extremely vivid, showcasing the candor and pragmatism of a top systems architect.
[32:49 - 36:51] A segment describing in detail how Optical Circuit Switching (OCS) works. Amin Vahdat uses intuitive language to explain micro-mirror deflection, MEMS control, and how software-defined topology is achieved. It is highly information-dense and serves as an excellent entry point for understanding the physical layer design of modern AI clusters.
[49:37 - 52:56] An impromptu expression opposing "zero-sum games" and "win-lose mindsets." Both Amin Vahdat and the host correct the concentration risk and zero-sum mindset commonly found in the classroom at the time, demonstrating a grand industrial vision and first-principles thinking of systems scientists.

Resonances with past episodes

Isomorphism→ Frontier Systems and the Future of Voice AI · Mati Staniszewski
The two are highly isomorphic in value measurement, both advocating for breaking away from traditional 'resource- and cost-oriented' thinking (such as compute costs, gigawatts, FLOPs) and shifting toward 'actual value ultimately delivered to the user' as the core standard for measuring system efficiency and commercial success.
This[04:55-05:08] The true measure of compute capacity is the actual value delivered per dollar (Value per Dollar) or user activity (Daily Active Users), rather than simply gigawatts (Gigawatts) or hardware FLOPs.
Related[42:06-42:35] AI product pricing should be completely decoupled from compute costs and reverse-engineered based on the value created for customers. A reasonable pricing model should aim to capture one-tenth of the total economic value created for the customer.
Corroboration→ Energy Bottlenecks and the Physical Breakthrough of Uranium Enrichment · Scott Nolan
Both point out that energy efficiency improvements at the algorithmic level cannot eliminate rigid bottlenecks in the physical world. The former explains through Jevons paradox that efficiency gains actually stimulate larger compute demands, while the latter points out from the perspective of physical limits that energy cost is the ultimate constraint on compute expansion that cannot be erased by algorithms.
This[58:40-01:00:05] Compute hardware will remain the primary bottleneck for the next 5 to 10 years. Any algorithmic breakthrough in energy efficiency will be rapidly consumed by new, more valuable compute demands due to Jevons paradox.
Related[09:11-09:53] Chip manufacturing costs and model training costs will continue to decline with technological iterations, but the electricity consumed to run these massive models is a rigid physical demand that cannot be completely erased by algorithms; all compute competition will ultimately converge on the competition over energy costs.

Tensions with past episodes

ContradictionDirect conflict→ Computing Infrastructure and the Continuous Operation of Intelligence · Jensen Huang
The two have a fundamental disagreement regarding Model FLOPs Utilization (MFU) in system design. The former views low MFU as a warning sign of system imbalance and wasted compute; whereas the latter advocates for actively overprovisioning compute and accepting low MFU, treating redundant compute as a cheap resource to bypass overall system bottlenecks caused by network and memory bandwidth.
This[14:06-15:15] System Balance is key to unleashing compute power. If FLOPs increase but HBM throughput, SRAM cache, and network bandwidth do not scale proportionally, compute power will be wasted waiting for data, resulting in extremely low MFU.
Related[27:11-28:57] Model Flops Utilization (MFU) is a limiting metric that easily biases design; system design requires compute overprovisioning. To bypass Amdahl's Law when dynamic bottlenecks occur in network latency, storage throughput, and memory bandwidth, the system must possess sufficient redundant compute, treating Flops as a cheap resource and sacrificing local utilization to ensure instantaneous high-concurrency throughput for the overall task.
ContrastApparent tension→ System Design of Venture Capital and Paradigm Shifts in the Age of Intelligence · Ben Horowitz
From a macro capital perspective, the investor believes GPU compute can be mindlessly stacked with capital to achieve parallel acceleration; whereas the systems architect points out that the extremely high synchrony of large model training has regressed hardware clusters back to a tightly coupled state, where scaling up brings extremely high risks of single points of failure and system vulnerability, rather than simple linear acceleration through funding.
This[12:29-13:57] Modern accelerators exhibit extremely high synchrony (Synchronous Computation) in large model training, which causes clusters to regress from loosely coupled fault-tolerant architectures back to tightly coupled supercomputer states where a single point of failure halts the entire system.
Related[20:39-21:22] AI is breaking the historical law of software engineering that 'parallel acceleration cannot be achieved through capital and human resources.' In the AI era, sufficient GPUs and data can translate directly into capability breakthroughs, making the scale of funding a core competitive element.

A faithful reconstruction and plain-language retelling of the episode, generated by PodLens.

This is one source-grounded reading, not a replacement for the original. Every point is anchored to its source, so you can check it yourself — and corrections are welcome.