
Weekend Reading | A Comprehensive Overview of Google TPU: Meta’s Embrace, NVIDIA’s Plunge—All Tied to This “Lifesaving Chip”

Tencent Technology ·  Nov 29, 2025 14:38

Source: Tencent Technology
Author: Wu Ji

With its stock price continuing to fall, NVIDIA (NVDA.US) itself had to step forward and declare, 'We are a generation ahead of the entire industry.'

The story begins with Buffett's 'final masterpiece': Berkshire Hathaway's first position in Alphabet (GOOGL.US), Google's parent company. More explosively, reports later emerged that Meta Platforms (META.US), a major NVIDIA customer, is considering deploying Google TPUs in its data centers by 2027 and renting TPU computing power through Google Cloud starting in 2026.

In an urgent statement, NVIDIA emphasized that GPUs 'far outperform' ASICs (Application-Specific Integrated Circuits) in terms of performance, versatility, and portability. It reiterated that self-developed TPUs cannot replace the flexibility of GPUs. A Google spokesperson also stated that the company will continue its partnership with NVIDIA while reaffirming its commitment to supporting both TPUs and NVIDIA GPUs.

Sundar Pichai, CEO of Alphabet and Google

TPU, a 'lifesaving project' initiated a decade ago to address the bottleneck of AI computing efficiency, has now evolved into an 'economic pillar' for Google.

As a representative of self-developed ASIC chips, TPUs already possess the potential to shake NVIDIA's foundation. However, Google's logic is not about competing with NVIDIA on single-card performance; instead, it employs a completely different philosophy of hyperscale systems to redefine the future of AI infrastructure.

It all began 10 years ago, at the moment when TPUs were born.

01. The Evolution of TPUs

TPU v1

Google initiated the TPU project in 2015, not as a demonstration of technical prowess but as a response to the reality that without self-development, the company would struggle to support its future business scale.

As deep learning spread within Google, the engineering team recognized a critical issue looming ahead: core services such as search and advertising handle massive volumes of user requests, and if these were fully transitioned to deep learning models, the power consumption of Google's global data centers would surge to unsustainable levels. Even procuring more GPUs would not meet the demand, and the costs would skyrocket.

At that time, GPUs were better suited for training large-scale neural networks, but their energy efficiency was not optimized for real-time online inference.

There were even predictions within Google: if all core businesses were to adopt deep learning models in the future, the global data center electricity costs would increase tenfold. Senior management realized that continuing to rely on the existing CPU and GPU roadmap was unsustainable.

Therefore, Google decided to develop its own ASIC accelerator. The goal was not to create the 'most powerful general-purpose chip,' but rather a 'highly energy-efficient chip optimized for specific matrix operations that could be deployed en masse in data centers.'

Ultimately, TPU v1 was officially put into use in 2016 to support Google Translate and some search functions, proving the feasibility of the ASIC solution.

When the Transformer paper was published in 2017, Google quickly recognized that the new architecture, with its highly regular computational patterns, extremely high density of matrix operations, and massive parallelism, seemed tailor-made for TPUs. Rather than waiting for external hardware vendors to catch up, Google decided to take full control of the software framework, compiler, chip architecture, network topology, and cooling systems, forming an end-to-end closed loop.

As a result, TPU evolved from being an isolated chip to becoming the foundation of Google's AI infrastructure: it aimed not only to train the world’s strongest models but also to allow AI to permeate every product line within the company at minimal cost.

Starting with v2 and v3, Google gradually made TPUs available to Google Cloud customers, marking the official entry into the commercialization phase.

Although early ecosystem development and compatibility still lagged behind GPUs, Google carved out a differentiated path through the use of the XLA compiler, efficient Pod architecture, liquid-cooled data centers, and deep co-design of hardware and software.

In 2021, TPU v4 was unveiled, the first to connect 4,096 chips into a super node, using a custom-designed torus interconnect (a 2D/3D ring-style topology) to achieve near-lossless cross-chip communication. This system enabled thousands of accelerators to work together like one 'giant chip,' directly propelling Google into the era of ultra-large-scale AI. The PaLM 540B model was trained on v4 Pods.

Google demonstrated through action that as long as cluster scale is sufficiently large and interconnection efficiency is high enough, model performance will grow nearly linearly with computational volume. TPU’s network topology and scheduling system are the most critical hardware supports enabling this pattern.

From 2023 to 2024, TPU v5p became a turning point.

For the first time, a TPU was deployed at scale into Google's advertising systems, core search ranking algorithms, YouTube recommendations, real-time map predictions, and other revenue-generating product lines. v5p delivers twice the performance of v4 and introduces a flexible node architecture that allows enterprise customers to scale up to nearly 9,000 chips on demand.

Leading model companies such as Meta and Anthropic have begun serious evaluations and procurement of TPU v5p, marking the transition of TPU from an 'internal secret weapon' to a viable 'ecosystem option.'

The sixth-generation TPU v6 (codenamed Trillium), released in 2024, makes Google's stance clear: its future focus is no longer on training but on inference. Inference costs are becoming the largest single expense for global AI companies. The v6 is redesigned from architecture to instruction set specifically for inference workloads, with a dramatic increase in FP8 throughput, doubled on-chip SRAM capacity, deeply optimized KV cache access patterns, significantly higher inter-chip bandwidth, and a 67% improvement in energy efficiency over the previous generation.

Google has publicly stated that the goal of this generation of TPU is to become the 'most cost-effective commercial engine of the inference era.'

From being forced to develop in-house in 2015 to address the efficiency bottlenecks of AI computing, to deploying TPU in customer-owned data centers by 2025, Google has spent a decade transforming what was once a 'life-saving necessity' into a strategic-level weapon capable of challenging NVIDIA’s dominance.

TPU was never about competing on benchmark numbers; it was about enabling AI to run efficiently and generate tangible profits. This distinctive approach is what sets Google apart and also makes it formidable.

02. From 'Experimental Project' to 'Data Center Lifeline'

TPU v7, codenamed Ironwood

In 2025, Google’s seventh-generation TPU (TPU v7, codenamed Ironwood) became the most anticipated hardware product in the global AI infrastructure sector.

This generation represents a complete overhaul in architecture, scale, reliability, networking, and software systems.

The advent of Ironwood formally marks TPU's transition from the 'era of catching up' to the 'era of going on offense,' and signals Google's view that inference will be the main battleground of the next decade.

What makes Ironwood special is that it is the first dedicated inference chip in the history of TPU. Unlike the previous v5p, which focused on training, and v6e, which emphasized energy efficiency, Ironwood was designed from day one for the ultimate scenario of ultra-large-scale online inference and, for the first time, directly competes with NVIDIA’s Blackwell series on several key metrics.

At the single-chip level, Ironwood's FP8 dense computing power reaches 4.6 petaFLOPS, slightly higher than NVIDIA B200's 4.5 petaFLOPS, placing it among the top tier of global flagship accelerators. Its memory configuration is 192GB HBM3e, with a bandwidth of 7.4 TB/s, just a step away from B200’s 192GB/8 TB/s. The inter-chip communication bandwidth is 9.6 Tbps, which may be numerically lower than Blackwell’s 14.4 Tbps, but Google follows a completely different system-level approach, making simple numerical comparisons irrelevant.

What truly makes Ironwood a milestone is its ultra-large-scale expansion capability.

An Ironwood Pod can integrate 9,216 chips, forming a super node with FP8 peak performance exceeding 42.5 exaFLOPS. Google noted in its technical documentation that, under specific FP8 workloads, the performance of this Pod is equivalent to 118 times that of the closest competing system. This is not a gap at the single-chip level but rather a crushing advantage in system architecture and topology design.
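
As a sanity check on those headline figures, the quick calculation below, which uses only the per-chip and per-pod numbers quoted in this article, reproduces the pod-level total.

```python
# Back-of-the-envelope check of the Ironwood Pod figures quoted above.
# Assumes the article's numbers: 4.6 petaFLOPS FP8 per chip, 9,216 chips per pod.

CHIPS_PER_POD = 9_216
FP8_PFLOPS_PER_CHIP = 4.6  # peak dense FP8, petaFLOPS

pod_exaflops = CHIPS_PER_POD * FP8_PFLOPS_PER_CHIP / 1_000  # 1 exaFLOPS = 1,000 petaFLOPS
print(f"Pod peak FP8: ~{pod_exaflops:.1f} exaFLOPS")        # ~42.4, consistent with the ~42.5 quoted
```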

The core supporting this scale is Google's decade-long refinement of a 2D/3D torus topology combined with Optical Circuit Switching (OCS) network.

Unlike NVIDIA's NVL72 (which connects 72 GPUs) built with NVLink and high-end switches, Google has abandoned the traditional switch-centric design altogether, opting instead to connect all chips directly through a three-dimensional torus topology and to reconfigure optical paths dynamically via OCS.
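
For readers unfamiliar with a torus interconnect, the sketch below is purely illustrative (the dimensions and addressing are hypothetical, not Google's actual pod geometry): in a 3D torus every chip has exactly six directly wired neighbors, and coordinates wrap around at the edges instead of terminating at a switch.

```python
# Illustrative only: neighbor computation in a 3D torus, where each chip is
# addressed by (x, y, z) coordinates and edges wrap around (hence "torus").
# The dimensions below are hypothetical, not Ironwood's actual layout.

def torus_neighbors(coord, dims):
    """Return the six direct neighbors of `coord` in a 3D torus of size `dims`."""
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

# A chip on the "edge" still has six neighbors because the topology wraps around.
print(torus_neighbors((0, 0, 0), dims=(16, 16, 36)))
```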

OCS is essentially an 'optical version of a manual telephone switchboard,' utilizing MEMS micro-mirrors to complete physical switching of optical signals at the millisecond level, introducing almost no additional latency. More importantly, when a chip failure occurs within the cluster, OCS can instantly bypass the faulty point, ensuring uninterrupted operation of the entire computing domain.

Thanks to this, the annual availability of Google's liquid-cooled Ironwood system reaches 99.999%, meaning less than six minutes of downtime per year. For an ultra-large-scale AI cluster, that figure is formidable, far surpassing the levels typical of GPU-based training clusters in the industry.

Google has fully upgraded its TPU clusters from 'experimental toys' to the 'lifeblood of data centers.'

In inference scenarios, Ironwood's system-level advantage is overwhelming. The entire node provides 1.77 PB of high-bandwidth HBM that all chips can access at nearly uniform distance, which is crucial for KV cache management. In the era of inference, the most expensive resource is not compute but memory bandwidth and cache hit rates; Ironwood significantly reduces redundant computation through its shared pool of high-speed memory and extremely low communication overhead.
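
A rough calculation, using the article's pod figures plus hypothetical model dimensions, illustrates why a pooled 1.77 PB of HBM matters when KV caches for long-context requests run into the tens of gigabytes each.

```python
# Why pooled HBM matters for KV caches: rough, illustrative arithmetic.
# Pod memory uses the article's figures; the model dimensions below are
# hypothetical placeholders, not any specific Google model.

CHIPS_PER_POD = 9_216
HBM_GB_PER_CHIP = 192
pod_hbm_pb = CHIPS_PER_POD * HBM_GB_PER_CHIP / 1e6          # GB -> PB (decimal)
print(f"Aggregate HBM: ~{pod_hbm_pb:.2f} PB")               # ~1.77 PB, matching the figure above

# Hypothetical transformer serving one long-context request:
layers, kv_heads, head_dim, seq_len = 80, 8, 128, 128_000
bytes_per_value = 1                                          # FP8 keys and values
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value  # keys + values
print(f"KV cache per request: ~{kv_bytes / 1e9:.1f} GB")     # ~21 GB for this made-up config
```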

Internal tests show that under equivalent workloads, Ironwood’s inference cost is 30%-40% lower than GPU flagship systems, with even greater reductions in extreme scenarios.

The software layer is equally robust. The MaxText framework fully supports the latest training and inference technologies, GKE topology-aware scheduling intelligently assigns tasks based on real-time Pod status, and the inference gateway supports prefix cache-aware routing. After comprehensive optimization, first-token latency decreased by up to 96%, and overall inference costs were further reduced by 30%.
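
'Prefix cache-aware routing' here means steering requests that share a common prompt prefix to the same replica so that its cached KV state can be reused. A minimal sketch of that idea, not Google's actual inference gateway, might look like this:

```python
# Minimal sketch of prefix cache-aware routing: requests sharing a prompt
# prefix are sent to the same replica so its KV cache can be reused.
# This illustrates the idea only; it is not Google's gateway implementation.

import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2", "replica-3"]
PREFIX_CHARS = 512  # route on the first N characters of the prompt

def route(prompt: str) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

# Two requests with the same long system prompt land on the same replica,
# so the second one can hit the cached KV state for that prefix.
shared = "You are a helpful assistant. " * 20
print(route(shared + "Summarize this contract."))
print(route(shared + "Translate this paragraph."))
```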

Ironwood not only propels the Gemini series to maintain its leadership but also directly stimulates the external ecosystem.

Anthropic announced that future training and deployment of the Claude series will utilize up to one million TPUs. Even players with alternative options like AWS Trainium cannot ignore Ironwood’s generational advantage in ultra-large-scale inference.

03. Google, NVIDIA, and Amazon Stand at the 'Crossroads'

CNBC, after analyzing the three major players in the AI chip field—Google, NVIDIA, and Amazon—noted that all three are heavily investing in R&D, but their goals, business models, ecosystem-building approaches, and hardware philosophies differ significantly.

These differences profoundly influence chip design, performance priorities, customer adoption pathways, and market positioning.

NVIDIA’s strategy has consistently revolved around GPUs, whose core value lies in versatility.

GPUs boast massive parallel computing units capable of supporting various workloads, from deep learning to graphics rendering and scientific computing. More importantly, the CUDA ecosystem has virtually locked in the development paths of the entire industry; once a model or framework is optimized for CUDA, switching to another chip architecture becomes exceedingly difficult.

NVIDIA has achieved a monopolistic capability akin to Apple's ecosystem in the consumer market through deep software-hardware bundling, but the shortcomings of GPUs are also quite evident.

Firstly, GPUs are not optimized for inference; they are designed for high-speed parallel computing rather than executing repetitive inference instructions at the lowest cost. Secondly, the flexibility of GPUs means their hardware resources may not be optimally configured for real-world inference scenarios, leading to lower efficiency per unit of energy consumption compared to ASICs. Lastly, NVIDIA holds significant pricing power, often requiring cloud providers to purchase GPUs at prices far above manufacturing costs, resulting in what is now widely known as the 'NVIDIA tax.'

Google’s approach differs from NVIDIA’s. Google does not pursue hardware versatility but instead focuses on achieving ultimate efficiency for deep learning, particularly Transformer workloads. The core of TPU is the systolic array, an architecture specifically designed for matrix multiplication, making it exceptionally efficient in deep learning computations.
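
As a concrete illustration, the sketch below (assuming JAX is installed; it falls back to CPU or GPU when no TPU is attached) JIT-compiles a matrix multiplication through XLA, the compiler path that, on TPU hardware, maps such operations onto the systolic arrays.

```python
# A matmul compiled through XLA -- the compiler path that, on TPU hardware,
# lowers this operation onto the systolic array (MXU). Runs on CPU/GPU if no
# TPU is attached; this is an illustration, not a performance benchmark.

import jax
import jax.numpy as jnp

@jax.jit
def dense_layer(x, w):
    # One matrix multiplication: exactly the regular, high-density compute
    # pattern that systolic arrays are built for.
    return jnp.dot(x, w)

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(k1, (1024, 4096), dtype=jnp.bfloat16)
w = jax.random.normal(k2, (4096, 4096), dtype=jnp.bfloat16)

print(jax.devices())            # lists TPU devices when running on a TPU host
print(dense_layer(x, w).shape)  # (1024, 4096)
```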

Google does not aim for TPU to become a general-purpose industry chip but rather to be the most efficient dedicated chip for global AI inference and training, thereby enabling Google's entire AI system to achieve leading performance, the lowest cost, and the widest deployment.

Google's core advantage lies in its full-stack integration capabilities. They control not only the chips but also the models, frameworks, compilers, distributed training systems, and data center infrastructure. This allows Google to implement many system-level optimizations that GPUs cannot achieve.

For instance, the data center network topology is entirely designed to serve TPU supernodes, and the scheduling system at the software level can automatically adjust hardware resource usage based on model characteristics. This 'system-level integration' is something NVIDIA cannot achieve because NVIDIA controls only the GPU, not the customer’s data centers.

Amazon, on the other hand, has taken a third route. The starting point of their chip strategy is to reduce AWS infrastructure costs while minimizing reliance on external suppliers, particularly NVIDIA, leading them to develop Trainium and Inferentia.

As a cloud provider, AWS prioritizes scale effects and economic efficiency rather than building a unified AI computing system like Google.

Trainium’s design is more flexible, approaching the adaptability of GPUs in many cases, with performance optimized for both training and inference. Inferentia, meanwhile, focuses on inference, suitable for high-throughput deployment scenarios. Amazon reduces internal costs through these chips and passes some savings on to customers, enhancing AWS's competitiveness.

In summary, NVIDIA’s approach emphasizes generality, ecosystem-driven strategies, and software lock-in; Google’s approach focuses on specialization, vertical integration, and system unification; while Amazon’s approach is centered on cost optimization, cloud-driven strategies, and compatibility with commercial needs. These differing approaches have led to markedly different product forms, business strategies, and competitive landscapes in the AI chip market.

04. Leverage TPU to Eliminate the Expensive 'CUDA Tax'

The key to Google's significant advantage in the era of inference lies not only in the hardware performance of TPUs but also in its full-stack vertical integration strategy.

This strategy allows Google to avoid the expensive 'CUDA tax' and creates a substantial cost structure advantage over OpenAI and other companies reliant on GPUs.

The so-called CUDA tax refers to the high profit margins layered onto GPU chips during the production-to-sales process.

The cost of NVIDIA’s GPUs is typically only a few thousand dollars, but when sold to cloud providers, prices often reach tens of thousands of dollars, with gross margins exceeding 80%. Nearly all global technology companies training large models are compelled to bear this cost and cannot escape it.

OpenAI relies on NVIDIA GPUs for both training and inference. Given the massive parameter scale and extensive inference requirements of GPT-series models, its total computing expenditure far exceeds the total revenue of most companies.

NVIDIA’s pricing model makes it difficult for these companies to achieve scalable commercial profits, regardless of how much they optimize their models.

Google’s approach is fundamentally different. By using its self-developed TPUs for training and inference, Google controls the entire supply chain—from chip design to manufacturing, network solutions, software stacks, and data center layouts—all optimized internally by Google.

Since there is no need to pay the NVIDIA tax, Google’s computing cost structure inherently holds an advantage over OpenAI.

Not only does Google enjoy low costs internally, but it also extends this cost advantage to its Google Cloud customers. Through TPU services, Google can offer clients lower-cost inference capabilities, thereby attracting numerous model companies and enterprises to migrate to the Google platform.

According to a report by tech website venturebeat.com, Google holds a significant structural advantage over OpenAI in terms of computing cost. This implies that when providing equivalent inference services, Google’s underlying costs may only be 20% of its competitor’s. Such a substantial cost differential is decisive in the era of inference.

When a company’s inference costs constitute the majority of its expenditures, migrating to the lowest-cost platform becomes an inevitable choice. For instance, a company may spend tens of millions or even hundreds of millions of dollars annually on inference. If migrating to TPU can save 30% to 50% of costs, such migration would almost certainly become an unavoidable business decision.
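
With hypothetical numbers plugged in, the migration logic looks like this:

```python
# Purely hypothetical numbers to illustrate the migration logic above.
annual_inference_spend = 100_000_000   # assume $100M/year spent on inference
savings_rate = 0.40                    # midpoint of the 30%-50% range cited

annual_savings = annual_inference_spend * savings_rate
print(f"Annual savings: ${annual_savings:,.0f}")   # $40,000,000
# Against recurring savings of that size, one-off migration costs
# (porting to XLA, revalidating models) amortize quickly.
```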

Google has also launched the TPU@Premises program, deploying TPUs directly within enterprise data centers, enabling customers to utilize inference capabilities locally with minimal latency. This further reinforces Google's cost advantage and expands the commercial reach of TPU.

In OpenAI’s business model, its most critical cost stems from computing power, whereas in Google’s business model, computing costs are integrated into its self-developed product ecosystem and can be recouped through Google Cloud. The deep integration of Google’s hardware, software, networking, and cloud infrastructure grants it true vertical integration capabilities.

This integration does not merely reduce costs but drives a reconfiguration of the entire ecosystem.

As more enterprises recognize the importance of inference costs, Google’s cost advantage will continue to be magnified, and TPU’s market share will grow more rapidly in the era of inference. The vertical integration strategy of TPU will ultimately serve not only as Google’s competitive approach but also as a force reshaping the competitive order of the entire industry.

05. Google's 'Economic Pillar'

Reviewing the development history of TPU reveals a typical trajectory from catching up to taking the lead.

In its early stages, TPU lagged behind GPUs in terms of ecosystem maturity, compatibility, and training performance. It was widely believed that Google had been surpassed by OpenAI in the era of large AI models. However, this external perception overlooked Google’s deep accumulation at the infrastructure level and its unique advantages in full-stack systems.

With the generational upgrades of the Gemini series models, Google has gradually proven itself as one of the few companies globally capable of achieving training stability, inference cost control, and full-stack performance optimization, with TPU playing a key role in this process.

Both the training and inference of the Gemini 2.0 multimodal model were completed on TPUs, whose efficiency lets Google train large-scale models at relatively low cost and iterate on shorter cycles.

As the industry enters the era of inference, the role of TPUs has shifted from supporting Google's internal models to serving global enterprise customers. This shift has significantly increased AI-related revenue for Google Cloud, whose financial reports show annualized revenue of $44 billion, making the cloud division a key driver of Google's overall performance growth.

Google has long lagged behind AWS and Azure in the cloud market competition, but a new track has emerged in the AI era, allowing Google to take the lead in AI infrastructure. This leadership is not accidental but a natural outcome of years of accumulation with TPUs.

Against the backdrop of accelerated enterprise adoption of AI, an increasing number of companies require model deployment solutions that offer low inference costs, high stability, and strong performance. While GPUs are powerful, they face limitations in cost and supply, whereas TPUs provide a more economical and stable alternative. Particularly in large-scale online inference scenarios, the advantages of TPUs are especially prominent.

More importantly, Google does not rely solely on its chips as a selling point but attracts enterprises with comprehensive solutions.

For example, Google provides an integrated system covering model training, model monitoring, vector databases, inference services, and data security, with TPUs functioning as the foundational infrastructure. Google positions itself as a complete platform for enterprise AI adoption, giving it a new competitive edge over AWS and Azure through differentiated offerings.

In the coming years, competition in the AI industry will shift from model capabilities to cost efficiency, from training capacity to inference scale, and from ecosystem building to infrastructure integration. With TPUs, a global data center footprint, a steady upgrade cycle, and full-stack capabilities, Google is well-positioned to build stronger competitive barriers in this new phase than it has in the past decade.

Google’s transition from a follower to a leader did not happen overnight but was the result of a decade-long commitment to infrastructure investment, in-house R&D, and continuous adjustments to its model strategy. TPUs represent Google’s most enduring, profound, and strategically significant asset in the AI era, and this asset is now becoming the main engine driving Google’s market capitalization growth, the rise of its cloud business, and the reshaping of its AI business model.

Editor: rice


