Bottlenecks that once hid behind peak FLOP charts have begun showing up in the places that matter most: latency-bound inference paths, goodput on sprawling training jobs, and the hard ceilings of data center power. That shift sets the stage for a deliberate split in silicon designed to tame the opposing forces of scale and speed. Google's eighth-generation Tensor Processing Units arrive as a matched pair: TPU 8t for training and TPU 8i for inference, each built to the grain of today's models and the orchestration patterns of agent-based computing. Rather than chase a single do-it-all part, the company ties chip choices to system architecture, co-developed software, and facility constraints, arguing that specialization now pays back more than all-purpose design. The result is a platform pitched not as a chip drop but as an end-to-end stack, where interconnects, memory, cooling, and frameworks move as one to accelerate development cycles and lower the cost of serving complex reasoning workloads.
The Case for Split TPUs
The rationale for bifurcation traces directly to two compounding forces: production-scale inference for top-tier models and the ascent of agentic workflows that multiply low-latency interactions into constant back-and-forth exchanges. Training and serving no longer strain the same levers. Training demands overwhelming compute density, high-bandwidth synchronization across thousands of dies, and a data path that can sustain terabytes per second without stalls. Inference, by contrast, punishes cache misses, collective-operation delays, and network diameter more than raw peak throughput. Against that backdrop, a single general-purpose chip would either waste silicon on the wrong bottlenecks or drag critical pipelines through compromises. By drawing a bright line between 8t and 8i, Google gives each workload its own lane.
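A back-of-the-envelope arithmetic-intensity check makes the asymmetry concrete; the model size, batch size, and hardware ratios below are illustrative assumptions for the sake of the arithmetic, not figures from the announcement.

```python
# Illustrative sketch: why small-batch autoregressive decode tends to be
# memory-bound. Every number here is an assumption for the arithmetic only.
params = 70e9              # assumed dense model size (parameters)
bytes_per_param = 2        # bf16 weights
batch = 8                  # concurrent decode requests per chip

# Per decode step, every weight is read once; each parameter contributes
# roughly 2 FLOPs (multiply + add) per sequence in the batch.
bytes_moved = params * bytes_per_param
flops = 2 * params * batch
intensity = flops / bytes_moved
print(f"arithmetic intensity ~ {intensity:.0f} FLOPs/byte")        # ~8

# An accelerator with an assumed 1,000 TFLOP/s of compute and 3 TB/s of HBM
# bandwidth only becomes compute-bound above ~333 FLOPs/byte, so decode
# latency is set by memory, caches, and collectives, not by peak FLOPs.
peak_flops = 1000e12
hbm_bw = 3e12
print(f"compute-bound threshold ~ {peak_flops / hbm_bw:.0f} FLOPs/byte")
```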
This separation also reflects a broader shift in where real performance lives: inside software-defined communication patterns and between chips rather than solely within them. Reasoning-heavy models, mixtures of experts, and multi-agent orchestrators hammer interconnects, push key-value caches to the brink, and expose any mismatch between host CPUs and accelerators. The split design tackles those mismatches at the root. TPU 8t raises compute throughput and scale-up bandwidth to pull frontier model training from months toward weeks, while TPU 8i shortens every path that can stall inference, with on-chip SRAM for working sets, low-hop topologies for collectives, and host configurations tuned for rapid, concurrent requests. Specialization here is not an aesthetic choice; it is a response to measurable inefficiencies that compound at scale.
TPU 8t: Built for Frontier Training
TPU 8t centers on one principle: keep massive arrays working in concert without starving cores or fragmenting memory. Per pod, it delivers roughly three times the compute performance of the prior generation, and it scales a single superpod to 9,600 chips with two petabytes of shared high-bandwidth memory. Interchip bandwidth doubles, enabling 121 exaflops of aggregate compute to act against a unified memory pool rather than sharded islands. That scale aims squarely at trillion-parameter training regimes, where parameter exchange, gradient aggregation, and optimizer state handling dominate timelines. By keeping parameters and activations in a giant pool, 8t trims the painful tail latencies that emerge when working sets drift across boundaries.
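Dividing the stated pod-level figures back down to a single chip gives a sense of scale; the per-chip numbers below are derived from those totals (assuming decimal prefixes and an even split across chips), not published per-chip specs.

```python
# Derived per-chip estimates from the stated superpod figures.
chips = 9_600
shared_hbm_pb = 2              # petabytes of shared HBM per superpod
aggregate_exaflops = 121       # aggregate compute per superpod

hbm_per_chip_gb = shared_hbm_pb * 1e6 / chips          # PB -> GB
flops_per_chip_pf = aggregate_exaflops * 1e3 / chips   # EF -> PF

print(f"~{hbm_per_chip_gb:.0f} GB of HBM per chip")     # ~208 GB
print(f"~{flops_per_chip_pf:.1f} PFLOP/s per chip")     # ~12.6 PFLOP/s
```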
Feeding those arrays is equally important, so TPU 8t brings 10x faster storage access and TPUDirect paths that push data straight into the accelerator, shrinking CPU mediation and queueing. On the fabric, the new Virgo Network and tight integration with JAX and Pathways target near-linear scaling, not just across pods but toward a single logical cluster that could span a million chips. The surrounding Reliability, Availability, and Serviceability stack then guards productive time: real-time telemetry across tens of thousands of dies, dynamic rerouting around faulty links without job interruption, and Optical Circuit Switching that reconfigures paths automatically. The company set more than 97% goodput as a design target, recognizing that at frontier scales a lost percentage point can translate into days of retraining, costly checkpoint churn, and idle engineers.
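At the framework level, that scaling story is expressed through JAX's sharding APIs; the sketch below shows the general pattern of laying an array out over a logical device mesh and letting the compiler insert collectives. The mesh shape and array sizes are chosen arbitrarily for illustration and say nothing about Pathways or Virgo internals.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Illustrative JAX sharding sketch: a one-dimensional "data" mesh over all
# visible devices. Real runs would size mesh axes to the pod topology.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

batch_sharding = NamedSharding(mesh, P("data"))  # shard the batch dimension
replicated = NamedSharding(mesh, P())            # replicate small weights

x = jax.device_put(jnp.ones((len(devices) * 8, 512)), batch_sharding)
w = jax.device_put(jnp.ones((512, 512)), replicated)

@jax.jit
def layer(x, w):
    # The compiler derives any needed collectives from the shardings alone.
    return jnp.tanh(x @ w)

print(layer(x, w).sharding)  # output inherits the batch sharding
```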
TPU 8i: Inference at Agent Speed
Inference in the agent era is a different game, dominated by the memory wall, cache locality, and network diameter. TPU 8i answers with 288 GB of HBM paired with 384 MB of on-chip SRAM, tripling SRAM over the last generation to hold the hot working set, including key-value caches used by reasoning models, directly on die. That choice pursues a simple goal: turn cache fetches into near-instant hits, prevent bubbles in the execution pipeline, and cut the worst-case latency spikes that derail interactive experiences and multi-step planning. The chip's memory hierarchy was sized against real production footprints, not theoretical maxima, to minimize costly evictions during long reasoning chains.
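A rough key-value cache calculation shows why a few hundred megabytes of on-die SRAM is a meaningful number; the layer count, head sizes, and data type below are assumptions for illustration, not the configuration of any particular model.

```python
# Rough KV-cache footprint per token for an assumed transformer configuration.
layers = 60
kv_heads = 8          # grouped-query attention (assumed)
head_dim = 128
dtype_bytes = 2       # bf16

# Keys and values each contribute kv_heads * head_dim per layer per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"~{bytes_per_token / 1024:.0f} KiB of KV cache per token")   # ~240 KiB

sram_bytes = 384 * 1024 * 1024
print(f"~{sram_bytes // bytes_per_token} tokens of hot context fit in 384 MB")
```

Under those assumptions the hottest slice of context plausibly stays on die while longer histories spill to the 288 GB of HBM, which is exactly the working-set split described above.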
System design carries the rest of the load. Physical CPU hosts per server doubled, and the platform shifted to Axion ARM-based CPUs with non-uniform memory access (NUMA) isolation to match host resources with accelerator needs. Interconnect bandwidth for Mixture-of-Experts traffic climbed to 19.2 Tb/s, while a Boardfly topology reduced the network diameter by more than half, collapsing the number of hops between endpoints that must coordinate frequently. An on-chip Collectives Acceleration Engine offloads and speeds global operations, cutting their on-chip latency by up to 5x. Together, these moves chase a specific outcome: inference pipelines that stay fully engaged even as agents branch, consult experts, and rejoin. The math follows from field results the company shared: about 80% better performance per dollar over the prior generation, making a direct case for serving nearly double the traffic under the same cost envelope.
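From the framework's point of view, the global operations the Collectives Acceleration Engine targets are ordinary collectives; the sketch below shows a plain all-reduce of the sort such an engine would offload, without modeling the Boardfly topology or the engine itself.

```python
import functools
import jax
import jax.numpy as jnp

# Illustrative all-reduce: each device holds a local shard of expert outputs
# and psum combines them across the named axis. Any hardware offload of this
# collective is transparent to the program.
@functools.partial(jax.pmap, axis_name="devices")
def combine(local_expert_out):
    return jax.lax.psum(local_expert_out, axis_name="devices")

n = jax.local_device_count()
shards = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)
print(combine(shards)[0])  # every device ends up with the same reduced vector
```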
Co-Design from Silicon to Software
The throughline for both chips is co-design, not as a slogan but as an engineering loop: model behavior guides silicon, silicon choices shape topology, and topology rearranges software abstractions. Boardfly emerged from measuring communication patterns in contemporary reasoning and MoE models, then drawing a fabric that shortens the exact paths those patterns use most. TPU 8i's SRAM target stemmed from production key-value cache sizes, making the die large where it mattered and restrained where it did not. On the training side, Virgo's bandwidth aligns with parallelism strategies that have proven effective for models at the trillion-parameter threshold, avoiding overprovisioning where collective calls taper and bulking up where they spike.
Host alignment closes the loop. For the first time, both chips ride on Axion ARM-based CPU hosts, an architectural consolidation meant to eliminate mismatches between host scheduling, memory behavior, and accelerator queues. That unified base simplifies orchestration layers and reduces the tax of translating between heterogeneous servers. The software stack then tries to make the hardware’s advantages obvious: JAX for high-performance array programming, MaxText for reference large-model training runs, PyTorch for broad developer reach, and inference engines such as SGLang and vLLM for token-efficient serving. Bare metal access acknowledges that some teams need precise control to squeeze latency out of every rung in the stack, while open-source artifacts like MaxText and Tunix for reinforcement learning offer a tested path from experimentation to scale without rewriting for production.
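On the serving side of that stack, the offline vLLM interface gives a feel for what token-efficient serving looks like in code; the model name below is a placeholder, and backend support for any given checkpoint depends on the deployment.

```python
from vllm import LLM, SamplingParams

# Minimal vLLM sketch; "your-org/your-model" is a placeholder checkpoint,
# not an artifact named in the announcement.
llm = LLM(model="your-org/your-model")
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [
    "Summarize the trade-offs between training and inference accelerators.",
    "Plan the next three steps for a multi-agent research task.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```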
Efficiency, Operations, and Platform Integration
Power has become the binding constraint for many data centers, so the 8-series treats efficiency as a primary metric alongside throughput. Against the previous Ironwood generation, both chips target up to 2x better performance-per-watt, achieved by integrating network connectivity with compute on the die to cut the energy cost of data movement. Integrated power management modulates draw against real-time load, avoiding the waste of running flat out during I/O-bound phases or while waiting on collectives. At the system boundary, fourth-generation liquid cooling supports densities that air cannot sustain over long runs, keeping temperatures, and therefore clock speeds, in their optimal bands. The company reported six times more compute delivered per unit of electricity compared with five years earlier, attributing the gain to the cumulative effect of these design choices.
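Two quick calculations put those efficiency figures in perspective; the facility power budget in the second is a placeholder for a team's own number, and the annualized rate is simply the geometric mean implied by the stated five-year claim.

```python
# 1) Six times more compute per unit of electricity over five years implies
#    an average compounded improvement per year.
annual = 6 ** (1 / 5)
print(f"~{annual:.2f}x per year on average")             # ~1.43x

# 2) Under a fixed facility power budget, a 2x performance-per-watt gain
#    reads directly as a 2x capacity multiplier (10 MW is a placeholder).
budget_mw = 10.0
print(f"{budget_mw:.0f} MW now hosts ~2x the delivered compute, "
      f"or the same compute at ~{budget_mw / 2:.0f} MW")
```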
Operations tighten the picture. Goodput, not just peak, governs whether large training runs finish on time, so the RAS stack stretches from silicon to software to steer around failures in flight. Telemetry streams continuously across tens of thousands of components, allowing automated systems to detect, isolate, and route around faults before users see a job restart. Topology-aware schedulers place workloads where network distance and memory locality match the model's communication graph. All of this flows into a single delivery vehicle: general availability slated for later this year within the AI Hypercomputer stack, where compute, storage paths, and networking are pre-aligned with the new chips. Early deployment by Citadel Securities serves as evidence that external workloads with demanding latency and throughput profiles are already shaking down the platform at scale.
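The goodput framing is easy to make concrete; the run length below is an assumed figure used only to show the sensitivity, not a disclosed training schedule.

```python
# Wall-clock sensitivity to goodput for a hypothetical long training run.
ideal_days = 90   # assumed compute-limited run length at 100% goodput

for goodput in (0.99, 0.97, 0.95, 0.90):
    wall_clock = ideal_days / goodput
    print(f"goodput {goodput:.0%}: {wall_clock:.1f} days "
          f"({wall_clock - ideal_days:.1f} days lost to non-productive time)")
```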
What Teams Should Do Now
The path forward favors preparation over wait-and-see. Teams working on frontier training should map parallelism strategies and optimizer state layouts against TPU 8t's shared HBM and Virgo fabric, then pilot MaxText-based runs to validate step-time and gradient-accumulation choices before committing at full scale. Data ingress plans should be revisited to exploit the 10x faster storage paths and TPUDirect, and telemetry integration should be wired into existing observability stacks so goodput metrics can guide topology-aware placement during ramp. Inference builders should measure SRAM footprints for key-value caches precisely and tune them to fit 8i's on-die capacity, and should profile collective-heavy stages so they benefit from the Collectives Acceleration Engine and the Boardfly topology's reduced diameter.
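For the parallelism-mapping exercise, a first-order memory budget is a reasonable starting point; the per-parameter byte counts below reflect a common mixed-precision Adam recipe and are assumptions, not a description of how any Google model is trained.

```python
# First-order memory budget for trillion-parameter training with Adam-style
# optimizer state in mixed precision. Byte counts per parameter are assumed.
params = 1.0e12
bytes_per_param = {
    "bf16 weights": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

total_tb = params * sum(bytes_per_param.values()) / 1e12
pod_hbm_tb = 2_000   # 2 PB of shared HBM, expressed in TB

print(f"weights + optimizer state: ~{total_tb:.0f} TB")
print(f"fraction of shared pod HBM: {total_tb / pod_hbm_tb:.1%}")
# Gradients, activations, and parallelism-induced replication take the rest,
# which is what the layout-mapping exercise is meant to surface early.
```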
Procurement and facilities leads should treat efficiency improvements as capacity multipliers rather than marginal gains. Power budgets can be reforecast using the stated 2x performance-per-watt targets, and cooling designs should be cross-checked against fourth-generation liquid cooling requirements to avoid last-minute retrofits that erode ROI. Software leaders planning migrations should lock in Axion-based host baselines, standardize on JAX, PyTorch, SGLang, or vLLM where appropriate, and determine which services warrant bare metal for strict latency SLOs. Finally, organizations aiming at agent-based systems should stage rollouts that couple TPU 8i-backed serving with shorter TPU 8t-driven training cycles, so iterative tool use, expert routing, and planning loops can be refined weekly rather than quarterly, aligning engineering cadence with the hardware's intended strengths.
