A developer’s guide to training with Ironwood TPUs

The transition toward trillion-parameter AI models has created exponential demand for computational resources, testing the limits of traditional infrastructure. That’s why we built Ironwood, our seventh-generation TPU.

But having the hardware is only half of what you need. You also need the right software optimization to extract every ounce of performance. In this edition, we hear from Lillian Yu, CPA, CA, Product Strategy and Operation, and Liat Berry, Product Manager, on five strategies within the JAX and MaxText ecosystems designed to help developers refine training efficiency and hit peak performance on Ironwood hardware.

5 ways to optimize your training on Ironwood

1. Double your training throughput with native FP8 on Ironwood

Ironwood is our first TPU generation to offer native 8-bit floating point (FP8) support, allowing you to theoretically double your model training throughput compared to BF16. You can train massive AI models much faster and more efficiently without sacrificing model quality. Instead of dealing with complex workarounds, your team can easily apply these FP8 training recipes using existing tools.

How to get started:

  • Use the Qwix library to implement your FP8 training recipes.
  • Simply specify the relevant flags within your MaxText configuration to apply FP8 precision for weights, activations, and gradients.
  • Read our full guide, Inside the optimization of FP8 training on Ironwood, in the Google Developer forums for more details.
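To make the numerics concrete, here is a minimal JAX sketch of per-tensor scaled FP8 quantization around a matmul. It is not the Qwix or MaxText recipe, whose APIs and flags are not shown in this article. The helper names (`quantize_fp8`, `fp8_matmul`) and the per-tensor scaling choice are assumptions for illustration, and the matmul itself runs in float32 after dequantizing so that the sketch works on any backend.

```python
import jax
import jax.numpy as jnp

FP8 = jnp.float8_e4m3fn              # 4 exponent bits, 3 mantissa bits
FP8_MAX = float(jnp.finfo(FP8).max)  # largest finite value in this format


def quantize_fp8(x):
    """Scale x so its largest magnitude maps to FP8_MAX, then cast to FP8."""
    amax = jnp.max(jnp.abs(x))
    scale = FP8_MAX / jnp.maximum(amax, 1e-12)  # guard against all-zero input
    return (x * scale).astype(FP8), scale


def fp8_matmul(a, b):
    """Simulated FP8 matmul: quantize both operands, multiply, undo the scales."""
    a_q, a_scale = quantize_fp8(a)
    b_q, b_scale = quantize_fp8(b)
    # Hypothetical stand-in for a native FP8 matmul: upcast, then multiply.
    out = a_q.astype(jnp.float32) @ b_q.astype(jnp.float32)
    return out / (a_scale * b_scale)


key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(key_a, (64, 128))
b = jax.random.normal(key_b, (128, 32))

exact = a @ b
approx = fp8_matmul(a, b)
rel_err = float(jnp.linalg.norm(approx - exact) / jnp.linalg.norm(exact))
print(f"relative error of simulated FP8 matmul: {rel_err:.4f}")
```

The scale factors are what keep the small dynamic range of FP8 usable: each tensor is stretched to fill the representable range before the cast, and the product of the two scales is divided out afterwards.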

2. Accelerate with Tokamax kernels

Tokamax is a library of high-performance JAX kernels optimized for TPUs. These kernels are designed to mitigate specific bottlenecks through the following mechanisms:

  1. Splash attention: This mechanism addresses the I/O limitations inherent in standard attention processes. By maintaining computations within on-chip SRAM, it is particularly effective for processing long context lengths where memory bandwidth typically becomes a constraint. 
  2. Megablox Grouped Matrix Multiplication (GMM): This manages the “ragged” tensors often found in Mixture of Experts (MoE) models. By utilizing GMM, the system avoids inefficient padding and ensures higher utilization of the MXU. 
  3. Kernel tuning: The Tokamax library includes utilities for hyperparameter optimization. These tools allow for the adjustment of tile sizes and other configurations to align with the specific memory hierarchy of the Ironwood TPU.
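The idea behind the first item can be sketched in plain JAX. The code below is not Tokamax's API or its Splash attention kernel. It is a hedged illustration of blockwise attention with an online softmax, the general technique that lets an attention kernel process keys and values one tile at a time instead of materializing the full score matrix. The function names and `block_size` are hypothetical, and the sketch is single-head and non-causal.

```python
import jax
import jax.numpy as jnp


def reference_attention(q, k, v):
    """Standard attention: materializes the full (q_len, kv_len) score matrix."""
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v


def blockwise_attention(q, k, v, block_size=128):
    """Processes K/V in blocks, keeping only running softmax statistics."""
    kv_len, head_dim = k.shape
    assert kv_len % block_size == 0, "kv_len must be a multiple of block_size"
    num_blocks = kv_len // block_size
    k_blocks = k.reshape(num_blocks, block_size, head_dim)
    v_blocks = v.reshape(num_blocks, block_size, v.shape[-1])
    scale = 1.0 / jnp.sqrt(head_dim)

    def step(carry, kv_block):
        acc, row_max, row_sum = carry
        k_blk, v_blk = kv_block
        s = (q @ k_blk.T) * scale                    # scores for this block only
        new_max = jnp.maximum(row_max, s.max(axis=-1))
        correction = jnp.exp(row_max - new_max)      # rescale earlier partial sums
        p = jnp.exp(s - new_max[:, None])
        acc = acc * correction[:, None] + p @ v_blk
        row_sum = row_sum * correction + p.sum(axis=-1)
        return (acc, new_max, row_sum), None

    init = (
        jnp.zeros((q.shape[0], v.shape[-1]), jnp.float32),
        jnp.full((q.shape[0],), -jnp.inf, jnp.float32),
        jnp.zeros((q.shape[0],), jnp.float32),
    )
    (acc, _, row_sum), _ = jax.lax.scan(step, init, (k_blocks, v_blocks))
    return acc / row_sum[:, None]


kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (256, 64))
k = jax.random.normal(kk, (512, 64))
v = jax.random.normal(kv, (512, 64))
max_diff = float(jnp.max(jnp.abs(
    blockwise_attention(q, k, v) - reference_attention(q, k, v))))
print(f"max abs difference vs. reference: {max_diff:.2e}")
```

In a real kernel each block would be sized to fit in on-chip memory, which is where the tile-size tuning in the third item comes in.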

3. Offload collectives to SparseCore

The fourth-generation SparseCores in Ironwood are processors specifically designed to manage irregular memory access patterns. By using specific XLA flags, users can offload collective communication operations—such as All-Gather and Reduce-Scatter—directly to the SparseCore.

This offloading mechanism allows the TensorCores to remain dedicated to primary model computations while communication tasks execute in parallel. This functional overlap is a critical strategy for hiding communication latency and ensuring consistent data throughput to the MXUs.
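For readers less familiar with these collectives, the following JAX sketch shows what All-Gather and Reduce-Scatter compute. It does not show the SparseCore offload itself, which is controlled by XLA flags not listed in this article. As an assumption for illustration, `jax.vmap` with an `axis_name` stands in for a group of four devices so the example runs on a single host, and all shapes and names are hypothetical.

```python
import jax
import jax.numpy as jnp

NUM_DEVICES = 4  # simulated by the vmapped axis, not real chips


def all_gather(weight_shard):
    # Every device ends up with the shards held by all devices.
    return jax.lax.all_gather(weight_shard, axis_name="devices")


def reduce_scatter(local_grad):
    # Gradients are summed across devices; each device keeps one slice of the sum.
    return jax.lax.psum_scatter(local_grad, axis_name="devices")


# Device i holds weight_shards[i], a shard of length 2.
weight_shards = jnp.arange(8.0).reshape(NUM_DEVICES, 2)
gathered = jax.vmap(all_gather, axis_name="devices")(weight_shards)

# Device i holds local_grads[i], a full gradient split into NUM_DEVICES slices.
local_grads = jnp.arange(32.0).reshape(NUM_DEVICES, NUM_DEVICES, 2)
scattered = jax.vmap(reduce_scatter, axis_name="devices")(local_grads)

print(gathered.shape)   # one full copy of all shards per device
print(scattered.shape)  # one slice of the summed gradient per device
```

Both operations move data without doing model math, which is why running them on a separate unit can leave the TensorCores free for matmuls.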

4. Fine-tune the memory pipeline on VMEM

VMEM, a critical part of the TPU memory architecture, is fast on-chip SRAM designed to optimize kernel performance. You can improve overall execution speed by tuning how VMEM is split between the current operation and future weight prefetch. For example, reserving more VMEM for the current scope lets the kernel use larger tile sizes, which can improve kernel performance by removing potential memory stalls.

How to get started:

  • Allocate more VMEM to your current scope so you can increase your kernel tile sizes.
  • Use this configuration to eliminate potential memory stalls and keep your systems running quickly.
  • Refer to TPU Pipelining for more on TPU memory architecture.
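The relationship between a VMEM budget and tile size can be sketched with some back-of-the-envelope arithmetic. The model below is an assumption for illustration, not how XLA or any specific kernel accounts for VMEM: it assumes a matmul tile with double-buffered bf16 inputs and a float32 accumulator, and the budgets used are hypothetical values, not Ironwood specifications.

```python
def tile_vmem_bytes(tm, tk, tn, in_bytes=2, acc_bytes=4, buffers=2):
    """Estimated VMEM footprint of one (tm x tk) @ (tk x tn) matmul tile.

    Assumes both input tiles are kept in `buffers` copies (double buffering
    lets the next tile load while the current one computes) and that the
    output is accumulated in a single higher-precision buffer.
    """
    inputs = buffers * (tm * tk + tk * tn) * in_bytes
    accumulator = tm * tn * acc_bytes
    return inputs + accumulator


def largest_square_tile(budget_bytes, candidates=(128, 256, 512, 1024, 2048)):
    """Largest candidate tile edge whose estimated footprint fits the budget."""
    fitting = [t for t in candidates if tile_vmem_bytes(t, t, t) <= budget_bytes]
    return max(fitting) if fitting else None


MIB = 2**20
for budget_mib in (8, 16, 64):  # hypothetical scoped-VMEM budgets
    tile = largest_square_tile(budget_mib * MIB)
    print(f"budget {budget_mib:>2} MiB -> tile edge {tile}")
```

Under this simplified model the footprint grows with the square of the tile edge, so doubling the tile edge needs roughly four times the VMEM. That is the trade-off being tuned: memory given to the current scope is memory not available for prefetching future weights.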

5. Boost your model's performance with the right sharding strategy

MaxText supports several parallelism techniques across all TPUs. By choosing the right sharding strategy based on your model size, architecture, and sequence length, you can significantly improve your training performance.

How to choose your strategy:

  • Fully Sharded Data Parallelism (FSDP): This is the preferred strategy for training large models that exceed the memory capacity of a single chip. FSDP shards model weights, gradients, and optimizer states across multiple chips. Increasing the per-device batch size and introducing more compute can hide the latency of the All-Gather operations and improve efficiency.
  • Tensor Parallelism (TP): Shards individual tensors. Given Ironwood's high arithmetic intensity, TP is most effective for very large model dimensions. Leveraging TP with a dimension of 2 can take advantage of the fast die-to-die interconnect on Ironwood's dual-chiplet design.
  • Expert Parallelism (EP): Helpful for MoE models to distribute experts across devices.
  • Context Parallelism (CP): Necessary for very long sequences, sharding activations along the sequence dimension.
  • Hybrid approaches: Combining strategies is often required to balance compute, memory, and communication on large-scale runs.
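As a rough illustration of how such strategies are expressed, here is a small JAX sketch that shards a weight matrix and a batch over a two-axis device mesh. It uses the public `jax.sharding` API rather than MaxText's configuration, which is not shown in this article. The axis names `fsdp` and `tensor`, the mesh shape, and the array sizes are all assumptions; on a single-device machine the mesh collapses to 1x1 and the code still runs.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Put every available device on the "fsdp" axis; "tensor" has size 1 here.
devices = np.array(jax.devices())
n = devices.size
mesh = Mesh(devices.reshape(n, 1), axis_names=("fsdp", "tensor"))

# Sizes are multiples of the device count so every shard is the same size.
batch, d_in, d_out = 4 * n, 8 * n, 16
x = jax.random.normal(jax.random.PRNGKey(0), (batch, d_in))
w = jax.random.normal(jax.random.PRNGKey(1), (d_in, d_out))

# FSDP-style: the batch and the weight rows are both split across "fsdp".
# TP-style: the weight's output columns are split across "tensor".
x_sharded = jax.device_put(x, NamedSharding(mesh, P("fsdp", None)))
w_sharded = jax.device_put(w, NamedSharding(mesh, P("fsdp", "tensor")))


@jax.jit
def forward(x, w):
    # The compiler inserts whatever collectives the shardings require.
    return x @ w


y = forward(x_sharded, w_sharded)
print(w_sharded.sharding)
print(y.shape)
```

Changing the strategy here means changing the mesh shape and the `PartitionSpec` entries, not the model code, which is what makes it practical to experiment with hybrid layouts.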

See the Optimizing Frontier Model Training on TPU v7x Ironwood post in the Developer forums for more detail on techniques 2-5 above.

The Ironwood advantage: System-level performance

These optimization techniques, coupled with Ironwood's architectural strengths like the high-speed 3D Torus Inter-Chip Interconnect (ICI) and massive HBM capacity, create a highly performant platform for training frontier models. The tight co-design across hardware, compilers (XLA), and frameworks (JAX, MaxText) helps you extract maximum performance from your AI infrastructure.

Ready to accelerate your AI journey? Explore the resources below to dive deeper into each optimization method.

Further reading: Inside the optimization of FP8 training on Ironwood



Such a crucial insight! As an aspiring tech engineer, seeing how Google Cloud tackles the immense infrastructure challenges of trillion-parameter models is incredibly inspiring. The reminder that raw hardware like the Ironwood TPUs must be perfectly paired with deep software optimization is a brilliant lesson for future developers. Thank you for putting together this fantastic guide!

Eko Satrio

Chief Executive Officer, PT Momentum Teknodata Semesta

2mo

The constraint is no longer raw compute, but synchronization between compute, memory, and communication layers; FP8, Tokamax, and SparseCore are just different ways of collapsing that gap. Once that coordination becomes the dominant variable, scaling stops being a hardware problem and becomes a system orchestration problem.


The transition towards input and output of AI are powered demand. With the help of seventh-generation TPU testing built Ironwood, with help of hardware, need software for the ounce of performance, developer trained hit peak performance on Ironwood hardware.


Ironwood TPUs shift the game from training bottlenecks to inference at scale optimized for the “thinking models” era where reasoning and embeddings define real-world AI value.
