Storage best practices for AI/ML workloads on TPU VMs
To maximize the performance and cost-efficiency of your AI/ML workloads on TPU VMs, select and configure the right storage solution for your workload. By removing I/O bottlenecks, you can reduce the amount of time that your TPU accelerators are idle, which reduces training time and costs.
This document provides workload-specific storage recommendations and optimization best practices for training, checkpointing, serving, and caching on TPU VMs. Before applying these practices, review the available Storage options for TPU data. This document assumes that you are familiar with TPU VMs and have basic experience provisioning Cloud Storage resources.
Workload-specific guidance
The following table provides storage recommendations, listed in order of preference, for different workloads:
| Workload | Recommendation | Optimization and tooling (if applicable) |
|---|---|---|
| Training datasets, including data preparation | | |
| Checkpointing and reinforcement learning weights | | For Rapid Bucket, use the gcsfusecsi-checkpointing profile. |
| Model storage and download | | For downloading models, use GKE Run:ai Model Streamer or Cloud Storage FUSE with a separate mount point using the gcsfusecsi-serving profile. |
| Key-value (KV) cache offload | | |
Cloud Storage optimizations
The following sections describe best practices to optimize performance when using Cloud Storage with TPU VMs.
Enable hierarchical namespace for metadata optimization
To improve metadata performance, enable hierarchical namespace when you create regional buckets for AI/ML workloads. Metadata performance refers to how quickly Cloud Storage can process operations that involve looking up, listing, or modifying object paths and folders, rather than reading or writing the file contents themselves.
In buckets without hierarchical namespace enabled, folders don't exist as actual
resources, but are instead simulated folders
represented by object name prefixes delimited by forward slashes (/). This
makes operations like listing directory contents or renaming directories slow
because the system must scan all objects with that prefix. Hierarchical
namespace provides a true file system structure, which is important for AI/ML
workloads for several reasons:
- Atomic directory renames: ML frameworks use directory renames to finalize checkpoints. Hierarchical namespace supports atomic renames, ensuring that checkpoints are finalized quickly.
- Higher initial QPS: Hierarchical namespace supports up to eight times higher initial queries per second (QPS) for reads and writes compared to buckets without hierarchical namespace enabled. This prevents bottlenecks when many TPUs access storage simultaneously.
- Efficient folder-level operations: Finding and listing files within specific directories is much faster, reducing response times during training and data loading.
Zonal buckets, offered through Rapid Bucket, use hierarchical namespace by default. For more information, see Hierarchical namespace overview.
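The difference between simulated folders and real folder resources can be illustrated with a toy model. The following sketch is not the Cloud Storage API; the function names and the in-memory dictionaries are hypothetical, and the sketch only counts how many entries a checkpoint-finalizing rename has to touch in each layout.

```python
# Toy model only: plain dictionaries stand in for a bucket.
# Nothing here calls Cloud Storage; all names are hypothetical.

def rename_prefix_flat(objects, old, new):
    """Rename a simulated folder by rewriting every object under the prefix.

    Returns the number of objects touched. Partway through the loop some
    objects live under the new prefix and some under the old one, so the
    rename is not atomic.
    """
    touched = 0
    for name in [n for n in objects if n.startswith(old)]:
        objects[new + name[len(old):]] = objects.pop(name)
        touched += 1
    return touched


def rename_folder_hierarchical(folders, old, new):
    """Rename a real folder resource: one update, regardless of folder size."""
    folders[new] = folders.pop(old)
    return 1


# A checkpoint with 1,000 shards, written to a temporary directory first.
flat = {f"ckpt/tmp/shard-{i:05d}": b"" for i in range(1000)}
tree = {"ckpt/tmp/": {f"shard-{i:05d}": b"" for i in range(1000)}}

flat_ops = rename_prefix_flat(flat, "ckpt/tmp/", "ckpt/step-100/")
tree_ops = rename_folder_hierarchical(tree, "ckpt/tmp/", "ckpt/step-100/")
print(flat_ops, tree_ops)
```

In this model the flat layout touches one entry per shard, while the hierarchical layout performs a single update, which is why the rename can be both fast and atomic.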
Use Cloud Storage FUSE with appropriate profiles
Cloud Storage FUSE is a FUSE adapter that lets you mount buckets as a local file system. When using Google Kubernetes Engine, we recommend using the Cloud Storage FUSE CSI driver and the Cloud Storage FUSE profiles to automate performance tuning.
For more information about best practices for using Cloud Storage FUSE, see Performance tuning best practices.
Customize the TPU VM boot disk
You can customize the guest OS environment on a TPU VM by using startup scripts or by creating custom images. Customizing the boot disk is useful for the following scenarios:
- Pre-loading software and libraries: Install specific ML frameworks, dependencies, or custom software to reduce VM startup time and ensure consistent environments.
- Using non-standard OS distributions: Use an OS distribution or version not included in the Google-managed list.
- Applying security and monitoring configurations: Apply custom security settings, install monitoring agents, or set environment variables.
However, boot disk recovery for TPU VMs is limited. You can't detach or snapshot the boot disk for offline repair, so use caution when making changes that affect the boot process. By following these best practices, you can reduce the risk of boot failures when customizing your TPU VM environments.
Keep the following key principles in mind when customizing your boot disk:
- Minimize boot disk modifications: Whenever possible, install applications and store data on Persistent Disk or Hyperdisk volumes rather than heavily modifying the boot disk.
- Use UUIDs for mounting: When adding entries to the /etc/fstab file, always use UUIDs to identify disks and partitions (UUID=...) rather than device names like /dev/sdb1. Auto-generated device names are not guaranteed to be stable across reboots.
Follow these recommendations to reduce the risk of boot failures when making system changes:
- Error handling: Implement robust error checking and graceful failure modes in your scripts. Log detailed messages to the serial console and Cloud Logging to aid in debugging.
- Critical dependencies: Be extremely careful when modifying files essential for booting, such as the /etc/fstab file, network configurations, or bootloader settings. A syntax error or incorrect entry can render the VM unbootable.
- Secondary disks: If your script relies on secondary disks, ensure it handles cases where the disk might not be present or takes longer to attach than expected. Avoid making the boot process critically dependent on secondary disk mounts unless absolutely necessary.
  The following are examples of recommended and not recommended /etc/fstab entries for mounting a secondary disk:
  - Recommended: UUID=a1b2c3d4-e5f6-7890-1234-567890abcdef /mnt/mydata ext4 defaults,nofail 0 2
  - Not recommended: /dev/sdb1 /mnt/mydata ext4 defaults 0 2
  Using the nofail option can prevent the system from halting if the disk isn't found, but ensure your application can handle the mount point being unavailable.
- Package management: Be cautious when adding third-party repositories. Ensure they are trusted and compatible with the base OS image. Understand the dependencies of any packages you install and their potential impact on system libraries.
- Disk space: Monitor boot disk usage. Extensive logging or large software installations can fill the boot disk, preventing the VM from starting.
- Logging: Configure your applications and scripts to log verbosely to the serial console, as this is the primary tool for diagnosing boot issues on TPU VMs.
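The /etc/fstab guidance above can be checked before a change is written to the boot disk. The following is a minimal sketch, assuming the entry describes a secondary data disk (root or pseudo-filesystem entries would be flagged incorrectly). The function name check_fstab_entry is hypothetical and not part of any Google Cloud tooling.

```python
def check_fstab_entry(line):
    """Return a list of warnings for one secondary-disk /etc/fstab line."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return []  # blank lines and comments need no checks
    fields = stripped.split()
    if len(fields) < 4:
        return ["entry has fewer than four fields"]
    device, _mount_point, _fs_type, options = fields[:4]
    warnings = []
    # Kernel-assigned names such as /dev/sdb1 can change between boots.
    # Paths under /dev/disk/by-* are stable symlinks, so they are allowed.
    if device.startswith("/dev/") and not device.startswith("/dev/disk/by-"):
        warnings.append(f"{device}: device name may change; use UUID=...")
    # Without nofail, a missing disk can stop the boot process.
    if "nofail" not in options.split(","):
        warnings.append("missing nofail: boot can halt if the disk is absent")
    return warnings


good = "UUID=a1b2c3d4-e5f6-7890-1234-567890abcdef /mnt/mydata ext4 defaults,nofail 0 2"
bad = "/dev/sdb1 /mnt/mydata ext4 defaults 0 2"

for entry in (good, bad):
    print(entry, "->", check_fstab_entry(entry) or "ok")
```

A startup script could run a check like this on each line it intends to append and log any warnings to the serial console instead of modifying the file.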
Plan your storage capacity
Plan both the storage capacity and the checkpoint bandwidth that your workload needs to keep your accelerators fully utilized.
Estimate storage
You can use the following estimates as a starting point for your storage requirements:
| Workload type | Dataset storage | Checkpoint storage |
|---|---|---|
| LLM pre-training | 2 TB per TPU | 200 GB per TPU |
| Multimodal training | 12 TB per TPU | 1 TB per TPU |
| Inference | 1 TB per TPU | 1 GB per TPU |
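To scale the per-TPU estimates to a whole deployment, multiply by the number of TPUs. The following sketch copies the values from the table above. The dictionary keys, the function name, and the use of decimal units (1 TB = 1,000 GB) are assumptions made for illustration.

```python
# Per-TPU starting estimates from the table above, in GB.
# Assumes decimal units: 1 TB = 1,000 GB.
ESTIMATES_GB_PER_TPU = {
    "llm_pretraining": {"dataset": 2_000, "checkpoint": 200},
    "multimodal_training": {"dataset": 12_000, "checkpoint": 1_000},
    "inference": {"dataset": 1_000, "checkpoint": 1},
}


def estimate_storage_gb(workload, num_tpus):
    """Scale the per-TPU estimates linearly by the number of TPUs."""
    per_tpu = ESTIMATES_GB_PER_TPU[workload]
    return {kind: gb * num_tpus for kind, gb in per_tpu.items()}


# Example: an LLM pre-training job on a hypothetical 256-TPU deployment.
estimate = estimate_storage_gb("llm_pretraining", 256)
print(estimate)
```

Linear scaling is only a starting point. Datasets that are shared by all TPUs, or checkpoints that are sharded rather than replicated, can need less than this estimate suggests.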
Estimate checkpoint bandwidth
You can estimate the minimum checkpointing bandwidth required for training workloads using the following formula. For data reads, multiple training runs, or training and inference, increase your estimated bandwidth requirements proportionally.
- Checkpoint size: Number of parameters × bytes per parameter (approximately 12-16 bytes per parameter for FP16 + optimizer state). Add a buffer (approximately 3×) for optimizer states and different precisions.
- Checkpoint interval: How often you save a checkpoint (for example, every 15 minutes).
- Required bandwidth: Checkpoint size ÷ checkpoint interval.
The following example shows how to estimate the minimum checkpointing bandwidth for Qwen3-72B:
- Checkpoint size: 72B parameters × 12 bytes ≈ 864 GB per checkpoint. With buffer, 3 × 864 GB ≈ 2.6 TB.
- Checkpoint interval: 2 minutes = 120 seconds.
- Required bandwidth: 2.6 TB ÷ 120 seconds ≈ 22 GBps.
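The same calculation can be expressed as a small helper. This is a sketch under the assumptions stated in the formula: decimal units (1 GB = 10^9 bytes), 12 bytes per parameter, and a 3× buffer. The function and parameter names are hypothetical. Carrying the unrounded values through gives about 21.6 GBps for the example inputs.

```python
def checkpoint_bandwidth_gbps(num_params, bytes_per_param=12,
                              buffer_factor=3.0, interval_seconds=120.0):
    """Estimate the minimum checkpoint write bandwidth in GB per second.

    Checkpoint size = parameters x bytes per parameter x buffer factor.
    Required bandwidth = checkpoint size / checkpoint interval.
    """
    checkpoint_gb = num_params * bytes_per_param / 1e9  # bytes to GB
    buffered_gb = checkpoint_gb * buffer_factor
    return buffered_gb / interval_seconds


# Inputs from the example above: 72B parameters, 12 bytes each,
# 3x buffer, one checkpoint every 2 minutes.
bandwidth = checkpoint_bandwidth_gbps(72e9, 12, 3.0, 120)
print(f"{bandwidth:.1f} GBps")
```

Lengthening the interval lowers the requirement proportionally. With the same inputs and a 15-minute interval, the estimate drops to about 2.9 GBps.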
Reference recipes
For examples of storage configurations for specific hardware and workloads, see the following recipes:
- TPU7x inference:
- TPU7x training:
Quotas and bandwidth limits
The bandwidth for Cloud Storage and Compute Engine offerings is limited by default quotas. If you exceed a quota, your input and output requests might be throttled.
For information about Cloud Storage quotas and how to request increases, see Quotas & limits in the Cloud Storage documentation. For information about Compute Engine quotas for Hyperdisk and Persistent Disk, see Disk quotas.
What's next
- Storage options for TPU data
- Connect TPU VMs to Cloud Storage buckets
- Attach durable block storage to a TPU VM
- Cloud Storage FUSE performance tuning best practices