Scaling LLM Inference: Multi-Node KV Cache Offloading with GKE & Managed Lustre

July 1, 2026
Miro Nikolov

Staff Software Engineering Manager, Google Cloud Managed Lustre

Barak Epstein

Senior Product Manager, Google Cloud Managed Lustre

Significant contributors to this article include Sneha Aradhey, Software Engineer, Google Kubernetes Engine, and Michael MacDonald, Sr Software Engineer, Google Cloud Managed Lustre.

Enterprise production environments are shifting to distributed, multi-node architectures to serve long-context and agentic AI workloads. As these workloads scale, KV caches often outgrow local CPU RAM and host SSD cache tiers.

To handle this, some setups pool node-local storage into a distributed layer (such as multi-node pooled NVMe arrays). Pooling SSDs aggregates raw capacity and can put spare local drives to use. The limitation is that the compute cluster must then manage its own complex data distribution and cross-node replication.

An alternative is to offload the attention state to a dedicated, high-performance external parallel filesystem. We utilize Google Cloud Managed Lustre with the llm-d offloading stack as a cluster-wide decentralized attention cache tier, bypassing host-level capacity limits and eliminating the networking overhead of managing local pooled drives.

With this approach, we achieve efficiency at scale:

Google Cloud Managed Lustre enables over 50% TCO savings and reduces GPU-hour requirements for Llama-3.3-70B inference on a six-node A3 Mega cluster by nearly 60%. These gains are realized by offloading shared, prefilled KV caches to Lustre’s high-performance tier with a 95% cache hit rate.

Benchmark Configuration

  • Model: Llama-3.3-70B
  • Context Dynamics: Prompt length of 50,000 tokens, input question length of 256 tokens, and output length of 512 tokens.
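
To give a sense of scale for the context lengths above, the following is a rough back-of-the-envelope sketch of KV cache size per request. It assumes the public Llama-3.3-70B configuration (80 layers, 8 KV heads, head dimension 128) and a 16-bit KV cache; the dtype and tensor-parallel layout used in the benchmark are not stated in this post, so treat the numbers as an estimate only.

```python
# Rough KV cache sizing for one request.
# Model shape values come from the public Llama-3.3-70B config.
NUM_LAYERS = 80
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # ASSUMPTION: bf16/fp16 KV cache

def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Bytes of KV cache per token: one key and one value vector
    per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def kv_bytes_for_context(num_tokens):
    """Total KV cache bytes for a context of num_tokens tokens."""
    return num_tokens * kv_bytes_per_token()

if __name__ == "__main__":
    prompt_tokens = 50_000
    per_token = kv_bytes_per_token()
    total = kv_bytes_for_context(prompt_tokens)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{total / 1e9:.2f} GB for a {prompt_tokens}-token prompt")
```

Under these assumptions a single 50,000-token prompt holds roughly 16 GB of attention state, which is why a handful of concurrent long-context sessions can exhaust host RAM.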

Extension of Lustre KV Cache solution with CPU RAM offload

The Managed Lustre KV Cache offload architecture can be extended by adding offload to CPU RAM. Compared to CPU offload alone, this hybrid approach delivers an approximately 40% improvement in Time to First Token (TTFT) and a 30% reduction in end-to-end latency for Llama-3.3-70B inference.

User Guide

Architectural Components

  • GKE GPU Nodes: Dedicated accelerator resources provisioned exclusively for high-throughput model execution and tensor-parallel operations.
  • Managed Lustre: A shared, high-bandwidth parallel filesystem acting as a centralized external tier that caches prefilled attention states to eliminate redundant prefill computation.
  • PVC Evictor: A scalable, distributed garbage collection service that tracks file access patterns and automatically removes Least-Recently-Used (LRU) cache chunks to maintain healthy storage headroom.

Target Models

This guide provides two distinct, validated tracks for deployment depending on your model preference:

  1. Qwen Series: Qwen/Qwen3.5-35B-A3B
  2. Gemma 4 Architecture: google/gemma-4-31B-it

Architectural Diagram

https://storage.googleapis.com/gweb-cloudblog-publish/images/Scaling_LLM_Inference__Multi-Node_KV_Cache.max-2200x2200.png

Before You Begin

Before starting this deployment, ensure your Google Cloud project is properly configured:

  • Quota: Verify you have sufficient quota for the selected accelerators in your chosen region, as well as adequate general CPU, memory, and Managed Lustre quotas.
  • Validate Required IAM Permissions for Managed Lustre
  • Prepare your Environment to Connect to Managed Lustre: Complete the “Before You Begin” steps to enable APIs, set up environment variables, and set up your VPC.
    • GKE Version: The Managed Lustre CSI driver is supported on GKE versions 1.33 or later. For the best experience and default port (988) usage, GKE version 1.33.2-gke.4780000 or later is recommended.

Overview of Required Steps

  • Create the GKE Cluster
  • Create the GPU Compute node pool
  • Provision Lustre storage
  • Deploy vLLM Serving Engine with Lustre
  • Deploy the PVC Evictor
  • Clean Up

1. Create the GKE Cluster

Create a rapid-channel GKE cluster with Workload Identity and all necessary CSI storage add-ons enabled (Lustre, GCSFuse and Persistent Disk).

Loading...

2. Create the GPU Compute Node Pool

Provision a GPU VM node pool (e.g., a3-megagpu-4g or a4-highgpu-4g).

Loading...

3. Provision Lustre Storage (Auto-provisioned)

Before deploying vLLM, you need to provision the Lustre storage. We use an auto-provisioned Lustre instance via a StorageClass and a PersistentVolumeClaim (PVC).

Create a file named lustre-pvc.yaml with the following content:

Loading...

Note: Performance tier options are "125", "250", "500", and "1000". Per-tier capacity ranges and increments are listed in the Managed Lustre documentation.

Apply this manifest to provision the Lustre instance and observe provisioning:

Loading...

4. Deploy vLLM Serving Engine with Lustre

Step 4a: Create the Hugging Face Access Secret

Before submitting the deployment manifest, you must provision your Hugging Face API token as a secure secret within the cluster.

Run the following command, replacing `<INSERT_HF_TOKEN>` with your token:

Loading...

Step 4b: Create the vLLM Deployment Manifest

This complete Kubernetes manifest deploys the vLLM engine, configures the llmd-fs-connector for high-performance KV-caching, and mounts your parallel Lustre storage (lustre-pvc).

Common Manifest (Choose between Qwen3.5 or gemma-4)

Replace example values between <> with appropriate values for your environment.

Loading...

Note: Qwen3.5 requires a block size of 528 to avoid fragmentation, while Gemma 4 works with the default of 256.

Step 4c: Apply and Verify Deployment

To apply this manifest to your cluster, run:

Loading...

Step 4d: Track Model Download Status

Because large models can take some time to download on first boot, stream the container logs to follow initialization:

Bash

Loading...

5. Deploy the PVC Evictor

PVC Evictor Overview

Architecture & Role

The llmd_fs_backend connector offloads KV-cache blocks to Lustre but does not natively delete old cache files. Over time, the cache will fill the shared filesystem. The PVC Evictor acts as an external garbage collector that continuously monitors disk usage and evicts least-recently-used (LRU) files to maintain healthy storage headroom.
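
The monitor-then-evict behavior described above can be illustrated with a small sketch. This is not the PVC Evictor's code: the function names and the two thresholds (start cleaning at 85% usage, stop at 70%) are hypothetical values chosen for illustration.

```python
import shutil

def usage_fraction(used_bytes, total_bytes):
    """Fraction of the filesystem in use, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes

def next_state(evicting, usage, cleanup_threshold=0.85, target_threshold=0.70):
    """Decide whether eviction should be running.

    HYPOTHETICAL thresholds: start when usage reaches cleanup_threshold,
    keep going until usage drops to target_threshold. The gap between the
    two avoids rapid start/stop cycling around a single limit.
    """
    if not evicting:
        return usage >= cleanup_threshold
    return usage > target_threshold

def mount_usage(path):
    """Usage fraction of the filesystem that contains `path`."""
    du = shutil.disk_usage(path)
    return usage_fraction(du.used, du.total)

if __name__ == "__main__":
    usage = mount_usage(".")  # in a pod this would be the Lustre mount path
    print(f"usage={usage:.1%}, start eviction={next_state(False, usage)}")
```

Using two thresholds rather than one is a common design choice for garbage collectors of this kind, since it lets each cleanup pass free a meaningful amount of headroom before stopping.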

Scaling & Sharding

The PVC Evictor supports sharding and can be scaled to multiple replicas to match the capacity and performance of your Lustre instance. As a rule of thumb, deploy one evictor replica per 72 TB of Lustre capacity to distribute the eviction load without overwhelming the metadata servers.
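
The rule of thumb above translates into a simple calculation. The sketch below is only an illustration; the function name is hypothetical, and rounding up (so that a partially filled 72 TB increment still gets its own replica) is an assumption rather than documented behavior.

```python
import math

TB_PER_REPLICA = 72  # rule of thumb from this guide

def evictor_replicas(capacity_tb):
    """Suggested evictor replica count for a Lustre instance.

    ASSUMPTION: round up, and always run at least one replica.
    """
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return max(1, math.ceil(capacity_tb / TB_PER_REPLICA))

if __name__ == "__main__":
    for capacity in (18, 72, 100, 144):
        print(f"{capacity} TB -> {evictor_replicas(capacity)} replica(s)")
```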

For large-scale deployments, the evictor can be configured to run with multiple shards. When running in multi-replica mode, the workload is partitioned across pods, with each pod managing a specific shard of the cache namespace. This prevents redundant metadata scans and race conditions.
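
One common way to partition a file namespace so that each path has exactly one owner is to hash the path and take the result modulo the shard count. The sketch below shows that idea only; this post does not describe how the evictor actually assigns shards, so the hashing scheme, function names, and example paths are all assumptions.

```python
import hashlib

def shard_for(path, total_shards):
    """Map a cache file path to a shard index in [0, total_shards).

    Uses SHA-256 rather than Python's built-in hash(), which is salted
    per process and would give different answers in different pods.
    """
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_shards

def owned_by(path, shard_id, total_shards):
    """True if the pod handling shard_id is responsible for this path."""
    return shard_for(path, total_shards) == shard_id

if __name__ == "__main__":
    # Hypothetical cache file paths, for illustration only.
    paths = [f"/mnt/lustre/kv-cache/block-{i:04d}.bin" for i in range(6)]
    for p in paths:
        print(p, "-> shard", shard_for(p, total_shards=2))
```

Because every pod computes the same mapping independently, no two pods scan or delete the same file, which is the property the multi-replica mode relies on.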

High-Performance Resource Requirements

Running the evictor at high scale (e.g., with 16 parallel crawler processes) requires significant CPU and memory resources to handle the rapid scanning and queue management of millions of files. Ensure that the pods are provisioned with sufficient resources (e.g., 12 CPU requests and 8Gi Memory requests) and scheduled on appropriate node types (such as c4-standard-16).

PVC Evictor Deployment Steps

The PVC Evictor is deployed via Helm using the chart located in kv_connectors/pvc_evictor/helm.

Step 5a: Create a Dedicated Node Pool for the Evictor

Running the evictor at high scale requires significant CPU and memory. First, create a dedicated node pool using a high-performance machine type (such as c4-standard-16) to accommodate the 12 CPU and 8Gi memory requests needed per pod.

Loading...

Step 5b: Install via Helm (High-Performance Configuration)

Deploy a scaled, high-performance evictor pool with 2 replicas to monitor lustre-pvc. This configuration uses 16 crawler processes per pod to handle massive file namespaces.

Note on Security Contexts: To allow the evictor pod to delete files created by vLLM, it must run with matching security context IDs. The placeholders <YOUR_NON_ROOT_GID> and <YOUR_NON_ROOT_UID> must exactly match the non-root values used in the securityContext of your vLLM deployment so that both workloads share the same POSIX file permissions.

Loading...

Critical Parameters Explained:

  • replicaCount=2: Deploys 2 evictor pods. The Helm chart automatically configures sharding (totalShards=2) when multiple replicas are used.
  • config.numCrawlerProcesses=16: Runs 16 parallel crawler processes per pod to scan the filesystem rapidly.
  • config.deletionBatchSize=5000: Deletes files in batches of 5000 to reduce metadata overhead.
  • config.fileQueueMinSize & config.fileQueueMaxsize: Configures large memory queues (1M min, 2M max) to buffer files for deletion, matching the high crawler throughput.
  • config.fileAccessTimeThresholdMinutes=10: Aggressively evicts files that haven't been accessed in the last 10 minutes when the cleanup threshold is triggered.
  • securityContext.container.runAsNonRoot=false: Required if the evictor needs root-like permissions to manage/delete files across different user ownerships on the shared storage.
  • resources.requests & limits: Allocates 12-15 CPUs and 8-16Gi of memory per pod so that the crawler processes are not CPU-throttled and do not run out of memory (OOM).
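
To make the access-time threshold and batch-size parameters above more concrete, here is a minimal sketch of how stale files might be selected and grouped for deletion. It is an illustration under assumptions, not the evictor's implementation: the function names and the (path, access time) tuple layout are hypothetical, and only the default values 10 and 5000 come from the parameters listed above.

```python
from typing import Iterator, List, Tuple

# (path, last access time in epoch seconds); hypothetical layout
FileEntry = Tuple[str, float]

def select_evictable(entries: List[FileEntry], now: float,
                     threshold_minutes: int = 10) -> List[FileEntry]:
    """Return files not accessed within the threshold, oldest first."""
    cutoff = now - threshold_minutes * 60
    stale = [entry for entry in entries if entry[1] < cutoff]
    return sorted(stale, key=lambda entry: entry[1])

def batches(items: list, batch_size: int = 5000) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

if __name__ == "__main__":
    now = 10_000.0
    files = [("a", now - 30 * 60), ("b", now - 5 * 60), ("c", now - 11 * 60)]
    for batch in batches(select_evictable(files, now), batch_size=2):
        print([path for path, _ in batch])
```

Sorting by access time means the least recently used files go first, and batching keeps the number of metadata operations issued at once bounded.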

Step 5c: Verify and Monitor

Loading...

6. Clean Up

Because this deployment provisions significant, high-cost hardware, be sure to clean up your environment when you are done to avoid unnecessary charges.

Bash

Loading...

Appendix: Reference Configuration for Llama-3.3-70B Benchmark

The following configuration represents the deployment manifest used to generate the Llama-3.3-70B benchmark results referenced in this post. It is provided for completeness and transparency.

Note: This configuration utilizes an earlier iteration of the software stack (vLLM v0.15.0) and specific infrastructure flags that were active in the benchmarking environment at the time the data was collected.

Loading...