This project provides a suite of tools for simulating and optimizing GKE (Google Kubernetes Engine) capacity strategies, specifically focusing on balancing latency targets with infrastructure costs using active and standby buffers.
The simulator allows you to model how different buffer configurations (active vs. standby) handle bursts of traffic. It provides detailed analytics on pod startup latencies and cost implications.
- Active Buffer: Pods ready to serve traffic immediately (100% cost).
- Standby Buffer: Pods that can be resumed quickly (e.g., 5% cost, ~30s startup).
- Features:
  - Real-time simulation of pod allocation.
  - Latency distribution charts (P50, P90, P95, P99).
  - Infrastructure inventory tracking over time.
  - Persistent state via URL hash for easy sharing.
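Sharing state through the URL hash comes down to round-tripping a configuration object through the URL fragment. The sketch below assumes a hypothetical `SimState` shape and a JSON-in-fragment encoding. Neither is taken from the project's source, and the real schema and encoding may differ.

```typescript
// Hypothetical shape of the shareable simulator state (not the project's real schema).
interface SimState {
  activeBuffer: number;
  standbyBuffer: number;
  resumeLatencySec: number;
}

// Serialize state into a URL fragment, e.g. for assignment to window.location.hash.
function encodeStateToHash(state: SimState): string {
  return "#" + encodeURIComponent(JSON.stringify(state));
}

// Parse a fragment back into state. Any missing, malformed, or non-numeric field
// falls back to the corresponding default.
function decodeStateFromHash(hash: string, defaults: SimState): SimState {
  const body = hash.charAt(0) === "#" ? hash.slice(1) : hash;
  let parsed: Record<string, unknown> = {};
  try {
    const value: unknown = JSON.parse(decodeURIComponent(body));
    if (typeof value === "object" && value !== null) {
      parsed = value as Record<string, unknown>;
    }
  } catch {
    // Malformed hash: leave `parsed` empty so every field uses its default.
  }
  const num = (key: keyof SimState): number => {
    const v = parsed[key];
    return typeof v === "number" && Number.isFinite(v) ? v : defaults[key];
  };
  return {
    activeBuffer: num("activeBuffer"),
    standbyBuffer: num("standbyBuffer"),
    resumeLatencySec: num("resumeLatencySec"),
  };
}
```

In a browser you would write the result of `encodeStateToHash` to `window.location.hash` whenever settings change, then call `decodeStateFromHash(window.location.hash, defaults)` on page load.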
The optimizer helps you find the most cost-effective buffer configuration that meets your specific latency targets.
- Inputs: Traffic patterns, standby resume latency, and scale-up latency.
- Targets: Define maximum acceptable latencies for P50, P90, P95, and P99 percentiles.
- Output: Identifies the "Best" configuration and provides a cost curve of all valid candidates.
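To check a candidate configuration against the targets, you compute percentiles over its simulated startup latencies and compare each one with its maximum. This minimal sketch assumes the nearest-rank percentile definition and a hypothetical `LatencyTargets` shape with latencies in seconds. The project may compute percentiles differently, for example by interpolating.

```typescript
// Hypothetical target shape: maximum acceptable latency (seconds) per percentile.
interface LatencyTargets {
  p50: number;
  p90: number;
  p95: number;
  p99: number;
}

// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  return sortedAsc[Math.min(Math.max(rank, 1), sortedAsc.length) - 1];
}

// A configuration is valid only if every percentile is within its target.
function meetsTargets(latencies: number[], targets: LatencyTargets): boolean {
  const sorted = latencies.slice().sort((a, b) => a - b);
  return (
    percentile(sorted, 50) <= targets.p50 &&
    percentile(sorted, 90) <= targets.p90 &&
    percentile(sorted, 95) <= targets.p95 &&
    percentile(sorted, 99) <= targets.p99
  );
}
```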
This simulator models a tiered capacity strategy designed for high-scale, bursty workloads (e.g., large-scale batch processing or rapid scaling events) in Google Kubernetes Engine.
Instead of relying solely on cold starts (which are slow) or over-provisioning (which is expensive), this approach uses three distinct tiers:
- Active Buffer (Hot): Fully provisioned and running pods. Requests hitting this buffer see near-instant startup latency (~1s). This is the most expensive tier (100% cost).
- Standby Buffer (Warm): Pods that are provisioned but in a "suspended" or "idled" state. When a request hits this buffer, the pod must be "resumed." This incurs a Resume Latency (typically 30s) but at a significantly reduced cost (e.g., 5% of an active pod).
- Scale-up (Cold): When both buffers are exhausted, GKE must provision new infrastructure. This incurs the full Scale-up Latency (typically 60s+).
The simulation uses a discrete-event model to track pod availability, latency, and cost over time based on defined target capacities for the Active Buffer and Standby Buffer.
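A discrete-event model runs a loop that jumps the clock straight to the next scheduled event rather than stepping through fixed time slices. The sketch below shows that loop. The names `EventQueue`, `runSimulation`, and the event kinds are hypothetical and are not the project's API.

```typescript
// Hypothetical event kinds; the real simulator may use different or additional ones.
type EventKind = "request" | "resumeComplete" | "scaleUpComplete";

interface SimEvent {
  timeSec: number; // simulated time at which the event fires
  kind: EventKind;
}

// Minimal time-ordered queue (insertion keeps it simple; a binary heap scales better).
class EventQueue {
  private events: SimEvent[] = [];

  push(event: SimEvent): void {
    // Insert after any events with the same time so equal-time events stay FIFO.
    let i = this.events.length;
    while (i > 0 && this.events[i - 1].timeSec > event.timeSec) i--;
    this.events.splice(i, 0, event);
  }

  pop(): SimEvent | undefined {
    return this.events.shift();
  }
}

// Jump the clock from event to event; handlers may schedule future events.
function runSimulation(
  queue: EventQueue,
  handle: (event: SimEvent, schedule: (next: SimEvent) => void) => void,
): SimEvent[] {
  const processed: SimEvent[] = [];
  let event = queue.pop();
  while (event !== undefined) {
    processed.push(event);
    handle(event, (next) => queue.push(next));
    event = queue.pop();
  }
  return processed;
}
```

A handler for a `request` event might, for example, schedule a `scaleUpComplete` event 60 seconds later when both buffers are empty.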
For every incoming pod request:
- Active Pool: If an idle pod is available in the Active Buffer, it is used immediately (1s latency).
- Standby Pool: If the Active Buffer is empty, a pod is pulled from the Standby Buffer (prioritizing "warm" resumed pods, then "suspended" pods). Using a suspended pod incurs Standby Resume Latency (~30s).
- Scale-up: If both buffers are exhausted, a new pod is provisioned via Scale-up Latency (~60s).
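Each allocation is a fallthrough through the pools in priority order. This sketch reduces the pools to simple counters, which the real simulator likely does not do since it tracks individual pods. The latency constants are the typical values quoted above. The sketch also assumes that a warm standby pod, one that is not yet suspended, starts as fast as an active pod. All names are hypothetical.

```typescript
// Illustrative latencies in seconds, using the typical values from this README.
const ACTIVE_LATENCY_SEC = 1;
const RESUME_LATENCY_SEC = 30;
const SCALE_UP_LATENCY_SEC = 60;

// Hypothetical pool counters.
interface Pools {
  activeIdle: number;       // idle pods in the Active Buffer
  standbyWarm: number;      // standby pods that have not yet been suspended
  standbySuspended: number; // standby pods past the idle timeout
}

type Source = "active" | "standbyWarm" | "standbySuspended" | "scaleUp";

// Serve one request, mutating the pools, and report where the pod came from and its latency.
function allocate(pools: Pools): { source: Source; latencySec: number } {
  if (pools.activeIdle > 0) {
    pools.activeIdle--;
    return { source: "active", latencySec: ACTIVE_LATENCY_SEC };
  }
  if (pools.standbyWarm > 0) {
    pools.standbyWarm--;
    // Assumption: a warm standby pod needs no resume, so it starts like an active pod.
    return { source: "standbyWarm", latencySec: ACTIVE_LATENCY_SEC };
  }
  if (pools.standbySuspended > 0) {
    pools.standbySuspended--;
    return { source: "standbySuspended", latencySec: RESUME_LATENCY_SEC };
  }
  return { source: "scaleUp", latencySec: SCALE_UP_LATENCY_SEC };
}
```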
The simulation maintains the target amounts for Active and Standby buffers through a continuous backfilling process:
- Active Buffer Backfilling: When a pod is taken from the Active Buffer to serve a request, the system immediately attempts to backfill it to its target capacity. This backfill is pulled from the Standby Buffer if available, or triggers a Scale-up if the Standby Buffer is also empty.
- Standby Buffer Backfilling: When the Standby Buffer is used, it triggers a Scale-up to replenish it back to its defined target amount.
- Suspension Logic:
  - Active Buffer Pods: These pods are always active and never become suspended.
  - Standby Buffer Pods: Once a Scale-up completes and the pod enters the Standby Buffer, its Idle Timeout (e.g., 5 minutes) begins. If the pod remains unused in the Standby Buffer for the full timeout, it becomes suspended, reducing its cost to the standby rate.
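The backfilling and suspension rules can be sketched as two small functions. `backfill` decides what to order after a buffer has been drawn down, and `isSuspended` applies the idle timeout. Everything here is hypothetical: the counter-based state, the 5-minute default, and the choice to count in-flight ("pending") pods toward each target so the same shortfall is not ordered twice.

```typescript
// Hypothetical buffer counters; "pending" pods have been ordered but are not ready yet.
interface BufferState {
  activeTarget: number;
  standbyTarget: number;
  activeReady: number;
  activePending: number;
  standbyReady: number;
  standbyPending: number;
}

interface BackfillPlan {
  promotedFromStandby: number; // standby pods moved into the Active Buffer
  scaleUpsForActive: number;   // new pods ordered for the Active Buffer
  scaleUpsForStandby: number;  // new pods ordered for the Standby Buffer
}

// Restore both buffers toward their targets, mutating `s` and returning the actions taken.
function backfill(s: BufferState): BackfillPlan {
  const plan: BackfillPlan = { promotedFromStandby: 0, scaleUpsForActive: 0, scaleUpsForStandby: 0 };

  // Active Buffer: pull from the Standby Buffer while it has pods, otherwise scale up.
  let activeDeficit = s.activeTarget - (s.activeReady + s.activePending);
  while (activeDeficit > 0) {
    if (s.standbyReady > 0) {
      s.standbyReady--;
      s.activePending++; // counted as pending until it is ready (it may need to resume)
      plan.promotedFromStandby++;
    } else {
      s.activePending++;
      plan.scaleUpsForActive++;
    }
    activeDeficit--;
  }

  // Standby Buffer: any shortfall, including pods just promoted, triggers scale-ups.
  const standbyDeficit = s.standbyTarget - (s.standbyReady + s.standbyPending);
  if (standbyDeficit > 0) {
    s.standbyPending += standbyDeficit;
    plan.scaleUpsForStandby = standbyDeficit;
  }
  return plan;
}

const IDLE_TIMEOUT_SEC = 5 * 60; // example value from this README

// Hypothetical per-pod record for a pod currently sitting idle in a buffer.
interface PodRecord {
  buffer: "active" | "standby";
  enteredBufferAtSec: number; // when the scale-up finished and the pod entered its buffer
}

// Active Buffer pods never suspend; an idle standby pod suspends after the full timeout.
function isSuspended(pod: PodRecord, nowSec: number, idleTimeoutSec: number = IDLE_TIMEOUT_SEC): boolean {
  if (pod.buffer === "active") return false;
  return nowSec - pod.enteredBufferAtSec >= idleTimeoutSec;
}
```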
Total cost is calculated as a time-weighted average based on the state of every pod in the system:
- Active State: Pods in the Active Buffer, pods currently serving requests, and standby pods that have not yet reached the idle timeout (100% cost).
- Standby/Suspended State: Pods in the Standby Buffer that have passed the idle timeout (e.g., 5% cost).
- Warming/Resuming State: Pods resuming from suspension or being provisioned by a Scale-up (typically 100% cost during the transition).
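You can compute a time-weighted average cost by weighting each interval a pod spends in a state by that state's cost rate, summing, and dividing by the simulated duration. The 5% suspended rate and 100% transition rate follow the examples above. Expressing the result in "fully active pod" units and the interval representation are assumptions of this sketch.

```typescript
type PodState = "active" | "suspended" | "transitioning";

// Cost rate per pod relative to a fully active pod (example values from this README).
const COST_RATE: Record<PodState, number> = {
  active: 1.0,
  suspended: 0.05,
  transitioning: 1.0,
};

// One interval during which a single pod stayed in one state.
interface StateInterval {
  state: PodState;
  startSec: number;
  endSec: number;
}

// Average cost over [0, durationSec], in units of "fully active pods".
function timeWeightedCost(intervals: StateInterval[], durationSec: number): number {
  if (durationSec <= 0) throw new Error("duration must be positive");
  let weightedPodSeconds = 0;
  for (const iv of intervals) {
    // Clip each interval to the simulation window before weighting it.
    const overlap = Math.min(iv.endSec, durationSec) - Math.max(iv.startSec, 0);
    if (overlap > 0) weightedPodSeconds += overlap * COST_RATE[iv.state];
  }
  return weightedPodSeconds / durationSec;
}
```

For example, one pod active for a full 100 s window plus a second pod that scales up for 20 s and then sits suspended for 80 s averages about 1.24 active-pod equivalents.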
The Capacity Optimizer searches the (Active, Standby) configuration space for the cheapest buffer configuration that meets your latency targets.
Since simulating every possible combination of Active and Standby buffers on a fine grid would be too slow, the optimizer uses a Contour Tracing (or Staircase) Algorithm:
- Monotonicity: Adding resources can only reduce or maintain latency. So if a configuration (Active, Standby) meets the targets, every configuration with at least as many Active and Standby pods also meets them. The valid region is therefore connected, and its boundary is a monotone staircase.
- Binary Search Initialization: The algorithm starts at Active = 0 and uses binary search to find the minimum Standby size that satisfies all targets.
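- Boundary Tracing: From the initial valid point, the algorithm increases the Active count and, for each value, finds the minimal Standby size that still meets the targets. By monotonicity, this minimum never increases as Active grows, so the search only needs to step down along the boundary, effectively tracing the Pareto front.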
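Here is a sketch of the staircase search. `isValid(a, s)` is a hypothetical stand-in for "simulate (a, s) and check every latency target", and the cost function in the usage note (active at 100%, standby at 5%) is a simplification that ignores transition costs. If Active = 0 has no valid Standby size, this version keeps binary-searching at higher Active counts until it finds one. That fallback is an assumption, not documented behavior.

```typescript
interface Candidate {
  active: number;
  standby: number;
  cost: number;
}

// Stand-in for "simulate this configuration and check every latency target".
type Validity = (active: number, standby: number) => boolean;

// Binary search for the smallest standby in [0, maxStandby] that is valid for `active`,
// relying on validity being monotone in standby. Returns null if none is valid.
function minStandby(active: number, maxStandby: number, isValid: Validity): number | null {
  if (!isValid(active, maxStandby)) return null;
  let lo = 0;
  let hi = maxStandby;
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (isValid(active, mid)) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}

// Trace the staircase boundary; return the cheapest point and the boundary cost curve.
function optimize(
  maxActive: number,
  maxStandby: number,
  isValid: Validity,
  cost: (active: number, standby: number) => number,
): { best: Candidate | null; curve: Candidate[] } {
  const curve: Candidate[] = [];
  let prev: number | null = null;
  for (let a = 0; a <= maxActive; a++) {
    let s: number | null;
    if (prev === null) {
      // No valid point yet: binary search, starting at Active = 0.
      s = minStandby(a, maxStandby, isValid);
    } else {
      // Monotonicity: the minimum for `a` is at most the minimum for `a - 1`, so only step down.
      s = prev;
      while (s > 0 && isValid(a, s - 1)) s--;
    }
    if (s === null) continue;
    curve.push({ active: a, standby: s, cost: cost(a, s) });
    prev = s;
    // Beyond here every boundary point has Standby = 0 and more Active pods,
    // so (assuming cost grows with Active) none can be cheaper.
    if (s === 0) break;
  }
  let best: Candidate | null = null;
  for (const c of curve) {
    if (best === null || c.cost < best.cost) best = c;
  }
  return { best, curve };
}
```

For example, with a validity rule of `a >= 3 && a + s >= 10` and a cost of `a + 0.05 * s`, the trace visits (3, 7), (4, 6), ..., (10, 0) and picks (3, 7) as the cheapest point.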
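Install dependencies with `npm install`, start the development server with `npm run dev`, and create a production build with `npm run build`.
- Framework: Vite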
- Language: TypeScript
- Styling: Tailwind CSS
- Charts: Chart.js
- Persistence: LocalStorage and URL Hash.
This project is licensed under the Apache 2.0 License.
We welcome contributions! Please see docs/contributing.md for more information.
We follow Google's Open Source Community Guidelines.
This is not an officially supported Google product.