example: scale to zero using KEDA #1048
📝 Walkthrough
Adds a new …
Changes: KEDA warm pool scale-to-zero example
Actionable comments posted: 5
🧹 Nitpick comments (1)
examples/keda-swp-scaling/python-sandbox-template.yaml (1)
4-5: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Point this comment at the actual dependent file.
`create-claim.py` never references `python-sandbox-template`; the real contract is `examples/keda-swp-scaling/sandboxwarmpool.yaml` via `spec.sandboxTemplateRef.name`. As written, the comment sends readers to the wrong file when they rename resources.
📝 Suggested fix
- # The create-claim.py expects the template to have this name
+ # sandboxwarmpool.yaml references this via spec.sandboxTemplateRef.name
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/keda-swp-scaling/python-sandbox-template.yaml` around lines 4 - 5, The comment is attached to the wrong template file and should point to the actual dependency used by create-claim.py. Update the reference so it documents examples/keda-swp-scaling/sandboxwarmpool.yaml and the spec.sandboxTemplateRef.name contract, since that is what the claim script relies on when matching resource names. Keep the note aligned with the real consumer and remove the misleading link to python-sandbox-template.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/keda-swp-scaling/create-claim.py`:
- Around line 28-31: The kubeconfig fallback in create-claim.py is too broad
because the current try/except around config.load_kube_config() catches all
failures and can hide real local config/auth errors. Narrow the catch in that
startup block to ConfigException, and only call config.load_incluster_config()
when KUBERNETES_SERVICE_HOST is set so the fallback happens only in-cluster; use
the existing config.load_kube_config and config.load_incluster_config calls to
locate the change.
- Around line 53-94: The load loop in create-claim.py is counting scheduled
threads as completed claims and can launch unbounded daemon workers. Update
create_claim and the main rate loop so concurrency is bounded with a worker
limit or thread pool, and only increment/report progress after
create_namespaced_custom_object finishes successfully. Keep the progress and
final totals tied to completed claim creations rather than thread starts.
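The kubeconfig fallback requested for `create-claim.py` above can be sketched roughly as follows. This is a minimal sketch and not the script's actual code: the helper name `load_cluster_config` and its injected parameters are hypothetical, chosen so the logic can run without a cluster. With the Kubernetes Python client, the three arguments would presumably be `config.load_kube_config`, `config.load_incluster_config`, and `config.ConfigException`.

```python
import os


def load_cluster_config(load_local, load_incluster, config_error, environ=None):
    """Load the local kubeconfig first; fall back to in-cluster config only in a pod.

    Hypothetical helper: the loaders and the exception type are injected so the
    sketch runs without the kubernetes package installed.
    """
    environ = os.environ if environ is None else environ
    try:
        load_local()
        return "kubeconfig"
    except config_error:
        # Only the narrow config error reaches this branch; exceptions of any
        # other type propagate to the caller instead of being hidden.
        if "KUBERNETES_SERVICE_HOST" not in environ:
            raise
        load_incluster()
        return "in-cluster"
```

Re-raising when `KUBERNETES_SERVICE_HOST` is absent keeps a broken local kubeconfig visible instead of masking it behind an in-cluster attempt that cannot succeed on a workstation.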
In `@examples/keda-swp-scaling/README.md`:
- Around line 171-175: The Stackdriver IAM example in the README uses a
hardcoded PROJECT_ID inside the principal URI, so update the command in the KEDA
IAM binding example to interpolate the actual $PROJECT_ID consistently. Make the
principal string in the gcloud projects add-iam-policy-binding example match the
same project variable used elsewhere in the snippet so the workload identity
principal resolves correctly for the KEDA operator.
In `@examples/keda-swp-scaling/scaledobject-stackdriver.yaml`:
- Around line 59-60: The Stackdriver ScaledObject’s target setting is
inconsistent with the Prometheus variant, so update the `targetValue` in
`scaledobject-stackdriver.yaml` to match the same claims/sec per replica
threshold used by the Prometheus example. Keep the `ScaledObject` configuration
aligned with the HPA/Prometheus semantics and adjust the nearby comment so it no
longer claims a different value “matches the HPA example.”
- Line 47: The Stackdriver scaledobject manifest currently hardcodes a specific
GCP project ID, so replace the projectId value in the scaledobject-stackdriver
YAML with a placeholder such as PROJECT_ID or YOUR_PROJECT_ID and make sure any
related example references use the same placeholder. Keep the manifest generic
by updating the field in the Stackdriver configuration block, and add a brief
note in the README explaining that users must substitute their own project ID
before applying the example.
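To see why the `targetValue` should agree between the Stackdriver and Prometheus ScaledObjects, here is a rough sketch of how a per-replica target turns a claim rate into a replica count. It assumes the usual HPA-style calculation for an average-value target, `ceil(metric / target)`, clamped to the configured bounds; the function name, the default bounds, and the sample numbers are illustrative and are not taken from the manifests.

```python
import math


def desired_replicas(claims_per_second, target_per_replica, min_replicas=0, max_replicas=10):
    """Approximate the replica count for an average-value target.

    Assumption: replicas = ceil(metric / target), clamped to [min, max].
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if claims_per_second <= 0:
        return min_replicas  # idle: allows scale to zero when min_replicas is 0
    wanted = math.ceil(claims_per_second / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Same load, different targets: the two scalers would size the pool differently.
print(desired_replicas(5, 1))  # 5
print(desired_replicas(5, 5))  # 1
```

Under this assumption, the same 5 claims/sec yields five warm replicas with a target of 1 but only one replica with a target of 5, which is why a mismatched comment or value changes behavior rather than just wording.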
📒 Files selected for processing (7)
- examples/keda-swp-scaling/README.md
- examples/keda-swp-scaling/create-claim.py
- examples/keda-swp-scaling/pod-monitoring.yaml
- examples/keda-swp-scaling/python-sandbox-template.yaml
- examples/keda-swp-scaling/sandboxwarmpool.yaml
- examples/keda-swp-scaling/scaledobject-prometheus.yaml
- examples/keda-swp-scaling/scaledobject-stackdriver.yaml
    @@ -0,0 +1,312 @@
    # SandboxWarmPool Scale-to-Zero with KEDA on GKE

Add this example to the site so that it shows up on the website.
Pull request overview
This PR adds a new end-to-end example showing how to scale SandboxWarmPool replicas down to zero on GKE using KEDA, and updates controller metrics so cold-start claim creation is labeled with the referenced warm pool name (enabling warmpool_name-scoped scaling queries).
Changes:
- Record cold-start `SandboxClaim` creation metrics with `claim.spec.warmPoolRef.name` instead of the hardcoded `"none"`.
- Update the existing controller test to assert the new `warmpool_name` label value.
- Add a complete `examples/keda-scale-to-zero/` walkthrough with manifests and a load generator script.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| extensions/controllers/sandboxclaim_controller.go | Updates cold-start metric recording to use the claim’s referenced warm pool name. |
| extensions/controllers/sandboxclaim_controller_test.go | Adjusts assertions to validate the new warmpool_name label value. |
| examples/keda-scale-to-zero/README.md | Adds an end-to-end guide for KEDA-based scale-to-zero on GKE (GMP + optional Stackdriver path). |
| examples/keda-scale-to-zero/scaledobject-prometheus.yaml | Adds KEDA ScaledObject using the Prometheus scaler against the GMP frontend. |
| examples/keda-scale-to-zero/scaledobject-stackdriver.yaml | Adds an alternative KEDA ScaledObject using Cloud Monitoring (Stackdriver) directly. |
| examples/keda-scale-to-zero/sandboxwarmpool.yaml | Adds a SandboxWarmPool manifest starting at replicas: 0 for KEDA control. |
| examples/keda-scale-to-zero/python-sandbox-template.yaml | Adds a SandboxTemplate manifest for the example warm pool. |
| examples/keda-scale-to-zero/pod-monitoring.yaml | Adds GMP PodMonitoring manifest to scrape controller metrics. |
| examples/keda-scale-to-zero/create-claim.py | Adds a Python script to generate SandboxClaim load (with TTL via lifecycle shutdown time). |
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: shrutiyam-glitch
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/keda-scale-to-zero/README.md (1)
1-319: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Add a docs page for `keda-scale-to-zero`
The README is mounted into `assets/additional/examples`, but there’s no `site/content/docs/use-cases/examples/keda-scale-to-zero/_index.md` or landing-page link, so it won’t show up in the examples nav/index yet.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/keda-scale-to-zero/README.md` around lines 1 - 319, The `keda-scale-to-zero` example is documented only in the README and won’t appear in the site navigation/index yet. Add the missing docs page at `site/content/docs/use-cases/examples/keda-scale-to-zero/_index.md` and wire it into the examples landing page so it is discoverable alongside the other examples; use the existing `examples/keda-scale-to-zero/README.md` content as the source and keep the page title/metadata aligned with the examples section.
Source: Path instructions
🧹 Nitpick comments (2)
examples/keda-scale-to-zero/README.md (1)
269-269: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Hyphenate compound modifier for grammar correctness.
"~1 minute window" should be "~1-minute window" (compound modifier before a noun).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/keda-scale-to-zero/README.md` at line 269, Update the wording in the KEDA scale-to-zero README to hyphenate the compound modifier in the sentence containing “~1 minute window.” Adjust the text so the modifier before the noun reads as “~1-minute window,” keeping the rest of the sentence unchanged.
Source: Linters/SAST tools
examples/keda-scale-to-zero/create-claim.py (1)
21-26: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Expose the load knobs instead of hard-coding them.
The PR describes this as a configurable load generator, but rate, duration, and TTL can only be changed by editing the file. Pulling them from env vars would make the example reusable as documented.
Suggested edit
 NAMESPACE = os.getenv("NAMESPACE", "keda-test")
 WARMPOOL = os.getenv("WARM_POOL_NAME", "python-sdk-warmpool")
-RATE_PER_SECOND = 5
-TEST_DURATION_MINUTES = 10
-CLAIM_TTL_SECONDS = 60
+RATE_PER_SECOND = int(os.getenv("RATE_PER_SECOND", "5"))
+TEST_DURATION_MINUTES = int(os.getenv("TEST_DURATION_MINUTES", "10"))
+CLAIM_TTL_SECONDS = int(os.getenv("CLAIM_TTL_SECONDS", "60"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/keda-scale-to-zero/create-claim.py` around lines 21 - 26, The load-generator settings are still hard-coded in the create-claim.py configuration block, so the example is not actually configurable. Update the top-level constants in create-claim.py (for example RATE_PER_SECOND, TEST_DURATION_MINUTES, and CLAIM_TTL_SECONDS alongside NAMESPACE and WARMPOOL) to read from environment variables with sensible defaults, and make sure the rest of the script uses those symbols so the load knobs can be changed without editing the file.
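One way to read the load knobs from the environment while still failing fast on bad input is sketched below. The `env_int` helper is hypothetical and goes slightly beyond the suggested edit by validating the value; the variable names and defaults mirror the constants named in the comment.

```python
import os


def env_int(name, default, minimum=1, environ=None):
    """Read an integer setting from the environment, falling back to a default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        # Exit with a readable message instead of a bare traceback.
        raise SystemExit(f"{name} must be an integer, got {raw!r}")
    if value < minimum:
        raise SystemExit(f"{name} must be >= {minimum}, got {value}")
    return value


RATE_PER_SECOND = env_int("RATE_PER_SECOND", 5)
TEST_DURATION_MINUTES = env_int("TEST_DURATION_MINUTES", 10)
CLAIM_TTL_SECONDS = env_int("CLAIM_TTL_SECONDS", 60)
```

With something like this in place, a run such as `RATE_PER_SECOND=20 python create-claim.py` would change the rate without editing the file.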
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/keda-scale-to-zero/create-claim.py`:
- Around line 60-69: The create flow in create_claim() and the surrounding loop
should not fire-and-forget background workers while incrementing totals upfront.
Replace the unbounded thread spawning with a bounded executor or equivalent
ownership mechanism, wait for all submitted create tasks to complete, and count
only successful create_namespaced_custom_object calls as created. Also surface
failures instead of swallowing them in the except block so the caller can
observe RBAC/CRD/config errors and apply backpressure when API latency is high.
In `@examples/keda-scale-to-zero/python-sandbox-template.yaml`:
- Around line 4-5: Update the comment in python-sandbox-template.yaml so it
correctly describes the linkage: the template name is consumed by
sandboxwarmpool.yaml through ${TEMPLATE_NAME}, not directly by create-claim.py.
Adjust the wording near the name field to reference the warm pool manifest and
keep the explanation aligned with the actual consumer, using the unique symbols
TEMPLATE_NAME and sandboxwarmpool.yaml to locate the spot.
In `@examples/keda-scale-to-zero/README.md`:
- Line 111: The troubleshooting guidance in the README uses inconsistent metric
label names for the controller metric exposed by the KEDA scale-to-zero example.
Update the references in the troubleshooting section and the Cloud Console query
example to use the same label as the metric description in the document, and
align the wording around the controller metric exposed by the relevant README
sections such as the metric description and troubleshooting/query examples. If
both labels are intended for different resources, explicitly distinguish them so
users know which label to query when investigating scale-from-zero issues.
📒 Files selected for processing (7)
- examples/keda-scale-to-zero/README.md
- examples/keda-scale-to-zero/create-claim.py
- examples/keda-scale-to-zero/pod-monitoring.yaml
- examples/keda-scale-to-zero/python-sandbox-template.yaml
- examples/keda-scale-to-zero/sandboxwarmpool.yaml
- examples/keda-scale-to-zero/scaledobject-prometheus.yaml
- examples/keda-scale-to-zero/scaledobject-stackdriver.yaml
    try:
        custom_api.create_namespaced_custom_object(
            group="extensions.agents.x-k8s.io",
            version="v1beta1",
            namespace=NAMESPACE,
            plural="sandboxclaims",
            body=body
        )
    except Exception as e:
        print(f"Error creating {name}: {e}")
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Own the create lifecycle instead of fire-and-forget threads.
This loop increments the total before each background request finishes, suppresses failures inside create_claim(), and never waits for the spawned workers. A bad RBAC/CRD/config setup can therefore report thousands of “claims created” while creating none, and sustained API latency can accumulate thousands of live threads with no backpressure. Use a bounded executor, wait for completion, and count successes separately from attempts. As per coding guidelines, "Concurrency: respect context.Context cancellation; avoid goroutines without lifetime ownership; protect shared state."
Suggested direction
-import threading
+from concurrent.futures import ThreadPoolExecutor, as_completed
@@
 def create_claim(index):
@@
     try:
         custom_api.create_namespaced_custom_object(
@@
             body=body
         )
+        return True
     except Exception as e:
         print(f"Error creating {name}: {e}")
+        return False
@@
-    counter = 0
+    attempts = 0
+    successes = 0
@@
-    while time.time() < end_time:
-        # Fire and forget the creation in a thread to avoid blocking the clock
-        threading.Thread(target=create_claim, args=(counter,), daemon=True).start()
-        counter += 1
+    futures = []
+    with ThreadPoolExecutor(max_workers=RATE_PER_SECOND * 2) as executor:
+        while time.time() < end_time:
+            futures.append(executor.submit(create_claim, attempts))
+            attempts += 1
@@
-        if counter % 10 == 0:
-            print(f"Progress: {counter} claims created...")
+            if attempts % 10 == 0:
+                print(f"Progress: {attempts} claim attempts scheduled...")
+
+        for future in as_completed(futures):
+            if future.result():
+                successes += 1
@@
-    print(f"Load test complete. Total claims created: {counter}")
+    print(f"Load test complete. {successes}/{attempts} claims created successfully.")

Also applies to: 80-103
🧰 Tools
🪛 Ruff (0.15.20)
[warning] 68-68: Do not catch blind exception: Exception
(BLE001)
Sources: Coding guidelines, Linters/SAST tools
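A self-contained sketch of the bounded approach described in this comment follows. It is not the script's code: `run_load` and its parameters are hypothetical, and `create_fn` stands in for the real `create_namespaced_custom_object` call, which is assumed to raise on failure. The sketch caps in-flight requests, waits for every submitted task, and counts successes separately from attempts.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_load(create_fn, rate_per_second, duration_seconds, max_in_flight):
    """Submit create_fn(index) at a fixed rate with bounded concurrency.

    Returns (attempts, successes, failures); failures holds the raised exceptions.
    """
    attempts = 0
    successes = 0
    failures = []
    pending = set()
    interval = 1.0 / rate_per_second
    deadline = time.monotonic() + duration_seconds

    def harvest(done):
        nonlocal successes
        for future in done:
            error = future.exception()
            if error is None:
                successes += 1
            else:
                failures.append(error)

    with ThreadPoolExecutor(max_workers=max_in_flight) as executor:
        while time.monotonic() < deadline:
            if len(pending) >= max_in_flight:
                # Backpressure: block until at least one request finishes.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                harvest(done)
            pending.add(executor.submit(create_fn, attempts))
            attempts += 1
            time.sleep(interval)
        # Own the lifecycle: wait for everything still in flight.
        done, _ = wait(pending)
        harvest(done)
    return attempts, successes, failures
```

Collecting the exceptions instead of printing them inside the worker lets the caller decide whether an RBAC or CRD error should abort the run; a final report can then print `successes` out of `attempts` rather than the number of scheduled tasks.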
    # The create-claim.py expects the template to have this name
    name: ${TEMPLATE_NAME}
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Fix the name-linkage comment.
create-claim.py never references the template directly; it targets the warm pool, and examples/keda-scale-to-zero/sandboxwarmpool.yaml is the manifest that consumes ${TEMPLATE_NAME}. As written, this comment points readers at the wrong object to keep in sync.
Suggested edit
- # The create-claim.py expects the template to have this name
+ # sandboxwarmpool.yaml refers to this template via spec.sandboxTemplateRef.name
    3. **Expose the controller metric via GKE Managed Service for Prometheus**:
       Apply the `pod-monitoring.yaml` to scrape the controller's `/metrics` endpoint. This exposes
       `agent_sandbox_claim_creation_total{warmpool_name="..."}` into GMP.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fix inconsistent metric label names in troubleshooting guidance.
The document states the controller exposes `agent_sandbox_claim_creation_total{warmpool_name="..."}` (Line 111) and explains that the controller records `warmpool_name` from `spec.warmPoolRef.name` (Line 265). However, the troubleshooting section tells users to "verify you filtered on sandbox_template" (Line 280) and the Cloud Console query example uses `{sandbox_template="$TEMPLATE_NAME"}` (Line 305). These are contradictory: `warmpool_name` and `sandbox_template` are different labels that refer to different resources. Users following the troubleshooting steps will query a non-existent label and fail to diagnose scale-from-zero issues.
Use warmpool_name consistently throughout, or if the metric actually carries both labels, clarify which label is used for which purpose.
Also applies to: 265-265, 280-280, 305-305
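To keep the label consistent across the README, the query text could be generated from a single definition, as in the sketch below. The metric name and the `warmpool_name` label come from the README text quoted above; the `sum(rate(...))` shape, the `1m` window, and the helper name `claim_rate_query` are assumptions for illustration, since the actual ScaledObject query is not shown here.

```python
METRIC = "agent_sandbox_claim_creation_total"
LABEL = "warmpool_name"  # the label the controller records, per the README


def claim_rate_query(warmpool_name, window="1m"):
    """Build a PromQL claims-per-second query scoped to one warm pool."""
    # Escape backslashes and double quotes so the label value stays valid PromQL.
    escaped = warmpool_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'sum(rate({METRIC}{{{LABEL}="{escaped}"}}[{window}]))'


print(claim_rate_query("python-sdk-warmpool"))
```

Reusing one such string in the scaler configuration, the troubleshooting steps, and the Cloud Console example would avoid the `sandbox_template` versus `warmpool_name` drift this comment describes.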
What this PR does / why we need it:
This pull request introduces a complete end-to-end guide and resource templates demonstrating how to implement scale-to-zero capabilities for warm sandbox pools on GKE using KEDA.
By default, warm pools must balance active instances with resource consumption. This example provides ready-to-use configurations to dynamically scale warm pools based on claim rates, allowing them to scale down to zero when idle.
Which issue(s) this PR is related to:
Ref: #677
Related issues: #1050
Release Note