optimization: optimize NodeSpread selection to run in-memory, resolving concurrent latency regression#939
Conversation
✅ Deploy Preview for agent-sandbox canceled.
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vicentefb The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
�� WalkthroughWalkthroughQueue deduplication and warm-pool sandbox adoption are refactored: ChangesWarm-Pool Adoption and Queue Refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
extensions/controllers/sandboxclaim_controller_test.go (1)
3967-3973: ⚡ Quick winAdd a regression assertion that skipped unready fallback remains queued.
This case proves selection, but it doesn’t verify that the first unready candidate is re-queued after adopting a later ready one. Please add that assertion to lock in queue-retention behavior.
As per coding guidelines "When fixing a bug, add a regression test that fails without the fix."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/controllers/sandboxclaim_controller_test.go` around lines 3967 - 3973, Add an assertion that the skipped unready sandbox ("sb-old-unready") remains queued after the controller adopts the later ready sandbox; locate the test case using existingSandboxes and createWarmPoolSandboxWithNode and, after asserting expectedAdoptedSandbox == "sb-young-ready", assert that the unready candidate "sb-old-unready" is still in the warm-pool queue (e.g., still present in the queue snapshot or has the queued status/phase used elsewhere in these tests).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@extensions/controllers/sandboxclaim_controller.go`:
- Around line 709-712: When you find a Ready candidate in the block guarded by
isSandboxReady(adopted) and are about to return adopted, adoptedKey, nil, first
requeue any previously saved fallback (the variables fallback and fallbackKey)
back onto the controller's work queue using the same enqueue mechanism this
controller uses (e.g., r.workqueue.Add or the controller's enqueue helper like
r.enqueueSandbox) so it isn't dropped; perform that requeue before the return.
Apply the same pattern to the other early-return at lines 716-722 so any
captured fallback is always requeued before returning a later-ready candidate.
---
Nitpick comments:
In `@extensions/controllers/sandboxclaim_controller_test.go`:
- Around line 3967-3973: Add an assertion that the skipped unready sandbox
("sb-old-unready") remains queued after the controller adopts the later ready
sandbox; locate the test case using existingSandboxes and
createWarmPoolSandboxWithNode and, after asserting expectedAdoptedSandbox ==
"sb-young-ready", assert that the unready candidate "sb-old-unready" is still in
the warm-pool queue (e.g., still present in the queue snapshot or has the queued
status/phase used elsewhere in these tests).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1ba7da81-82e7-43d4-ba2f-d9aaa7e2a50c
📒 Files selected for processing (3)
extensions/controllers/queue/simple_sandbox_queue.goextensions/controllers/sandboxclaim_controller.goextensions/controllers/sandboxclaim_controller_test.go
There was a problem hiding this comment.
Pull request overview
This PR refactors the warm-pool adoption “NodeSpread” strategy to avoid per-reconcile namespace-wide List calls by tracking NodeName directly in the warm-pool queue keys and performing node balancing from an in-memory queue snapshot.
Changes:
- Reworked smart candidate selection to use in-memory queue snapshots for node balancing and FIFO tie-breaking (removing the per-reconcile namespace
List). - Extended
queue.SandboxKeyto includeNodeName, and updated queue deduplication to be based onnamespace/namewhile supporting in-place NodeName updates. - Updated the sandbox watcher to enqueue updates when scheduling information becomes available; refreshed/added unit tests for the new selection behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| extensions/controllers/sandboxclaim_controller.go | Replaces List-based node counting with in-memory selection; updates sandbox watch behavior to keep queue node placement up to date. |
| extensions/controllers/sandboxclaim_controller_test.go | Updates/extends unit tests to validate FIFO/ready filtering and in-memory NodeSpread behavior. |
| extensions/controllers/queue/simple_sandbox_queue.go | Redefines SandboxKey to include NodeName and changes queue dedupe/update semantics to support dynamic scheduling updates. |
remove namespace filtering since it's out of scope address copilot comments
0d94d7e to
ed6b52a
Compare
|
/lgtm |
remove namespace filtering since it's out of scope address copilot comments
remove namespace filtering since it's out of scope address copilot comments
What this PR does / why we need it:
This PR optimizes the NodeSpread sandbox selection strategy to run purely in memory by tracking sandbox node placement directly in the queue keys.
The previous NodeSpread implementation #878 required calling r.List to retrieve all sandboxes in the namespace on every single claim reconciliation to calculate node workloads. Under high concurrent claim adoption (QPS=300, BURST_SIZE=300), this namespace list call triggered up to 270,000 object deep-copies, causing severe CPU/GC thrashing and cache read-lock contention.
This optimization resolves the latency regression by storing NodeName in
queue.SandboxKeyand calculating the node spread in memory from the queue snapshot. This drops the adoption complexity fromO(N)toO(1)local cache reads.Performance Comparison
1. Under Low Load (BURST_SIZE=100, QPS=50, WARMPOOL_SIZE=200)
2. Under High Stress (BURST_SIZE=300, QPS=300, WARMPOOL_SIZE=600, warm-pool-batch-size=1)
Testing with
batch-size=1to eliminate node-level container startup backoff delays.The tests are run based on this framework https://github.com/kubernetes-sigs/agent-sandbox/blob/main/dev/load-test/test-recipes/README-rapid-burst.md
When running high concurrent bursts ((creating 300+ claims concurrently), GKE nodes' default systemd cgroup driver can experience IPC D-Bus socket congestion, causing transient container startup failures (StartError). To prevent this node-level bottleneck, it is recommended to set the controller manager flag:
- "--sandbox-warm-pool-max-batch-size=1"This was the error that would trigger a
StartErrorstate in the pod of a warmpool:Changes
queue.SandboxKey: Redefined the key in simple_sandbox_queue.go to track scheduling node placement (NodeName).synchronizedQueueto track unique items usingNamespace/Namestring keys. This allows updating the scheduled node placement of sandboxes dynamically without pushing duplicate queue items.Pushlogic to blindly copy the incoming key'sNodeNameon updates, guaranteeing the memory queue always reflects the latest node scheduling state.sandboxEventHandlerin sandboxclaim_controller.go to trigger queue updates when a sandbox pod's node scheduling changes (oldSandbox.Status.NodeName != newSandbox.Status.NodeName).r.Listquery ingetCandidatewith a pure in-memorypickSmartselector. It counts the remaining keys per node in the queue snapshot to round-robin adoptions across the least-selected nodes, falling back to FIFO (oldest first) on ties or unscheduled keys.adoptingFallbackflag in the deferred queue-retention logic to robustly requeue the fallback key on all early-return paths except when intentionally adopted.TestSandboxClaimAdoptionStrategyverification in sandboxclaim_controller_test.go covering oldest-ready FIFO sorting, unready candidates filtering, and in-memory NodeSpread balancing, including regression tests for skipped queue retention.Which issue(s) this PR is related to:
#491
Release Note
Summary by CodeRabbit
Improvements
Bug Fixes
Tests