Skip to content

optimization: optimize NodeSpread selection to run in-memory, resolving concurrent latency regression#939

Merged
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
vicentefb:improveWarmpoolSelection
Jun 5, 2026
Merged

optimization: optimize NodeSpread selection to run in-memory, resolving concurrent latency regression#939
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
vicentefb:improveWarmpoolSelection

Conversation

@vicentefb

@vicentefb vicentefb commented Jun 5, 2026

Copy link
Copy Markdown
Member

What this PR does / why we need it:

This PR optimizes the NodeSpread sandbox selection strategy to run purely in memory by tracking sandbox node placement directly in the queue keys.

The previous NodeSpread implementation #878 required calling r.List to retrieve all sandboxes in the namespace on every single claim reconciliation to calculate node workloads. Under high concurrent claim adoption (QPS=300, BURST_SIZE=300), this namespace list call triggered up to 270,000 object deep-copies, causing severe CPU/GC thrashing and cache read-lock contention.

This optimization resolves the latency regression by storing NodeName in queue.SandboxKey and calculating the node spread in memory from the queue snapshot. This drops the adoption complexity from O(N) to O(1) local cache reads.

Performance Comparison

1. Under Low Load (BURST_SIZE=100, QPS=50, WARMPOOL_SIZE=200)

Metric / Percentile main Branch (Unoptimized) This Branch (Optimized) Performance Delta (Speedup)
P50 Latency 123.2 ms 98.5 ms 1.2x faster
P90 Latency 1,551.5 ms (1.5s) 219.5 ms 7.0x faster
P99 Latency 9,730.6 ms (9.7s) 246.9 ms 39.4x faster (~4000% speedup)

2. Under High Stress (BURST_SIZE=300, QPS=300, WARMPOOL_SIZE=600, warm-pool-batch-size=1)

Testing with batch-size=1 to eliminate node-level container startup backoff delays.

Metric / Percentile main Branch (Unoptimized) This Branch (Optimized) Performance Delta (Speedup)
P50 Latency 244.8 ms 173.4 ms 1.4x faster
P90 Latency 481.1 ms 237.7 ms 2.0x faster
P99 Latency 1,521.4 ms (1.52s) 393.4 ms (0.39s) 3.9x faster (~400% speedup)
        args:
        - "--leader-elect=true"
        - "--enable-tracing=true"
        - --zap-log-level=debug
        - --zap-encoder=json
        - --enable-pprof-debug
        - --kube-api-qps=1000
        - --kube-api-burst=2000
        - --sandbox-concurrent-workers=400
        - --sandbox-claim-concurrent-workers=400
        - --sandbox-warm-pool-concurrent-workers=1
        - --sandbox-warm-pool-max-batch-size=1

The tests are run based on this framework https://github.com/kubernetes-sigs/agent-sandbox/blob/main/dev/load-test/test-recipes/README-rapid-burst.md

When running high concurrent bursts ((creating 300+ claims concurrently), GKE nodes' default systemd cgroup driver can experience IPC D-Bus socket congestion, causing transient container startup failures (StartError). To prevent this node-level bottleneck, it is recommended to set the controller manager flag: - "--sandbox-warm-pool-max-batch-size=1"

This was the error that would trigger a StartError state in the pod of a warmpool:

spec.containers{python-agent}: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set unit properties: the maximum number of pending messages for service per connection has been reached

Changes

  • Redefined queue.SandboxKey: Redefined the key in simple_sandbox_queue.go to track scheduling node placement (NodeName).
  • Unique ID Queue Deduplication: Updated synchronizedQueue to track unique items using Namespace/Name string keys. This allows updating the scheduled node placement of sandboxes dynamically without pushing duplicate queue items.
  • Stale placement prevention: Updated queue's Push logic to blindly copy the incoming key's NodeName on updates, guaranteeing the memory queue always reflects the latest node scheduling state.
  • Dynamic Scheduling Events: Updated the watcher's sandboxEventHandler in sandboxclaim_controller.go to trigger queue updates when a sandbox pod's node scheduling changes (oldSandbox.Status.NodeName != newSandbox.Status.NodeName).
  • In-Memory Node-Spreading: Replaced the expensive r.List query in getCandidate with a pure in-memory pickSmart selector. It counts the remaining keys per node in the queue snapshot to round-robin adoptions across the least-selected nodes, falling back to FIFO (oldest first) on ties or unscheduled keys.
  • Idempotency & Requeue Safety (Copilot / CodeRabbit Fix): Fixed a queue leak where a parked unready sandbox fallback key would be permanently dropped from the queue if the loop exited early due to error or returned a ready candidate. Introduced a boolean adoptingFallback flag in the deferred queue-retention logic to robustly requeue the fallback key on all early-return paths except when intentionally adopted.
  • Unit Tests: Added a dedicated TestSandboxClaimAdoptionStrategy verification in sandboxclaim_controller_test.go covering oldest-ready FIFO sorting, unready candidates filtering, and in-memory NodeSpread balancing, including regression tests for skipped queue retention.

Which issue(s) this PR is related to:

#491

Release Note

- **extensions/controllers**: Optimized the NodeSpread warm pool sandbox selection strategy to run purely in memory, completely eliminating namespace list API/cache overhead and improving P99 concurrent claim latency by up to 4x.
- **extensions/queue**: Redefined SandboxKey to track scheduled node placement and updated thread-safe queue deduplication to handle dynamic scheduling updates.

Summary by CodeRabbit

  • Improvements

    • Enhanced warm-pool sandbox selection with better node-balancing and FIFO tie-breaking for more even placement and predictable selection.
    • Improved queue deduplication to keep queued entries consistent when multiple placements exist.
  • Bug Fixes

    • Fixed inconsistent removal and dequeue behavior when multiple placements for the same sandbox exist.
  • Tests

    • Updated and expanded selection and adoption tests to cover ordering, node-spread, and retry scenarios.
@netlify

netlify Bot commented Jun 5, 2026

Copy link
Copy Markdown

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit ed6b52a
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a227beadaac260008c45fd0
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from barney-s and janetkuo June 5, 2026 06:33
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 5, 2026
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 58a72d39-7e66-4d62-866b-099995f393a3

📥 Commits

Reviewing files that changed from the base of the PR and between 0d94d7e and ed6b52a.

📒 Files selected for processing (3)
  • extensions/controllers/queue/simple_sandbox_queue.go
  • extensions/controllers/sandboxclaim_controller.go
  • extensions/controllers/sandboxclaim_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • extensions/controllers/queue/simple_sandbox_queue.go
  • extensions/controllers/sandboxclaim_controller.go
  • extensions/controllers/sandboxclaim_controller_test.go

�� Walkthrough

Walkthrough

Queue deduplication and warm-pool sandbox adoption are refactored: SandboxKey now includes NodeName; queue deduplication uses namespace/name only, allowing node updates on duplicates; candidate selection applies node-spread heuristics with FIFO tie-breaking instead of API enumeration; event handling enqueues unscheduled-to-scheduled transitions.

Changes

Warm-Pool Adoption and Queue Refactor

Layer / File(s) Summary
SandboxKey Structure and Queue Deduplication Contract
extensions/controllers/queue/simple_sandbox_queue.go
SandboxKey struct adds explicit Namespace, Name, NodeName fields; internal queue synchronizedQueue uses a map[string]struct{} keyed by namespace/name uniqueID to deduplicate; Push updates existing queued item's NodeName when a duplicate namespace/name is pushed.
Queue Operations with Namespace/Name Deduplication
extensions/controllers/queue/simple_sandbox_queue.go
Remove, Pop, and PopWithStrategy compute uniqueID from Namespace/Name for presence checks, match/remove items in the slice by Namespace/Name instead of full key equality, and clean up the dedupe set on removal.
Warm-Pool Adoption Strategy Refactor
extensions/controllers/sandboxclaim_controller.go
getCandidate introduces fallbackSandbox and replaces API enumeration + smartSelector with an in-memory pickSmart callback that groups queued keys by NodeName (scheduled vs unscheduled), applies node-spread selection with per-node queue counts and FIFO tie-breaking among ready sandboxes, tracks a single fallback for the first valid-but-not-ready candidate, and requeues cross-namespace candidates. sandboxEventHandler now enqueues transitions where a sandbox becomes node-scheduled including NodeName in the queued key. The smartSelector type was removed.
Test Coverage for Adoption and Queue Changes
extensions/controllers/sandboxclaim_controller_test.go
Tests seed warm-pool queues with NodeName in queue.SandboxKey; TestSandboxClaimAdoptionStrategy is reworked to cover FIFO ready ordering, skipping unready sandboxes, node-spread balancing, and asserts expected remaining queued keys after reconciliation. Several expectations adjusted and TestSmartSelector removed.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • kubernetes-sigs/agent-sandbox#878: Both PRs refactor warm-pool sandbox adoption to be node- and readiness-aware by evolving queue keying/deduplication and candidate selection strategy.

Suggested labels

ok-to-test, size/XL, ready-for-review

Suggested reviewers

  • justinsb
  • igooch

Poem

🐰 I nudged the queue with whiskered grace,

each sandbox now knows its destined place.
Namespace binds, NodeName may change,
smart picks spread loads across the range.
Hop, adopt, and balance — springtime for the space!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main optimization: moving NodeSpread selection to in-memory processing to resolve concurrent latency regression.
Description check ✅ Passed The description is comprehensive and well-structured, covering purpose, performance data, detailed changes, related issue, and release notes as required by the template.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 5, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
extensions/controllers/sandboxclaim_controller_test.go (1)

3967-3973: ⚡ Quick win

Add a regression assertion that skipped unready fallback remains queued.

This case proves selection, but it doesn’t verify that the first unready candidate is re-queued after adopting a later ready one. Please add that assertion to lock in queue-retention behavior.

As per coding guidelines "When fixing a bug, add a regression test that fails without the fix."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/controllers/sandboxclaim_controller_test.go` around lines 3967 -
3973, Add an assertion that the skipped unready sandbox ("sb-old-unready")
remains queued after the controller adopts the later ready sandbox; locate the
test case using existingSandboxes and createWarmPoolSandboxWithNode and, after
asserting expectedAdoptedSandbox == "sb-young-ready", assert that the unready
candidate "sb-old-unready" is still in the warm-pool queue (e.g., still present
in the queue snapshot or has the queued status/phase used elsewhere in these
tests).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@extensions/controllers/sandboxclaim_controller.go`:
- Around line 709-712: When you find a Ready candidate in the block guarded by
isSandboxReady(adopted) and are about to return adopted, adoptedKey, nil, first
requeue any previously saved fallback (the variables fallback and fallbackKey)
back onto the controller's work queue using the same enqueue mechanism this
controller uses (e.g., r.workqueue.Add or the controller's enqueue helper like
r.enqueueSandbox) so it isn't dropped; perform that requeue before the return.
Apply the same pattern to the other early-return at lines 716-722 so any
captured fallback is always requeued before returning a later-ready candidate.

---

Nitpick comments:
In `@extensions/controllers/sandboxclaim_controller_test.go`:
- Around line 3967-3973: Add an assertion that the skipped unready sandbox
("sb-old-unready") remains queued after the controller adopts the later ready
sandbox; locate the test case using existingSandboxes and
createWarmPoolSandboxWithNode and, after asserting expectedAdoptedSandbox ==
"sb-young-ready", assert that the unready candidate "sb-old-unready" is still in
the warm-pool queue (e.g., still present in the queue snapshot or has the queued
status/phase used elsewhere in these tests).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1ba7da81-82e7-43d4-ba2f-d9aaa7e2a50c

📥 Commits

Reviewing files that changed from the base of the PR and between ea6f51a and 0d94d7e.

📒 Files selected for processing (3)
  • extensions/controllers/queue/simple_sandbox_queue.go
  • extensions/controllers/sandboxclaim_controller.go
  • extensions/controllers/sandboxclaim_controller_test.go
Comment thread extensions/controllers/sandboxclaim_controller.go

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the warm-pool adoption “NodeSpread” strategy to avoid per-reconcile namespace-wide List calls by tracking NodeName directly in the warm-pool queue keys and performing node balancing from an in-memory queue snapshot.

Changes:

  • Reworked smart candidate selection to use in-memory queue snapshots for node balancing and FIFO tie-breaking (removing the per-reconcile namespace List).
  • Extended queue.SandboxKey to include NodeName, and updated queue deduplication to be based on namespace/name while supporting in-place NodeName updates.
  • Updated the sandbox watcher to enqueue updates when scheduling information becomes available; refreshed/added unit tests for the new selection behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
extensions/controllers/sandboxclaim_controller.go Replaces List-based node counting with in-memory selection; updates sandbox watch behavior to keep queue node placement up to date.
extensions/controllers/sandboxclaim_controller_test.go Updates/extends unit tests to validate FIFO/ready filtering and in-memory NodeSpread behavior.
extensions/controllers/queue/simple_sandbox_queue.go Redefines SandboxKey to include NodeName and changes queue dedupe/update semantics to support dynamic scheduling updates.
Comment thread extensions/controllers/sandboxclaim_controller.go Outdated
Comment thread extensions/controllers/sandboxclaim_controller.go
Comment thread extensions/controllers/sandboxclaim_controller.go
Comment thread extensions/controllers/queue/simple_sandbox_queue.go
Comment thread extensions/controllers/sandboxclaim_controller.go
remove namespace filtering since it's out of scope

address copilot comments
@vicentefb vicentefb force-pushed the improveWarmpoolSelection branch from 0d94d7e to ed6b52a Compare June 5, 2026 07:34
@aditya-shantanu

Copy link
Copy Markdown
Collaborator

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2026
@k8s-ci-robot k8s-ci-robot merged commit f4bbb02 into kubernetes-sigs:main Jun 5, 2026
12 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Agent Sandbox Jun 5, 2026
khirotaka pushed a commit to khirotaka/agent-sandbox that referenced this pull request Jun 12, 2026
remove namespace filtering since it's out of scope

address copilot comments
alexatakvelon pushed a commit to volatilemolotov/agent-sandbox that referenced this pull request Jun 24, 2026
remove namespace filtering since it's out of scope

address copilot comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ready-for-review size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants