optimization: replace .Update() with .Patch() for claim updateStatus by vicentefb · Pull Request #508 · kubernetes-sigs/agent-sandbox

vicentefb · 2026-04-02T19:00:52Z

In an effort to reduce "Operation cannot be fulfilled..." conflicts at scale, this PR switches to patching to the status of Sandbox Claim resrouce status.

Tests from main without this change indicate:

322 operation cannot be fulfilled conflicts from sandboxclaim (protoPayload.resourceName="pods/sandboxclaim-" OR protoPayload.resourceName="sandboxclaims/")

With this change, 0 conflicts.

Test paramters:

# BURST_SIZE * TOTAL_BURSTS = Total sandbox claims created
BURST_SIZE=300
QPS=300
TOTAL_BURSTS=5
WARMPOOL_SIZE=600
RUNTIME_CLASS="" # Change to "gvisor" if your cluster supports it

Deployment args

        args:
        - "--leader-elect=true"
        - "--extensions"
        - "--enable-tracing=true"
        - --zap-log-level=debug
        - --zap-encoder=json
        - --enable-pprof-debug
        - --kube-api-qps=1000
        - --kube-api-burst=2000
        - --sandbox-concurrent-workers=400
        - --sandbox-claim-concurrent-workers=400
        - --sandbox-warm-pool-concurrent-workers=1

netlify · 2026-04-02T19:00:59Z

✅ Deploy Preview for agent-sandbox canceled.

Name	Link
🔨 Latest commit	`eab585b`
🔍 Latest deploy log	https://app.netlify.com/projects/agent-sandbox/deploys/6a0c92af181e3200087887c2

justinsb · 2026-04-02T19:34:51Z

Let's have the "are there two controllers acting on the same object" discussion on #509

vicentefb · 2026-04-03T23:33:06Z

I ran two very small tests by having 1 claim with a warmpool of size 2.

BURST_SIZE=1
QPS=1
TOTAL_BURSTS=10
WARMPOOL_SIZE=2

From oss main:

Using Update: It tries to update the status at 6. Just later (9), it tries to update it again and hits a 409 Conflict because the ResourceVersion shifted.

Also something to note is that there's a pod collision in 26 due to this update. The controller is trying to update the pod's labels while the Kubelet is simultaneously updating the pod's status. Separate issue where could use .Patch()

Step	Delta (ms)	Method	Target Resource	Result / Notes
1	0	PATCH	sandboxclaims/agent-claim-4	🟢 Test Runner injects Claim
2	17	UPDATE	sandboxes/warmpool-0-6qkhx	🟢 Claim Controller adopts Sandbox
3	28	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found
4	32	CREATE	events/agent-claim-4...	🟢 K8s records adoption event
5	42	CREATE	sandboxes	🟢 WarmPool orders replacement
6	45	UPDATE	sandboxclaims/agent-claim-4/status	🟢 Claim Controller updates status
7	56	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found (Retry)
8	58	PATCH	sandboxwarmpools/warmpool-0/status	🟢 WarmPool updates pool status
9	67	UPDATE	sandboxclaims/agent-claim-4/status	🔴 409 Conflict (Claim collides w/ itself)
10	68	UPDATE	pods/warmpool-0-6qkhx	🟢 Sandbox Controller updates pod
11	71	PATCH	sandboxes/warmpool-0-qcblt	🟢 WarmPool configures replacement
12	77	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found (Retry)
13	85	PATCH	sandboxwarmpools/warmpool-0/status	🟢 WarmPool updates pool status
14	90	PATCH	sandboxes/warmpool-0-6qkhx/status	🟢 SUCCESS: Sandbox Status patched
15	97	PATCH	sandboxwarmpools/warmpool-0/status	🟢 WarmPool updates pool status
16	111	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found (Retry)
17	127	CREATE	pods/warmpool-0-qcblt	🟢 Sandbox Controller provisions pod
18	129	UPDATE	sandboxclaims/agent-claim-4/status	🟢 Retry Succeeds: Claim status resolves
19	132	UPDATE	pods/warmpool-0-6qkhx	🟢 Sandbox Controller updates pod
20	147	PATCH	sandboxes/warmpool-0-6qkhx/status	🟢 SUCCESS: Sandbox Status patched again
21	150	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found (Retry)
22	152	PATCH	sandboxes/warmpool-0-qcblt	🟢 WarmPool configures replacement
23	166	CREATE	services/warmpool-0-qcblt	🟢 Service provisioned
24	191	UPDATE	pods/warmpool-0-6qkhx	🟢 Sandbox Controller updates pod
25	193	PATCH	sandboxes/warmpool-0-qcblt/status	🟢 SUCCESS: New Sandbox Status patched
26	244	UPDATE	pods/warmpool-0-qcblt	🔴 409 Conflict (Sandbox collides w/ Kubelet)
27	264	PATCH	sandboxes/warmpool-0-qcblt/status	🟢 SUCCESS: New Sandbox Status patched again
28	312	UPDATE	pods/warmpool-0-qcblt	🟢 Sandbox Controller updates pod
29	361	UPDATE	pods/warmpool-0-qcblt	🟢 Sandbox Controller updates pod
30	380	PATCH	sandboxes/warmpool-0-qcblt/status	🟢 SUCCESS: New Sandbox Status patched again
31	715	UPDATE	pods/warmpool-0-qcblt	🟢 Sandbox Controller updates pod
32	734	PATCH	sandboxes/warmpool-0-qcblt/status	🟢 SUCCESS: New Sandbox Status patched again
33	758	PATCH	sandboxwarmpools/warmpool-0/status	🟢 WarmPool updates pool status
34	796	UPDATE	pods/warmpool-0-qcblt	🟢 Sandbox Controller updates pod

With this PR:

Using Patch: It fires four separate status updates to two different sandboxes (6, 9, 14, 18, and 21.). Even though the rest of the cluster is actively changing metadata and provisioning resources, every single one of those patches succeeds with zero conflicts.

Step	Delta (ms)	Method	Target Resource	Result / Notes
1	0	PATCH	sandboxclaims/agent-claim-4	🟢 Start: Test Runner injects Claim 4
2	19	UPDATE	sandboxes/warmpool-0-8wljs	🟢 Claim Controller adopts Sandbox
3	31	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found
4	37	CREATE	events/agent-claim-4.18a...	🟢 K8s records Sandbox adoption event
5	45	CREATE	sandboxes	🟢 WarmPool orders replacement Sandbox
6	49	PATCH	sandboxclaims/agent-claim-4/status	🟢 SUCCESS: Claim Status Patched
7	60	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found (Reconcile retry)
8	63	PATCH	sandboxwarmpools/warmpool-0/status	🟢 WarmPool updates pool status
9	70	PATCH	sandboxclaims/agent-claim-4/status	🟢 SUCCESS: Claim Status Patched
10	74	PATCH	sandboxes/warmpool-0-7drtx	🟢 WarmPool configures replacement sandbox
11	81	UPDATE	pods/warmpool-0-8wljs	🟢 Sandbox Controller updates adopted pod
12	81	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found (Reconcile retry)
13	92	PATCH	sandboxwarmpools/warmpool-0/status	🟢 WarmPool updates pool status
14	100	PATCH	sandboxes/warmpool-0-8wljs/status	🟢 Sandbox Controller updates status
15	116	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found (Reconcile retry)
16	116	PATCH	sandboxwarmpools/warmpool-0/status	🟢 WarmPool updates pool status
17	126	CREATE	pods/warmpool-0-7drtx	🟢 Sandbox Controller provisions replacement pod
18	136	PATCH	sandboxclaims/agent-claim-4/status	🟢 SUCCESS: Claim Status Patched
19	140	UPDATE	pods/warmpool-0-8wljs	🟢 Sandbox Controller updates adopted pod
20	145	PATCH	sandboxes/warmpool-0-7drtx	🟢 WarmPool configures replacement sandbox
21	153	PATCH	sandboxes/warmpool-0-8wljs/status	🟢 Sandbox Controller updates status
22	156	DELETE	networkpolicies/agent-claim-4-network-policy	🟡 404 Not Found (Reconcile retry)
23	159	CREATE	services/warmpool-0-7drtx	🟢 Service provisioned for new pod
24	177	PATCH	sandboxes/warmpool-0-7drtx/status	🟢 Sandbox Controller updates new pod status
25	186	UPDATE	pods/warmpool-0-8wljs	🟢 Sandbox Controller updates adopted pod
26	217	UPDATE	pods/warmpool-0-7drtx	🔴 409 Conflict (Sandbox Controller collides w/ Kubelet on Pod)
27	236	PATCH	sandboxes/warmpool-0-7drtx/status	🟢 Sandbox Controller updates status
28	279	UPDATE	pods/warmpool-0-7drtx	🟢 Retry Succeeds: Pod updated.
29	337	PATCH	sandboxes/warmpool-0-7drtx/status	🟢 Sandbox Controller updates status
30	932	UPDATE	pods/warmpool-0-7drtx	🟢 Sandbox Controller updates replacement pod
31	950	PATCH	sandboxes/warmpool-0-7drtx/status	🟢 Sandbox Controller updates status
32	975	PATCH	sandboxwarmpools/warmpool-0/status	🟢 WarmPool updates pool status
33	999	UPDATE	pods/warmpool-0-7drtx	🟢 Sandbox Controller updates replacement pod

Copilot

Pull request overview

This PR aims to reduce high-frequency Operation cannot be fulfilled... write conflicts by switching SandboxClaim status updates from Status().Update(...) to a merge Status().Patch(...) in the SandboxClaim controller reconcile flow.

Changes:

Replace r.Status().Update(ctx, claim) with r.Status().Patch(ctx, claim, client.MergeFrom(oldClaim)) for SandboxClaim status updates.
Add additional status patch scaffolding (oldClaim copy) and a success log line.

bittermandel · 2026-05-18T15:44:42Z

Any chance we can merge this? I believe I'm hitting some issues as we dont immediately retry on failed update, so some claims become ready quite slow

vicentefb · 2026-05-18T16:07:33Z

I can update the PR, in the meantime do you mind please sharing under what circumstances are you seeing your issue of:

we dont immediately retry on failed update, so some claims become ready quite slow

@bittermandel thanks!

barney-s · 2026-05-18T19:51:42Z

Have we identified the rootcause of the conflict in the first place @vicentefb
We know the patch helps but what is causing it in the first place.

bittermandel · 2026-05-18T19:57:03Z

I can update the PR, in the meantime do you mind please sharing under what circumstances are you seeing your issue of:

we dont immediately retry on failed update, so some claims become ready quite slow

@bittermandel thanks!

we return the err here immediately with an empty result, so there's no guaranteed reconcile immediately after.
https://github.com/vicentefb/agent-sandbox/blob/203902d7f295e2e5ea23e717149db716ac9ff1f9/extensions/controllers/sandboxclaim_controller.go#L165

bittermandel · 2026-05-18T20:11:28Z

Have we identified the rootcause of the conflict in the first place @vicentefb We know the patch helps but what is causing it in the first place.

I believe this comes from us initializing annotations here, then doing an update immediately with perhaps a state object: https://github.com/vicentefb/agent-sandbox/blob/cff4413a8d9e72dcefacf1fb03b03474c944faf1/extensions/controllers/sandboxclaim_controller.go#L169C2-L172C3

barney-s · 2026-05-18T20:24:51Z

I can update the PR, in the meantime do you mind please sharing under what circumstances are you seeing your issue of:

we dont immediately retry on failed update, so some claims become ready quite slow

@bittermandel thanks!

we return the err here immediately with an empty result, so there's no guaranteed reconcile immediately after. https://github.com/vicentefb/agent-sandbox/blob/203902d7f295e2e5ea23e717149db716ac9ff1f9/extensions/controllers/sandboxclaim_controller.go#L165

the controller-runtime requeues on error. But it is subject to back off .. so it would retry.
Patch is definitely preferable just as long as we understand where the conflict came from.

vicentefb · 2026-05-18T21:33:17Z

1: initializeAnnotations() patches annotations using r.Patch(). While this updates the resourceVersion in the API server, the controller's cached informer does not immediately reflect this update (there is always a cache lag of a few milliseconds).
2: If a warm sandbox adoption happens, adoptSandboxFromCandidates() calls r.Update(ctx, claim). Because the local claim object in memory might still reference the old resourceVersion (or be slightly out of sync with the cache), this Update is susceptible to a conflict.
3: Finally, updateStatus() called r.Status().Update(). Since the object had already been modified in the previous steps (or by concurrent events), the memory representation's resourceVersion was almost guaranteed to be out of sync with the server's version, resulting in a conflict.

this issue happens on single resources as well, (not necessarily at scale).

barney-s · 2026-05-19T16:35:37Z

 		return err
 	}

+	logger.V(1).Info("Successfully patched sandboxclaim status",


should we level at a lower level ?

@bittermandel any suggestions.

changed to 4

updated PR based on copilot and community comments changed logging level

barney-s · 2026-05-19T16:48:07Z

/lgtm
/approve

k8s-ci-robot · 2026-05-19T16:48:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: barney-s, vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~extensions/controllers/OWNERS~~ [barney-s,vicentefb]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…icate Under burst load (for example when a warm pool drains and re-mints many sandboxes at once), the Sandbox controller issues many status writes in quick succession. Status().Update() carries a ResourceVersion precondition, so when the controller's cache lags its own writes the cached ResourceVersion is stale and the API server rejects the write with a 409 Conflict. The controller requeues and retries, and the conflict/retry storm starves forward progress. This re-lands the sandbox status patch from kubernetes-sigs#509 (reverted in kubernetes-sigs#526) together with the reconcile predicate from kubernetes-sigs#532 that the revert showed was also needed, and adds write observability: - Write Sandbox status with a non-optimistic merge patch (client.MergeFrom) instead of Update, removing the ResourceVersion precondition. This matches the patterns already merged for the claim (kubernetes-sigs#508) and warm pool (kubernetes-sigs#387) controllers. - Add a Pod watch predicate that reconciles only on Phase, Ready, or PodIPs changes, suppressing the redundant reconciles the Owns(Pod) watch generates on every Kubelet phase transition. The PodIPs clause prevents a PodReady-before-PodIPs race from leaving status stuck at "Ready but no podIPs yet"; AnnotationChangedPredicate is intentionally omitted because the controller patches its own annotations. - Treat NotFound on status and annotation writes as a no-op instead of an error, so objects deleted mid-reconcile no longer cause requeues. - Add per-write metrics (agent_sandbox_writes_total, agent_sandbox_write_duration_seconds) labeled by controller, action, subresource, verb, and result, to attribute conflicts and 404s to a specific write path. Measured on a GKE cluster (150 warm pool, 10 QPS burst, c4 nodes): status write conflicts dropped from about 43/s (38% of status writes) to 0, and a 150-pool / 10 QPS burst completed 600/600 claims with zero failures at startup p50 0.83s, where the prior Update-based build conflict-stormed. Adds unit tests covering the stale-ResourceVersion conflict versus merge patch, the deleted-object no-requeue behavior, and the Pod predicate.

…ubernetes-sigs#508) updated PR based on copilot and community comments changed logging level

k8s-ci-robot requested review from barney-s and janetkuo April 2, 2026 19:00

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 2, 2026

justinsb mentioned this pull request Apr 2, 2026

optimization: replace .Update() with .Patch() for sandbox updateStatus #509

Merged

justinsb self-assigned this Apr 2, 2026

justinsb added the action-required: resolve-copilot-comments label Apr 2, 2026

vicentefb force-pushed the patchClaimStatus branch from 2909e6b to 203902d Compare April 3, 2026 22:13

vicentefb mentioned this pull request Apr 6, 2026

[Performance] Mitigate redundant Reconciles #527

Open

janetkuo requested a review from Copilot April 21, 2026 17:00

Copilot started reviewing on behalf of janetkuo April 21, 2026 17:00 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Comment thread extensions/controllers/sandboxclaim_controller.go Outdated

Comment thread extensions/controllers/sandboxclaim_controller.go Outdated

janetkuo added this to Agent Sandbox Apr 28, 2026

github-project-automation Bot moved this to Backlog in Agent Sandbox Apr 28, 2026

janetkuo removed this from Agent Sandbox Apr 29, 2026

janetkuo added this to Agent Sandbox Apr 29, 2026

github-project-automation Bot moved this to Backlog in Agent Sandbox Apr 29, 2026

janetkuo removed this from Agent Sandbox Apr 29, 2026

janetkuo added this to Agent Sandbox Apr 29, 2026

github-project-automation Bot moved this to Backlog in Agent Sandbox Apr 29, 2026

bittermandel reviewed May 18, 2026

View reviewed changes

Comment thread extensions/controllers/sandboxclaim_controller.go Outdated

vicentefb force-pushed the patchClaimStatus branch from 203902d to 9f5e79c Compare May 18, 2026 21:20

janetkuo added ready-for-review and removed action-required: resolve-copilot-comments labels May 18, 2026

barney-s reviewed May 19, 2026

View reviewed changes

optimization: replace .Update() with .Patch() for claim updateStatus

eab585b

updated PR based on copilot and community comments changed logging level

vicentefb force-pushed the patchClaimStatus branch from 9f5e79c to eab585b Compare May 19, 2026 16:41

k8s-ci-robot assigned barney-s May 19, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026

k8s-ci-robot merged commit 0498c10 into kubernetes-sigs:main May 19, 2026
11 checks passed

github-project-automation Bot moved this from Backlog to Done in Agent Sandbox May 19, 2026

geojaz mentioned this pull request May 29, 2026

fix: write Sandbox status with merge-patch and add Pod reconcile predicate #882

Open

khirotaka pushed a commit to khirotaka/agent-sandbox that referenced this pull request Jun 12, 2026

optimization: replace .Update() with .Patch() for claim updateStatus (k…

d8a841a

…ubernetes-sigs#508) updated PR based on copilot and community comments changed logging level

alexatakvelon pushed a commit to volatilemolotov/agent-sandbox that referenced this pull request Jun 24, 2026

optimization: replace .Update() with .Patch() for claim updateStatus (k…

7d44b27

…ubernetes-sigs#508) updated PR based on copilot and community comments changed logging level

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

optimization: replace .Update() with .Patch() for claim updateStatus#508

optimization: replace .Update() with .Patch() for claim updateStatus#508
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
vicentefb:patchClaimStatus

vicentefb commented Apr 2, 2026 •

edited

Loading

netlify Bot commented Apr 2, 2026 •

edited

Loading

justinsb commented Apr 2, 2026

vicentefb commented Apr 3, 2026

Copilot AI left a comment

Uh oh!

Uh oh!

bittermandel commented May 18, 2026

Uh oh!

vicentefb commented May 18, 2026

barney-s commented May 18, 2026

bittermandel commented May 18, 2026

bittermandel commented May 18, 2026

barney-s commented May 18, 2026

vicentefb commented May 18, 2026 •

edited

Loading

barney-s May 19, 2026

vicentefb May 19, 2026

barney-s commented May 19, 2026

k8s-ci-robot commented May 19, 2026

Uh oh!

Labels

7 participants

Uh oh!

Conversation

vicentefb commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

netlify Bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for agent-sandbox canceled.

justinsb commented Apr 2, 2026

vicentefb commented Apr 3, 2026

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

bittermandel commented May 18, 2026

Uh oh!

vicentefb commented May 18, 2026

barney-s commented May 18, 2026

bittermandel commented May 18, 2026

bittermandel commented May 18, 2026

barney-s commented May 18, 2026

vicentefb commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

barney-s May 19, 2026

Choose a reason for hiding this comment

vicentefb May 19, 2026

Choose a reason for hiding this comment

barney-s commented May 19, 2026

k8s-ci-robot commented May 19, 2026

Uh oh!

Labels

7 participants

vicentefb commented Apr 2, 2026 •

edited

Loading

netlify Bot commented Apr 2, 2026 •

edited

Loading

vicentefb commented May 18, 2026 •

edited

Loading