Skip to content

optimization: replace .Update() with .Patch() for claim updateStatus#508

Merged
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
vicentefb:patchClaimStatus
May 19, 2026
Merged

optimization: replace .Update() with .Patch() for claim updateStatus#508
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
vicentefb:patchClaimStatus

Conversation

@vicentefb

@vicentefb vicentefb commented Apr 2, 2026

Copy link
Copy Markdown
Member

In an effort to reduce "Operation cannot be fulfilled..." conflicts at scale, this PR switches to patching to the status of Sandbox Claim resrouce status.

Tests from main without this change indicate:

322 operation cannot be fulfilled conflicts from sandboxclaim (protoPayload.resourceName="pods/sandboxclaim-" OR protoPayload.resourceName="sandboxclaims/")

With this change, 0 conflicts.

Test paramters:

# BURST_SIZE * TOTAL_BURSTS = Total sandbox claims created
BURST_SIZE=300
QPS=300
TOTAL_BURSTS=5
WARMPOOL_SIZE=600
RUNTIME_CLASS="" # Change to "gvisor" if your cluster supports it

Deployment args

        args:
        - "--leader-elect=true"
        - "--extensions"
        - "--enable-tracing=true"
        - --zap-log-level=debug
        - --zap-encoder=json
        - --enable-pprof-debug
        - --kube-api-qps=1000
        - --kube-api-burst=2000
        - --sandbox-concurrent-workers=400
        - --sandbox-claim-concurrent-workers=400
        - --sandbox-warm-pool-concurrent-workers=1
@netlify

netlify Bot commented Apr 2, 2026

Copy link
Copy Markdown

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit eab585b
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a0c92af181e3200087887c2
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 2, 2026
@justinsb justinsb self-assigned this Apr 2, 2026
@justinsb

justinsb commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Let's have the "are there two controllers acting on the same object" discussion on #509

@vicentefb

Copy link
Copy Markdown
Member Author

I ran two very small tests by having 1 claim with a warmpool of size 2.

BURST_SIZE=1
QPS=1
TOTAL_BURSTS=10
WARMPOOL_SIZE=2

From oss main:

Using Update: It tries to update the status at 6. Just later (9), it tries to update it again and hits a 409 Conflict because the ResourceVersion shifted.

Also something to note is that there's a pod collision in 26 due to this update. The controller is trying to update the pod's labels while the Kubelet is simultaneously updating the pod's status. Separate issue where could use .Patch()

Step Delta (ms) Method Target Resource Result / Notes
1 0 PATCH sandboxclaims/agent-claim-4 🟢 Test Runner injects Claim
2 17 UPDATE sandboxes/warmpool-0-6qkhx 🟢 Claim Controller adopts Sandbox
3 28 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found
4 32 CREATE events/agent-claim-4... 🟢 K8s records adoption event
5 42 CREATE sandboxes 🟢 WarmPool orders replacement
6 45 UPDATE sandboxclaims/agent-claim-4/status 🟢 Claim Controller updates status
7 56 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found (Retry)
8 58 PATCH sandboxwarmpools/warmpool-0/status 🟢 WarmPool updates pool status
9 67 UPDATE sandboxclaims/agent-claim-4/status 🔴 409 Conflict (Claim collides w/ itself)
10 68 UPDATE pods/warmpool-0-6qkhx 🟢 Sandbox Controller updates pod
11 71 PATCH sandboxes/warmpool-0-qcblt 🟢 WarmPool configures replacement
12 77 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found (Retry)
13 85 PATCH sandboxwarmpools/warmpool-0/status 🟢 WarmPool updates pool status
14 90 PATCH sandboxes/warmpool-0-6qkhx/status 🟢 SUCCESS: Sandbox Status patched
15 97 PATCH sandboxwarmpools/warmpool-0/status 🟢 WarmPool updates pool status
16 111 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found (Retry)
17 127 CREATE pods/warmpool-0-qcblt 🟢 Sandbox Controller provisions pod
18 129 UPDATE sandboxclaims/agent-claim-4/status 🟢 Retry Succeeds: Claim status resolves
19 132 UPDATE pods/warmpool-0-6qkhx 🟢 Sandbox Controller updates pod
20 147 PATCH sandboxes/warmpool-0-6qkhx/status 🟢 SUCCESS: Sandbox Status patched again
21 150 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found (Retry)
22 152 PATCH sandboxes/warmpool-0-qcblt 🟢 WarmPool configures replacement
23 166 CREATE services/warmpool-0-qcblt 🟢 Service provisioned
24 191 UPDATE pods/warmpool-0-6qkhx 🟢 Sandbox Controller updates pod
25 193 PATCH sandboxes/warmpool-0-qcblt/status 🟢 SUCCESS: New Sandbox Status patched
26 244 UPDATE pods/warmpool-0-qcblt 🔴 409 Conflict (Sandbox collides w/ Kubelet)
27 264 PATCH sandboxes/warmpool-0-qcblt/status 🟢 SUCCESS: New Sandbox Status patched again
28 312 UPDATE pods/warmpool-0-qcblt 🟢 Sandbox Controller updates pod
29 361 UPDATE pods/warmpool-0-qcblt 🟢 Sandbox Controller updates pod
30 380 PATCH sandboxes/warmpool-0-qcblt/status 🟢 SUCCESS: New Sandbox Status patched again
31 715 UPDATE pods/warmpool-0-qcblt 🟢 Sandbox Controller updates pod
32 734 PATCH sandboxes/warmpool-0-qcblt/status 🟢 SUCCESS: New Sandbox Status patched again
33 758 PATCH sandboxwarmpools/warmpool-0/status 🟢 WarmPool updates pool status
34 796 UPDATE pods/warmpool-0-qcblt 🟢 Sandbox Controller updates pod

With this PR:

Using Patch: It fires four separate status updates to two different sandboxes (6, 9, 14, 18, and 21.). Even though the rest of the cluster is actively changing metadata and provisioning resources, every single one of those patches succeeds with zero conflicts.

Step Delta (ms) Method Target Resource Result / Notes
1 0 PATCH sandboxclaims/agent-claim-4 🟢 Start: Test Runner injects Claim 4
2 19 UPDATE sandboxes/warmpool-0-8wljs 🟢 Claim Controller adopts Sandbox
3 31 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found
4 37 CREATE events/agent-claim-4.18a... 🟢 K8s records Sandbox adoption event
5 45 CREATE sandboxes 🟢 WarmPool orders replacement Sandbox
6 49 PATCH sandboxclaims/agent-claim-4/status 🟢 SUCCESS: Claim Status Patched
7 60 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found (Reconcile retry)
8 63 PATCH sandboxwarmpools/warmpool-0/status 🟢 WarmPool updates pool status
9 70 PATCH sandboxclaims/agent-claim-4/status 🟢 SUCCESS: Claim Status Patched
10 74 PATCH sandboxes/warmpool-0-7drtx 🟢 WarmPool configures replacement sandbox
11 81 UPDATE pods/warmpool-0-8wljs 🟢 Sandbox Controller updates adopted pod
12 81 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found (Reconcile retry)
13 92 PATCH sandboxwarmpools/warmpool-0/status 🟢 WarmPool updates pool status
14 100 PATCH sandboxes/warmpool-0-8wljs/status 🟢 Sandbox Controller updates status
15 116 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found (Reconcile retry)
16 116 PATCH sandboxwarmpools/warmpool-0/status 🟢 WarmPool updates pool status
17 126 CREATE pods/warmpool-0-7drtx 🟢 Sandbox Controller provisions replacement pod
18 136 PATCH sandboxclaims/agent-claim-4/status 🟢 SUCCESS: Claim Status Patched
19 140 UPDATE pods/warmpool-0-8wljs 🟢 Sandbox Controller updates adopted pod
20 145 PATCH sandboxes/warmpool-0-7drtx 🟢 WarmPool configures replacement sandbox
21 153 PATCH sandboxes/warmpool-0-8wljs/status 🟢 Sandbox Controller updates status
22 156 DELETE networkpolicies/agent-claim-4-network-policy 🟡 404 Not Found (Reconcile retry)
23 159 CREATE services/warmpool-0-7drtx 🟢 Service provisioned for new pod
24 177 PATCH sandboxes/warmpool-0-7drtx/status 🟢 Sandbox Controller updates new pod status
25 186 UPDATE pods/warmpool-0-8wljs 🟢 Sandbox Controller updates adopted pod
26 217 UPDATE pods/warmpool-0-7drtx 🔴 409 Conflict (Sandbox Controller collides w/ Kubelet on Pod)
27 236 PATCH sandboxes/warmpool-0-7drtx/status 🟢 Sandbox Controller updates status
28 279 UPDATE pods/warmpool-0-7drtx 🟢 Retry Succeeds: Pod updated.
29 337 PATCH sandboxes/warmpool-0-7drtx/status 🟢 Sandbox Controller updates status
30 932 UPDATE pods/warmpool-0-7drtx 🟢 Sandbox Controller updates replacement pod
31 950 PATCH sandboxes/warmpool-0-7drtx/status 🟢 Sandbox Controller updates status
32 975 PATCH sandboxwarmpools/warmpool-0/status 🟢 WarmPool updates pool status
33 999 UPDATE pods/warmpool-0-7drtx 🟢 Sandbox Controller updates replacement pod

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to reduce high-frequency Operation cannot be fulfilled... write conflicts by switching SandboxClaim status updates from Status().Update(...) to a merge Status().Patch(...) in the SandboxClaim controller reconcile flow.

Changes:

  • Replace r.Status().Update(ctx, claim) with r.Status().Patch(ctx, claim, client.MergeFrom(oldClaim)) for SandboxClaim status updates.
  • Add additional status patch scaffolding (oldClaim copy) and a success log line.
Comment thread extensions/controllers/sandboxclaim_controller.go Outdated
Comment thread extensions/controllers/sandboxclaim_controller.go Outdated
@bittermandel

Copy link
Copy Markdown
Contributor

Any chance we can merge this? I believe I'm hitting some issues as we dont immediately retry on failed update, so some claims become ready quite slow

Comment thread extensions/controllers/sandboxclaim_controller.go Outdated
@vicentefb

Copy link
Copy Markdown
Member Author

I can update the PR, in the meantime do you mind please sharing under what circumstances are you seeing your issue of:

we dont immediately retry on failed update, so some claims become ready quite slow

@bittermandel thanks!

@barney-s

Copy link
Copy Markdown
Collaborator

Have we identified the rootcause of the conflict in the first place @vicentefb
We know the patch helps but what is causing it in the first place.

@bittermandel

Copy link
Copy Markdown
Contributor

I can update the PR, in the meantime do you mind please sharing under what circumstances are you seeing your issue of:

we dont immediately retry on failed update, so some claims become ready quite slow

@bittermandel thanks!

we return the err here immediately with an empty result, so there's no guaranteed reconcile immediately after.
https://github.com/vicentefb/agent-sandbox/blob/203902d7f295e2e5ea23e717149db716ac9ff1f9/extensions/controllers/sandboxclaim_controller.go#L165

@bittermandel

Copy link
Copy Markdown
Contributor

Have we identified the rootcause of the conflict in the first place @vicentefb We know the patch helps but what is causing it in the first place.

I believe this comes from us initializing annotations here, then doing an update immediately with perhaps a state object: https://github.com/vicentefb/agent-sandbox/blob/cff4413a8d9e72dcefacf1fb03b03474c944faf1/extensions/controllers/sandboxclaim_controller.go#L169C2-L172C3

@barney-s

Copy link
Copy Markdown
Collaborator

I can update the PR, in the meantime do you mind please sharing under what circumstances are you seeing your issue of:

we dont immediately retry on failed update, so some claims become ready quite slow

@bittermandel thanks!

we return the err here immediately with an empty result, so there's no guaranteed reconcile immediately after. https://github.com/vicentefb/agent-sandbox/blob/203902d7f295e2e5ea23e717149db716ac9ff1f9/extensions/controllers/sandboxclaim_controller.go#L165

the controller-runtime requeues on error. But it is subject to back off .. so it would retry.
Patch is definitely preferable just as long as we understand where the conflict came from.

@vicentefb

vicentefb commented May 18, 2026

Copy link
Copy Markdown
Member Author

1: initializeAnnotations() patches annotations using r.Patch(). While this updates the resourceVersion in the API server, the controller's cached informer does not immediately reflect this update (there is always a cache lag of a few milliseconds).
2: If a warm sandbox adoption happens, adoptSandboxFromCandidates() calls r.Update(ctx, claim). Because the local claim object in memory might still reference the old resourceVersion (or be slightly out of sync with the cache), this Update is susceptible to a conflict.
3: Finally, updateStatus() called r.Status().Update(). Since the object had already been modified in the previous steps (or by concurrent events), the memory representation's resourceVersion was almost guaranteed to be out of sync with the server's version, resulting in a conflict.

this issue happens on single resources as well, (not necessarily at scale).

return err
}

logger.V(1).Info("Successfully patched sandboxclaim status",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we level at a lower level ?

@bittermandel any suggestions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to 4

updated PR based on copilot and community comments

changed logging level
@barney-s

Copy link
Copy Markdown
Collaborator

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: barney-s, vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0498c10 into kubernetes-sigs:main May 19, 2026
11 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Agent Sandbox May 19, 2026
geojaz added a commit to geojaz/agent-sandbox that referenced this pull request May 30, 2026
…icate

Under burst load (for example when a warm pool drains and re-mints many
sandboxes at once), the Sandbox controller issues many status writes in
quick succession. Status().Update() carries a ResourceVersion precondition,
so when the controller's cache lags its own writes the cached
ResourceVersion is stale and the API server rejects the write with a 409
Conflict. The controller requeues and retries, and the conflict/retry storm
starves forward progress.

This re-lands the sandbox status patch from kubernetes-sigs#509 (reverted in kubernetes-sigs#526) together
with the reconcile predicate from kubernetes-sigs#532 that the revert showed was also
needed, and adds write observability:

- Write Sandbox status with a non-optimistic merge patch (client.MergeFrom)
  instead of Update, removing the ResourceVersion precondition. This matches
  the patterns already merged for the claim (kubernetes-sigs#508) and warm pool (kubernetes-sigs#387)
  controllers.
- Add a Pod watch predicate that reconciles only on Phase, Ready, or PodIPs
  changes, suppressing the redundant reconciles the Owns(Pod) watch generates
  on every Kubelet phase transition. The PodIPs clause prevents a
  PodReady-before-PodIPs race from leaving status stuck at "Ready but no
  podIPs yet"; AnnotationChangedPredicate is intentionally omitted because the
  controller patches its own annotations.
- Treat NotFound on status and annotation writes as a no-op instead of an
  error, so objects deleted mid-reconcile no longer cause requeues.
- Add per-write metrics (agent_sandbox_writes_total,
  agent_sandbox_write_duration_seconds) labeled by controller, action,
  subresource, verb, and result, to attribute conflicts and 404s to a
  specific write path.

Measured on a GKE cluster (150 warm pool, 10 QPS burst, c4 nodes): status
write conflicts dropped from about 43/s (38% of status writes) to 0, and a
150-pool / 10 QPS burst completed 600/600 claims with zero failures at
startup p50 0.83s, where the prior Update-based build conflict-stormed.

Adds unit tests covering the stale-ResourceVersion conflict versus merge
patch, the deleted-object no-requeue behavior, and the Pod predicate.
geojaz added a commit to geojaz/agent-sandbox that referenced this pull request Jun 2, 2026
…icate

Under burst load (for example when a warm pool drains and re-mints many
sandboxes at once), the Sandbox controller issues many status writes in
quick succession. Status().Update() carries a ResourceVersion precondition,
so when the controller's cache lags its own writes the cached
ResourceVersion is stale and the API server rejects the write with a 409
Conflict. The controller requeues and retries, and the conflict/retry storm
starves forward progress.

This re-lands the sandbox status patch from kubernetes-sigs#509 (reverted in kubernetes-sigs#526) together
with the reconcile predicate from kubernetes-sigs#532 that the revert showed was also
needed, and adds write observability:

- Write Sandbox status with a non-optimistic merge patch (client.MergeFrom)
  instead of Update, removing the ResourceVersion precondition. This matches
  the patterns already merged for the claim (kubernetes-sigs#508) and warm pool (kubernetes-sigs#387)
  controllers.
- Add a Pod watch predicate that reconciles only on Phase, Ready, or PodIPs
  changes, suppressing the redundant reconciles the Owns(Pod) watch generates
  on every Kubelet phase transition. The PodIPs clause prevents a
  PodReady-before-PodIPs race from leaving status stuck at "Ready but no
  podIPs yet"; AnnotationChangedPredicate is intentionally omitted because the
  controller patches its own annotations.
- Treat NotFound on status and annotation writes as a no-op instead of an
  error, so objects deleted mid-reconcile no longer cause requeues.
- Add per-write metrics (agent_sandbox_writes_total,
  agent_sandbox_write_duration_seconds) labeled by controller, action,
  subresource, verb, and result, to attribute conflicts and 404s to a
  specific write path.

Measured on a GKE cluster (150 warm pool, 10 QPS burst, c4 nodes): status
write conflicts dropped from about 43/s (38% of status writes) to 0, and a
150-pool / 10 QPS burst completed 600/600 claims with zero failures at
startup p50 0.83s, where the prior Update-based build conflict-stormed.

Adds unit tests covering the stale-ResourceVersion conflict versus merge
patch, the deleted-object no-requeue behavior, and the Pod predicate.
khirotaka pushed a commit to khirotaka/agent-sandbox that referenced this pull request Jun 12, 2026
…ubernetes-sigs#508)

updated PR based on copilot and community comments

changed logging level
alexatakvelon pushed a commit to volatilemolotov/agent-sandbox that referenced this pull request Jun 24, 2026
…ubernetes-sigs#508)

updated PR based on copilot and community comments

changed logging level
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ready-for-review size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

7 participants