optimization: replace .Update() with .Patch() for claim updateStatus#508
Conversation
✅ Deploy Preview for agent-sandbox canceled.
|
|
Let's have the "are there two controllers acting on the same object" discussion on #509 |
2909e6b to
203902d
Compare
|
I ran two very small tests by having 1 claim with a warmpool of size 2. From oss main: Using Update: It tries to update the status at Also something to note is that there's a pod collision in
With this PR: Using Patch: It fires four separate status updates to two different sandboxes (
|
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce high-frequency Operation cannot be fulfilled... write conflicts by switching SandboxClaim status updates from Status().Update(...) to a merge Status().Patch(...) in the SandboxClaim controller reconcile flow.
Changes:
- Replace
r.Status().Update(ctx, claim)withr.Status().Patch(ctx, claim, client.MergeFrom(oldClaim))for SandboxClaim status updates. - Add additional status patch scaffolding (oldClaim copy) and a success log line.
|
Any chance we can merge this? I believe I'm hitting some issues as we dont immediately retry on failed update, so some claims become ready quite slow |
|
I can update the PR, in the meantime do you mind please sharing under what circumstances are you seeing your issue of:
@bittermandel thanks! |
|
Have we identified the rootcause of the conflict in the first place @vicentefb |
we return the err here immediately with an empty result, so there's no guaranteed reconcile immediately after. |
I believe this comes from us initializing annotations here, then doing an update immediately with perhaps a state object: https://github.com/vicentefb/agent-sandbox/blob/cff4413a8d9e72dcefacf1fb03b03474c944faf1/extensions/controllers/sandboxclaim_controller.go#L169C2-L172C3 |
the controller-runtime requeues on error. But it is subject to back off .. so it would retry. |
203902d to
9f5e79c
Compare
|
1: this issue happens on single resources as well, (not necessarily at scale). |
| return err | ||
| } | ||
|
|
||
| logger.V(1).Info("Successfully patched sandboxclaim status", |
There was a problem hiding this comment.
should we level at a lower level ?
@bittermandel any suggestions.
updated PR based on copilot and community comments changed logging level
9f5e79c to
eab585b
Compare
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: barney-s, vicentefb The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…icate Under burst load (for example when a warm pool drains and re-mints many sandboxes at once), the Sandbox controller issues many status writes in quick succession. Status().Update() carries a ResourceVersion precondition, so when the controller's cache lags its own writes the cached ResourceVersion is stale and the API server rejects the write with a 409 Conflict. The controller requeues and retries, and the conflict/retry storm starves forward progress. This re-lands the sandbox status patch from kubernetes-sigs#509 (reverted in kubernetes-sigs#526) together with the reconcile predicate from kubernetes-sigs#532 that the revert showed was also needed, and adds write observability: - Write Sandbox status with a non-optimistic merge patch (client.MergeFrom) instead of Update, removing the ResourceVersion precondition. This matches the patterns already merged for the claim (kubernetes-sigs#508) and warm pool (kubernetes-sigs#387) controllers. - Add a Pod watch predicate that reconciles only on Phase, Ready, or PodIPs changes, suppressing the redundant reconciles the Owns(Pod) watch generates on every Kubelet phase transition. The PodIPs clause prevents a PodReady-before-PodIPs race from leaving status stuck at "Ready but no podIPs yet"; AnnotationChangedPredicate is intentionally omitted because the controller patches its own annotations. - Treat NotFound on status and annotation writes as a no-op instead of an error, so objects deleted mid-reconcile no longer cause requeues. - Add per-write metrics (agent_sandbox_writes_total, agent_sandbox_write_duration_seconds) labeled by controller, action, subresource, verb, and result, to attribute conflicts and 404s to a specific write path. Measured on a GKE cluster (150 warm pool, 10 QPS burst, c4 nodes): status write conflicts dropped from about 43/s (38% of status writes) to 0, and a 150-pool / 10 QPS burst completed 600/600 claims with zero failures at startup p50 0.83s, where the prior Update-based build conflict-stormed. Adds unit tests covering the stale-ResourceVersion conflict versus merge patch, the deleted-object no-requeue behavior, and the Pod predicate.
…icate Under burst load (for example when a warm pool drains and re-mints many sandboxes at once), the Sandbox controller issues many status writes in quick succession. Status().Update() carries a ResourceVersion precondition, so when the controller's cache lags its own writes the cached ResourceVersion is stale and the API server rejects the write with a 409 Conflict. The controller requeues and retries, and the conflict/retry storm starves forward progress. This re-lands the sandbox status patch from kubernetes-sigs#509 (reverted in kubernetes-sigs#526) together with the reconcile predicate from kubernetes-sigs#532 that the revert showed was also needed, and adds write observability: - Write Sandbox status with a non-optimistic merge patch (client.MergeFrom) instead of Update, removing the ResourceVersion precondition. This matches the patterns already merged for the claim (kubernetes-sigs#508) and warm pool (kubernetes-sigs#387) controllers. - Add a Pod watch predicate that reconciles only on Phase, Ready, or PodIPs changes, suppressing the redundant reconciles the Owns(Pod) watch generates on every Kubelet phase transition. The PodIPs clause prevents a PodReady-before-PodIPs race from leaving status stuck at "Ready but no podIPs yet"; AnnotationChangedPredicate is intentionally omitted because the controller patches its own annotations. - Treat NotFound on status and annotation writes as a no-op instead of an error, so objects deleted mid-reconcile no longer cause requeues. - Add per-write metrics (agent_sandbox_writes_total, agent_sandbox_write_duration_seconds) labeled by controller, action, subresource, verb, and result, to attribute conflicts and 404s to a specific write path. Measured on a GKE cluster (150 warm pool, 10 QPS burst, c4 nodes): status write conflicts dropped from about 43/s (38% of status writes) to 0, and a 150-pool / 10 QPS burst completed 600/600 claims with zero failures at startup p50 0.83s, where the prior Update-based build conflict-stormed. Adds unit tests covering the stale-ResourceVersion conflict versus merge patch, the deleted-object no-requeue behavior, and the Pod predicate.
…ubernetes-sigs#508) updated PR based on copilot and community comments changed logging level
…ubernetes-sigs#508) updated PR based on copilot and community comments changed logging level
In an effort to reduce "Operation cannot be fulfilled..." conflicts at scale, this PR switches to patching to the status of Sandbox Claim resrouce status.
Tests from main without this change indicate:
322 operation cannot be fulfilled conflicts from sandboxclaim (protoPayload.resourceName="pods/sandboxclaim-" OR protoPayload.resourceName="sandboxclaims/")
With this change, 0 conflicts.
Test paramters:
Deployment args