api: Remove spec.replicas and introduce spec.operatingMode for suspend and resume #801
Pull request overview
This PR removes the spec.replicas field (and the corresponding status.replicas / status.selector and scale subresource) from the v1alpha1 Sandbox API and replaces it with an explicit spec.mode enum (Running | Suspended, default Running). The reconciler, extension controllers, generated CRDs, Python SDK suspend/resume logic, e2e tests, docs, and the roadmap are all updated to use the new mode-based vocabulary. This is explicitly called out as a breaking change.
Changes:
- Introduce `SandboxMode` type with `Running`/`Suspended` constants; remove `Replicas`, `LabelSelector`, and the scale subresource from the Sandbox API and generated CRDs.
- Update the Sandbox controller, SandboxClaim controller, SandboxWarmPool controller, and their tests to set/check `Spec.Mode` instead of `Spec.Replicas`, including new condition message "Pod does not exist, mode is Suspended".
- Update the Python `gke_extensions` snapshot support (`is_suspended`, `suspend`, `resume` and tests/README) to patch `spec.mode` instead of `spec.replicas`, and refresh comments in `sandbox.py`/`async_sandbox.py` and the roadmap.
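The controller behavior summarized above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the controller's actual code: `decide`, `podAction`, and the constant names are hypothetical, and creating a pod when the mode is `Running` and no pod exists is assumed rather than stated in this review. The condition message is the one quoted in the summary.

```go
package main

import "fmt"

// SandboxMode mirrors the enum named in the PR summary.
type SandboxMode string

const (
	// Constant names are hypothetical; only the values come from the PR.
	ModeRunning   SandboxMode = "Running"
	ModeSuspended SandboxMode = "Suspended"
)

// podAction is a hypothetical helper type describing what to do with the pod.
type podAction string

const (
	actionNone   podAction = "none"
	actionCreate podAction = "create"
	actionDelete podAction = "delete"
)

// decide returns the pod action and an optional condition message
// for a given mode and pod state.
func decide(mode SandboxMode, podExists bool) (podAction, string) {
	if mode == "" {
		// An unset mode is treated as Running, matching the CRD default.
		mode = ModeRunning
	}
	switch {
	case mode == ModeSuspended && podExists:
		return actionDelete, ""
	case mode == ModeSuspended:
		return actionNone, "Pod does not exist, mode is Suspended"
	case !podExists:
		// Assumption: a Running sandbox without a pod gets one created.
		return actionCreate, ""
	default:
		return actionNone, ""
	}
}

func main() {
	action, msg := decide(ModeSuspended, false)
	fmt.Println(action, msg)
}
```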
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| api/v1alpha1/sandbox_types.go | Adds SandboxMode enum and Spec.Mode; removes Spec.Replicas, Status.Replicas, Status.LabelSelector, and the scale subresource marker. |
| api/v1alpha1/zz_generated.deepcopy.go | Drops the now-removed Replicas deepcopy block. |
| k8s/crds/agents.x-k8s.io_sandboxes.yaml, helm/crds/agents.x-k8s.io_sandboxes.yaml | Regenerated CRDs reflecting the new schema (no replicas/selector, no scale subresource, new mode enum with default Running). |
| controllers/sandbox_controller.go | Defaults Spec.Mode; deletes pod when Mode == Suspended; stops populating Status.Replicas/Status.LabelSelector; updates log/condition messages; removes k8s.io/utils/ptr import. |
| controllers/sandbox_controller_test.go | Test cases switched to Mode: Running/Suspended and expected statuses no longer assert Replicas/LabelSelector. |
| extensions/controllers/sandboxclaim_controller.go | Sets sandbox.Spec.Mode = Running instead of the old replicas workaround. |
| extensions/controllers/sandboxclaim_controller_test.go, sandboxclaim_pod_exclusivity_test.go, sandboxwarmpool_controller.go, sandboxwarmpool_controller_test.go | Updated to use Mode in test fixtures and pool sandbox creation. |
| test/e2e/basic_test.go, shutdown_test.go, volumeclaimtemplate_test.go, mode_test.go | Removed assertions on Replicas/LabelSelector; renamed TestSandboxReplicas → TestSandboxMode and updated suspend flow. |
| clients/python/.../sandbox_with_snapshot_support.py + tests + README | _set_replicas → _set_mode; is_suspended reads spec.mode; messages and docs updated. |
| clients/python/.../sandbox.py, async_sandbox.py | Comment updates referring to spec.mode instead of spec.replicas. |
| roadmap.md | Wording change from "replicas scale to 0" to "mode is set to Suspended/Running". |
Files not reviewed (1)
- api/v1alpha1/zz_generated.deepcopy.go: Language not supported
Comments suppressed due to low confidence (1)
test/e2e/mode_test.go:11
- Two lines of the standard Apache 2.0 license header were accidentally deleted in this file. The current header now jumps from "...distributed on an "AS IS" BASIS," straight to "limitations under the License." — the lines "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied." and "See the License for the specific language governing permissions and" need to be restored. This deletion is unrelated to the spec.replicas → spec.mode rename and should be reverted.
Upgrading from Alpha to Beta is designed to be seamless for end-users, relying heavily on native Kubernetes API defaulting mechanisms to prevent disruption.

1. **CRD Update:** The cluster administrator applies the updated `Sandbox` CRD containing the new `spec.operatingMode` Enum field.
2. **Defaulting Behavior:** Because the `spec.operatingMode` field is defined with `// +kubebuilder:default=Running`, all existing Sandbox resources in the cluster will automatically be treated as `Running` by the API server.
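For illustration, the defaulting described in step 2 might look roughly like the following in the API types. This is a minimal sketch, not the PR's actual `sandbox_types.go`: the constant names and the `effectiveMode` helper are hypothetical, and only the field under discussion is shown. The `+kubebuilder` markers are the standard controller-gen enum and default markers.

```go
package main

import "fmt"

// SandboxMode is the enum type named in the PR description.
// +kubebuilder:validation:Enum=Running;Suspended
type SandboxMode string

const (
	// Constant names are hypothetical; the values come from the PR.
	ModeRunning   SandboxMode = "Running"
	ModeSuspended SandboxMode = "Suspended"
)

// SandboxSpec shows only the operatingMode field; all other fields are omitted.
type SandboxSpec struct {
	// +kubebuilder:default=Running
	// +optional
	OperatingMode SandboxMode `json:"operatingMode,omitempty"`
}

// effectiveMode is a hypothetical helper that applies the same default
// in code, for objects that arrive without the field set.
func effectiveMode(spec SandboxSpec) SandboxMode {
	if spec.OperatingMode == "" {
		return ModeRunning
	}
	return spec.OperatingMode
}

func main() {
	fmt.Println(effectiveMode(SandboxSpec{}))
	fmt.Println(effectiveMode(SandboxSpec{OperatingMode: ModeSuspended}))
}
```

With a default like this, a Sandbox stored without `operatingMode` reads back as `Running`, which is the case the review comment below asks about for Sandboxes that had been suspended before the upgrade.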
The defaulting argument only holds for Sandboxes that were running at upgrade time. What about Sandboxes a user had explicitly suspended (spec.replicas: 0) before the upgrade?
Very good point.
Because of the automatic defaulting to Running, any Sandbox that was explicitly suspended (spec.replicas: 0) in the Alpha version will be automatically resumed. If administrators or users wish to keep these Sandboxes suspended across the upgrade, they must patch the existing Sandbox resources to explicitly set spec.operatingMode: Suspended prior to upgrading the controller. I have updated the details in the KEP and kept it brief.
The full migration details will be added in their own PR, https://github.com/kubernetes-sigs/agent-sandbox/pull/848/changes, which covers how to delete alpha Sandboxes and how to patch existing Sandboxes correctly.
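As a rough sketch of the patch mentioned above, the snippet below builds a JSON merge patch body that sets `spec.operatingMode: Suspended` and identifies which alpha Sandboxes would need it. The helper names are hypothetical, treating only an explicit `spec.replicas: 0` as "suspended" is an assumption based on this thread, and how the patch is actually sent to the API server is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wasSuspended reports whether an alpha Sandbox was explicitly suspended,
// assuming that means spec.replicas was set to 0 (nil counts as running).
func wasSuspended(replicas *int32) bool {
	return replicas != nil && *replicas == 0
}

// suspendPatch builds a JSON merge patch body that sets
// spec.operatingMode to Suspended.
func suspendPatch() ([]byte, error) {
	return json.Marshal(map[string]any{
		"spec": map[string]any{"operatingMode": "Suspended"},
	})
}

func main() {
	zero := int32(0)
	if wasSuspended(&zero) {
		patch, err := suspendPatch()
		if err != nil {
			panic(err)
		}
		fmt.Println(string(patch))
	}
}
```

The resulting body could be applied with a merge patch, for example through `kubectl patch --type=merge`. The exact resource name and the ordering relative to the CRD update are left to the migration PR linked above.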
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aditya-shantanu, janetkuo, SHRUTI6991, vicentefb
Upstream PR kubernetes-sigs#801 removed SandboxStatus.Replicas in favor of OperatingMode. Replace the mutable status field used in the conflict tests with Status.ServiceFQDN (a string), and add the Apache 2.0 license header required by the boilerplate check.
…d and resume (kubernetes-sigs#801) * Remove spec.replicas and introduce spec.operatingMode to suspend and resume a sandbox. * temp work. * Address comments. * Update migration plan. * Update migration plan. * fix the shutdown time test. * retrigger test.
Working on: #740
Description
Remove `spec.replicas` and introduce `spec.operatingMode` to represent suspend and resume behavior in Sandbox.

Suspension and resume are now represented by a new field, `spec.operatingMode`, which takes the values `Running` and `Suspended`. This is solidified in https://github.com/kubernetes-sigs/agent-sandbox/pull/762/changes. A new KEP added in this PR documents the decisions for `spec.operatingMode`.

The reconciler, extension controllers, generated CRDs, Python SDK suspend/resume logic, e2e tests, docs, and the roadmap are all updated to use the new mode-based vocabulary.
Changes
- Introduce `SandboxMode` type with Running/Suspended constants; remove `Replicas` and the `scale` subresource from the Sandbox API and generated CRDs.
- Set/check `Spec.OperatingMode` instead of `Spec.Replicas`.

Release Notes [Breaking Changes]
Backward Compatibility
We don't have full backward compatibility at this stage. However, existing Sandboxes are handled through defaulting: `spec.operatingMode` defaults to `Running`.