Option for snapshotClass/Claim for suspend and resume#1077
Option for snapshotClass/Claim for suspend and resume#1077ryanzhang-oss wants to merge 8 commits into
Conversation
Co-authored-by: Janet Kuo <chiachenk@google.com>
Signed-off-by: Ryan Zhang <zhangryan@microsoft.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ryanzhang-oss The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
✅ Deploy Preview for agent-sandbox canceled.
|
✅ Deploy Preview for agent-sandbox canceled.
|
|
📝 WalkthroughWalkthroughAdds a new Kubernetes Enhancement Proposal (KEP-0694) consisting of a metadata file and design README describing Suspend/Resume support for Agent Sandboxes, including operating modes, suspension strategies, two alternative snapshot API designs, a SnapshotDriver gRPC interface, status model, and resume reconciliation flow. ChangesKEP-0694 Documentation
Estimated code review effort: 2 (Simple) | ~15 minutes Related Issues: No related issues referenced. Related PRs: No related PRs referenced. Suggested labels: kep, documentation, sig-apps Suggested reviewers: No reviewer information available in the provided context. 🐰 A poem for the KEPA rabbit dreams beside its burrow deep, 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Adds a new KEP under docs/keps/ proposing a snapshot provider API (with two alternative designs) to support Sandbox suspend/resume via snapshotting, including API sketches and a proposed gRPC driver contract.
Changes:
- Introduces a new KEP README detailing motivation, phased rollout, two snapshot API options (SnapshotProvider vs SnapshotClass/Claim/Template), and a gRPC interface proposal.
- Adds KEP metadata (
kep.yaml) for the new KEP directory.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md | New KEP document describing suspend/resume snapshot API options and proposed driver interface. |
| docs/keps/694-suspend-and-resume-plus-snapshot-api/kep.yaml | New KEP metadata file for the proposed KEP. |
| ## Non-Goals | ||
|
|
||
| * Implementing the underlying snapshotting mechanisms natively inside the controller (these will be delegated to specific runtime drivers). | ||
| * Implementing an exhaustive list of extendible features like Snapshot retention, Garbage collection, Networking state, configuring auto suspend policies etc. |
|
|
||
| When `spec.operatingMode` is set to `Running` (or omitted), the controller executes the following logic sequence during its reconciliation loop: | ||
|
|
||
| - **Observe Current State:** The controller queries the API server to check for the existence of the underlying runner Pod associated with the AgentSandbox. |
| title: Suspend And Resume + Snapshot Provider API | ||
| kep-number: 694 |
| - **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller reads the `snapshotClassName` configuration referenced in `spec.suspensionStrategy` to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell. | ||
| - **Cold Boot (Stop / Default):** If no snapshot exists, the controller invokes the Pod driver to construct a fresh runner Pod shell from scratch using the Sandbox's original `spec.podTemplate` and binds it to the existing Persistent Volume Claims (PVCs). |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md`:
- Around line 759-765: The resume flow in the Running reconciliation logic
currently pulls snapshot configuration directly from `spec.suspensionStrategy`,
but the strategy schema does not define `snapshotClassName`. Update the restore
path in the controller logic to first resolve the selected backend object,
either the relevant `SnapshotProvider` or `SnapshotClaim`, and then read that
object’s configuration for restore parameters. Make this change in the
reconciliation sequence that handles snapshot restore versus cold boot so the
restore step uses the backend object instead of an unsupported strategy field.
- Around line 236-240: The diagram text uses a non-existent SnapshotProviderSpec
property name; update the controller response description to match the schema
field. Replace the reference to spec.s3Config with SnapshotProviderSpec.s3 in
the relevant sequence step so the docs align with the actual API model.
- Around line 210-223: The Sandbox YAML example is presented as if it were valid
current v1beta1 API, but it uses draft fields like spec.suspensionStrategy and
hibernate.snapshotProviderRef that are not in the CRD. Update the example text
around the Sandbox manifest to clearly label it as proposed/draft future schema,
and apply the same disclaimer to the later Option B examples so readers do not
copy invalid current-api YAML. Use the surrounding Sandbox and Option B example
sections to make the schema status explicit without changing the illustrative
structure.
- Line 566: The current wording incorrectly merges the two backend-selection
flows: Option A resolves a SnapshotProvider directly, while Option B goes
through a resolved SnapshotClaim. Update the README section around the
SnapshotClaim/SnapshotClaimTemplate discussion to keep these paths separate,
explicitly describing Option A as provider resolution and Option B as
claim-based resolution. Reference the SnapshotClaim, SnapshotClaimTemplate, and
SnapshotProvider terms so the distinction stays clear and the architecture flow
does not imply all checkpoint paths resolve through a claim.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8ea991f5-c31b-42b0-a30d-6da28fd2fcb6
📒 Files selected for processing (2)
docs/keps/694-suspend-and-resume-plus-snapshot-api/README.mddocs/keps/694-suspend-and-resume-plus-snapshot-api/kep.yaml
| **Developer/Agent references the provider in Sandbox:** | ||
| ```yaml | ||
| apiVersion: agents.x-k8s.io/v1beta1 | ||
| kind: Sandbox | ||
| metadata: | ||
| name: billing-agent-42 | ||
| spec: | ||
| operatingMode: "Suspended" | ||
| suspensionStrategy: | ||
| type: "Hibernate" | ||
| hibernate: | ||
| snapshotProviderRef: | ||
| name: "s3-fast" | ||
| ``` |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Mark the example as proposed schema, not current API.
spec.suspensionStrategy and hibernate.snapshotProviderRef are not present in the current Sandbox v1beta1 CRD, so this YAML will fail validation if copied as-is. Please label it explicitly as draft/future schema, and apply the same caveat to the later Option B examples.
📝 Proposed fix
+> **Proposed schema:** the following fields are not available in the current `Sandbox` v1beta1 CRD.
```yaml
apiVersion: agents.x-k8s.io/v1beta1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Developer/Agent references the provider in Sandbox:** | |
| ```yaml | |
| apiVersion: agents.x-k8s.io/v1beta1 | |
| kind: Sandbox | |
| metadata: | |
| name: billing-agent-42 | |
| spec: | |
| operatingMode: "Suspended" | |
| suspensionStrategy: | |
| type: "Hibernate" | |
| hibernate: | |
| snapshotProviderRef: | |
| name: "s3-fast" | |
| ``` | |
| > **Proposed schema:** the following fields are not available in the current `Sandbox` v1beta1 CRD. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md` around lines
210 - 223, The Sandbox YAML example is presented as if it were valid current
v1beta1 API, but it uses draft fields like spec.suspensionStrategy and
hibernate.snapshotProviderRef that are not in the CRD. Update the example text
around the Sandbox manifest to clearly label it as proposed/draft future schema,
and apply the same disclaimer to the later Option B examples so readers do not
copy invalid current-api YAML. Use the surrounding Sandbox and Option B example
sections to make the schema status explicit without changing the illustrative
structure.
Source: Path instructions
| User->>KubeAPIServer: Update Sandbox (operatingMode = Suspended, providerRef = s3-fast) | ||
| Controller->>KubeAPIServer: Watch Event: Sandbox updated | ||
| Controller->>KubeAPIServer: Fetch SnapshotProvider 's3-fast' | ||
| KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3Config) | ||
| Controller->>Driver: Route checkpoint request with S3 configurations |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Use the field name from the schema.
SnapshotProviderSpec defines s3, not s3Config, so this diagram currently points at a property that does not exist.
💡 Proposed fix
- KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3Config)
+ KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| User->>KubeAPIServer: Update Sandbox (operatingMode = Suspended, providerRef = s3-fast) | |
| Controller->>KubeAPIServer: Watch Event: Sandbox updated | |
| Controller->>KubeAPIServer: Fetch SnapshotProvider 's3-fast' | |
| KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3Config) | |
| Controller->>Driver: Route checkpoint request with S3 configurations | |
| User->>KubeAPIServer: Update Sandbox (operatingMode = Suspended, providerRef = s3-fast) | |
| Controller->>KubeAPIServer: Watch Event: Sandbox updated | |
| Controller->>KubeAPIServer: Fetch SnapshotProvider 's3-fast' | |
| KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3) | |
| Controller->>Driver: Route checkpoint request with S3 configurations |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md` around lines
236 - 240, The diagram text uses a non-existent SnapshotProviderSpec property
name; update the controller response description to match the schema field.
Replace the reference to spec.s3Config with SnapshotProviderSpec.s3 in the
relevant sequence step so the docs align with the actual API model.
Source: Path instructions
| Controller->>KubeAPIServer: Update Sandbox Status (Condition Suspended = True) | ||
| ``` | ||
|
|
||
| > A `SnapshotClaim` referenced by `snapshotClaimName` is **shared**: every Sandbox that names it points at the same claim and the same storage target. A `SnapshotClaimTemplate` referenced by `snapshotClaimTemplateName` is **generative**: the controller stamps out a distinct `SnapshotClaim` per Sandbox by copying the template's `spec` verbatim, sets the Sandbox as its owner (so it is garbage-collected with the Sandbox), and lets the driver derive a per-Sandbox storage key from the Sandbox identity. Either way the checkpoint path only ever reads a resolved `SnapshotClaim`, so the class-resolution flow above is identical — the only difference is whether that claim is shared or generated. This mirrors DRA's `resourceClaimName` (shared) vs `resourceClaimTemplateName` (one generated `ResourceClaim` per Pod) distinction. |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Keep Option A and Option B resolution paths distinct.
This note says every checkpoint path reads a resolved SnapshotClaim, but Option A resolves SnapshotProvider directly and never introduces a claim. As written, the architecture flow collapses two different backend-selection paths into one.
🛠 Proposed fix
- Either way the checkpoint path only ever reads a resolved SnapshotClaim, so the class-resolution flow above is identical — the only difference is whether that claim is shared or generated.
+ Either way the checkpoint path first resolves the backend configuration object, then performs checkpointing. Option A resolves a SnapshotProvider directly; Option B resolves a SnapshotClaim (shared or generated) and then its SnapshotClass.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| > A `SnapshotClaim` referenced by `snapshotClaimName` is **shared**: every Sandbox that names it points at the same claim and the same storage target. A `SnapshotClaimTemplate` referenced by `snapshotClaimTemplateName` is **generative**: the controller stamps out a distinct `SnapshotClaim` per Sandbox by copying the template's `spec` verbatim, sets the Sandbox as its owner (so it is garbage-collected with the Sandbox), and lets the driver derive a per-Sandbox storage key from the Sandbox identity. Either way the checkpoint path only ever reads a resolved `SnapshotClaim`, so the class-resolution flow above is identical — the only difference is whether that claim is shared or generated. This mirrors DRA's `resourceClaimName` (shared) vs `resourceClaimTemplateName` (one generated `ResourceClaim` per Pod) distinction. | |
| > A `SnapshotClaim` referenced by `snapshotClaimName` is **shared**: every Sandbox that names it points at the same claim and the same storage target. A `SnapshotClaimTemplate` referenced by `snapshotClaimTemplateName` is **generative**: the controller stamps out a distinct `SnapshotClaim` per Sandbox by copying the template's `spec` verbatim, sets the Sandbox as its owner (so it is garbage-collected with the Sandbox), and lets the driver derive a per-Sandbox storage key from the Sandbox identity. Either way the checkpoint path first resolves the backend configuration object, then performs checkpointing. Option A resolves a `SnapshotProvider` directly; Option B resolves a `SnapshotClaim` (shared or generated) and then its `SnapshotClass`. This mirrors DRA's `resourceClaimName` (shared) vs `resourceClaimTemplateName` (one generated `ResourceClaim` per Pod) distinction. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md` at line 566,
The current wording incorrectly merges the two backend-selection flows: Option A
resolves a SnapshotProvider directly, while Option B goes through a resolved
SnapshotClaim. Update the README section around the
SnapshotClaim/SnapshotClaimTemplate discussion to keep these paths separate,
explicitly describing Option A as provider resolution and Option B as
claim-based resolution. Reference the SnapshotClaim, SnapshotClaimTemplate, and
SnapshotProvider terms so the distinction stays clear and the architecture flow
does not imply all checkpoint paths resolve through a claim.
Source: Path instructions
| When `spec.operatingMode` is set to `Running` (or omitted), the controller executes the following logic sequence during its reconciliation loop: | ||
|
|
||
| - **Observe Current State:** The controller queries the API server to check for the existence of the underlying runner Pod associated with the AgentSandbox. | ||
| - **Evaluate Restore Necessity:** If no active runner Pod is found, the controller checks if a snapshot of the memory/filesystem state exists for this Sandbox. | ||
| - **Reconcile Discrepancy (Restore or Boot):** | ||
| - **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller reads the `snapshotClassName` configuration referenced in `spec.suspensionStrategy` to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell. | ||
| - **Cold Boot (Stop / Default):** If no snapshot exists, the controller invokes the Pod driver to construct a fresh runner Pod shell from scratch using the Sandbox's original `spec.podTemplate` and binds it to the existing Persistent Volume Claims (PVCs). |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Resolve the backend object before restore.
The resume flow currently reads snapshotClassName from spec.suspensionStrategy, but that field is not part of the proposed strategy schema. The controller needs to resolve the selected SnapshotProvider or SnapshotClaim first, then use that object’s configuration.
🧭 Proposed fix
- - **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller reads the `snapshotClassName` configuration referenced in `spec.suspensionStrategy` to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell.
+ - **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller resolves the selected `SnapshotProvider` (Option A) or `SnapshotClaim`/`SnapshotClass` pair (Option B) to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| When `spec.operatingMode` is set to `Running` (or omitted), the controller executes the following logic sequence during its reconciliation loop: | |
| - **Observe Current State:** The controller queries the API server to check for the existence of the underlying runner Pod associated with the AgentSandbox. | |
| - **Evaluate Restore Necessity:** If no active runner Pod is found, the controller checks if a snapshot of the memory/filesystem state exists for this Sandbox. | |
| - **Reconcile Discrepancy (Restore or Boot):** | |
| - **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller reads the `snapshotClassName` configuration referenced in `spec.suspensionStrategy` to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell. | |
| - **Cold Boot (Stop / Default):** If no snapshot exists, the controller invokes the Pod driver to construct a fresh runner Pod shell from scratch using the Sandbox's original `spec.podTemplate` and binds it to the existing Persistent Volume Claims (PVCs). | |
| When `spec.operatingMode` is set to `Running` (or omitted), the controller executes the following logic sequence during its reconciliation loop: | |
| - **Observe Current State:** The controller queries the API server to check for the existence of the underlying runner Pod associated with the AgentSandbox. | |
| - **Evaluate Restore Necessity:** If no active runner Pod is found, the controller checks if a snapshot of the memory/filesystem state exists for this Sandbox. | |
| - **Reconcile Discrepancy (Restore or Boot):** | |
| - **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller resolves the selected `SnapshotProvider` (Option A) or `SnapshotClaim`/`SnapshotClass` pair (Option B) to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell. | |
| - **Cold Boot (Stop / Default):** If no snapshot exists, the controller invokes the Pod driver to construct a fresh runner Pod shell from scratch using the Sandbox's original `spec.podTemplate` and binds it to the existing Persistent Volume Claims (PVCs). |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md` around lines
759 - 765, The resume flow in the Running reconciliation logic currently pulls
snapshot configuration directly from `spec.suspensionStrategy`, but the strategy
schema does not define `snapshotClassName`. Update the restore path in the
controller logic to first resolve the selected backend object, either the
relevant `SnapshotProvider` or `SnapshotClaim`, and then read that object’s
configuration for restore parameters. Make this change in the reconciliation
sequence that handles snapshot restore versus cold boot so the restore step uses
the backend object instead of an unsupported strategy field.
Source: Path instructions
What this PR does / why we need it:
Which issue(s) this PR is related to:
Release Note
Summary by CodeRabbit