Skip to content

Option for snapshotClass/Claim for suspend and resume#1077

Draft
ryanzhang-oss wants to merge 8 commits into
kubernetes-sigs:mainfrom
ryanzhang-oss:kep_for_suspend_and_resume
Draft

Option for snapshotClass/Claim for suspend and resume#1077
ryanzhang-oss wants to merge 8 commits into
kubernetes-sigs:mainfrom
ryanzhang-oss:kep_for_suspend_and_resume

Conversation

@ryanzhang-oss

@ryanzhang-oss ryanzhang-oss commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Which issue(s) this PR is related to:

Release Note

Summary by CodeRabbit

  • Documentation
    • Added a new KEP describing suspend/resume support for agent sandboxes.
    • Outlined supported operating modes, suspension behaviors, and snapshot-based resume flows.
    • Documented two snapshot API options and the expected driver interactions.
    • Added metadata for the new KEP entry.
Copilot AI review requested due to automatic review settings July 1, 2026 23:40
@github-project-automation github-project-automation Bot moved this to Backlog in Agent Sandbox Jul 1, 2026
@kubernetes-prow kubernetes-prow Bot requested a review from aditya-shantanu July 1, 2026 23:40
@kubernetes-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ryanzhang-oss
Once this PR has been reviewed and has the lgtm label, please assign igooch for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify Bot commented Jul 1, 2026

Copy link
Copy Markdown

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit dc6a676
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a45a569c12a860007760ffb
@kubernetes-prow kubernetes-prow Bot requested a review from justinsb July 1, 2026 23:40
@netlify

netlify Bot commented Jul 1, 2026

Copy link
Copy Markdown

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit dc6a676
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a45a569c12a860007760ffb
@linux-foundation-easycla

Copy link
Copy Markdown

CLA Missing ID

@kubernetes-prow kubernetes-prow Bot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 1, 2026
@ryanzhang-oss ryanzhang-oss marked this pull request as draft July 1, 2026 23:40
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new Kubernetes Enhancement Proposal (KEP-0694) consisting of a metadata file and design README describing Suspend/Resume support for Agent Sandboxes, including operating modes, suspension strategies, two alternative snapshot API designs, a SnapshotDriver gRPC interface, status model, and resume reconciliation flow.

Changes

KEP-0694 Documentation

Layer / File(s) Summary
KEP metadata and introduction
docs/keps/694-suspend-and-resume-plus-snapshot-api/kep.yaml, docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md
Adds KEP metadata (number, authors, status, dates) and README's TOC, motivation, use cases, personas, and phased implementation approach.
Operating mode and suspension strategies
docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md
Documents the implemented spec.operatingMode enum and three suspension strategies (Stop, Freeze, Hibernate) with their semantics.
Snapshot API design options
docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md
Compares cluster-scoped SnapshotProvider CRD (Option A) versus SnapshotClass/SnapshotClaim/SnapshotClaimTemplate model (Option B), with schemas, pros/cons, and usage diagrams.
Snapshot driver interface and status model
docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md
Defines the SnapshotDriver gRPC service (Checkpoint/Restore/Delete) and the SandboxSnapshotStatus model on Sandbox.status, plus an alternative dedicated Snapshot CR design note.
Resume reconciliation flow
docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md
Describes the reconciliation-driven resume workflow, deciding between checkpoint restore and cold boot.

Estimated code review effort: 2 (Simple) | ~15 minutes

Related Issues: No related issues referenced.

Related PRs: No related PRs referenced.

Suggested labels: kep, documentation, sig-apps

Suggested reviewers: No reviewer information available in the provided context.

🐰 A poem for the KEP

A rabbit dreams beside its burrow deep,
Of sandboxes that hibernate and sleep,
With claims and classes, drivers to restore,
Two paths debated, README to explore,
Then wakes them gently, reconciled once more.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description is still the template with placeholders and no actual details, issue links, or release note content. Fill in the PR summary with what changed and why, add related issue references, and provide a release note or note that none is needed.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and matches the main change: adding a snapshotClass/Claim option for suspend and resume.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@kubernetes-prow kubernetes-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new KEP under docs/keps/ proposing a snapshot provider API (with two alternative designs) to support Sandbox suspend/resume via snapshotting, including API sketches and a proposed gRPC driver contract.

Changes:

  • Introduces a new KEP README detailing motivation, phased rollout, two snapshot API options (SnapshotProvider vs SnapshotClass/Claim/Template), and a gRPC interface proposal.
  • Adds KEP metadata (kep.yaml) for the new KEP directory.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md New KEP document describing suspend/resume snapshot API options and proposed driver interface.
docs/keps/694-suspend-and-resume-plus-snapshot-api/kep.yaml New KEP metadata file for the proposed KEP.
## Non-Goals

* Implementing the underlying snapshotting mechanisms natively inside the controller (these will be delegated to specific runtime drivers).
* Implementing an exhaustive list of extendible features like Snapshot retention, Garbage collection, Networking state, configuring auto suspend policies etc.

When `spec.operatingMode` is set to `Running` (or omitted), the controller executes the following logic sequence during its reconciliation loop:

- **Observe Current State:** The controller queries the API server to check for the existence of the underlying runner Pod associated with the AgentSandbox.
Comment on lines +1 to +2
title: Suspend And Resume + Snapshot Provider API
kep-number: 694
Comment on lines +764 to +765
- **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller reads the `snapshotClassName` configuration referenced in `spec.suspensionStrategy` to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell.
- **Cold Boot (Stop / Default):** If no snapshot exists, the controller invokes the Pod driver to construct a fresh runner Pod shell from scratch using the Sandbox's original `spec.podTemplate` and binds it to the existing Persistent Volume Claims (PVCs).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md`:
- Around line 759-765: The resume flow in the Running reconciliation logic
currently pulls snapshot configuration directly from `spec.suspensionStrategy`,
but the strategy schema does not define `snapshotClassName`. Update the restore
path in the controller logic to first resolve the selected backend object,
either the relevant `SnapshotProvider` or `SnapshotClaim`, and then read that
object’s configuration for restore parameters. Make this change in the
reconciliation sequence that handles snapshot restore versus cold boot so the
restore step uses the backend object instead of an unsupported strategy field.
- Around line 236-240: The diagram text uses a non-existent SnapshotProviderSpec
property name; update the controller response description to match the schema
field. Replace the reference to spec.s3Config with SnapshotProviderSpec.s3 in
the relevant sequence step so the docs align with the actual API model.
- Around line 210-223: The Sandbox YAML example is presented as if it were valid
current v1beta1 API, but it uses draft fields like spec.suspensionStrategy and
hibernate.snapshotProviderRef that are not in the CRD. Update the example text
around the Sandbox manifest to clearly label it as proposed/draft future schema,
and apply the same disclaimer to the later Option B examples so readers do not
copy invalid current-api YAML. Use the surrounding Sandbox and Option B example
sections to make the schema status explicit without changing the illustrative
structure.
- Line 566: The current wording incorrectly merges the two backend-selection
flows: Option A resolves a SnapshotProvider directly, while Option B goes
through a resolved SnapshotClaim. Update the README section around the
SnapshotClaim/SnapshotClaimTemplate discussion to keep these paths separate,
explicitly describing Option A as provider resolution and Option B as
claim-based resolution. Reference the SnapshotClaim, SnapshotClaimTemplate, and
SnapshotProvider terms so the distinction stays clear and the architecture flow
does not imply all checkpoint paths resolve through a claim.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8ea991f5-c31b-42b0-a30d-6da28fd2fcb6

📥 Commits

Reviewing files that changed from the base of the PR and between da3b568 and dc6a676.

📒 Files selected for processing (2)
  • docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md
  • docs/keps/694-suspend-and-resume-plus-snapshot-api/kep.yaml
Comment on lines +210 to +223
**Developer/Agent references the provider in Sandbox:**
```yaml
apiVersion: agents.x-k8s.io/v1beta1
kind: Sandbox
metadata:
name: billing-agent-42
spec:
operatingMode: "Suspended"
suspensionStrategy:
type: "Hibernate"
hibernate:
snapshotProviderRef:
name: "s3-fast"
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Mark the example as proposed schema, not current API.

spec.suspensionStrategy and hibernate.snapshotProviderRef are not present in the current Sandbox v1beta1 CRD, so this YAML will fail validation if copied as-is. Please label it explicitly as draft/future schema, and apply the same caveat to the later Option B examples.

📝 Proposed fix
+> **Proposed schema:** the following fields are not available in the current `Sandbox` v1beta1 CRD.
 ```yaml
 apiVersion: agents.x-k8s.io/v1beta1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**Developer/Agent references the provider in Sandbox:**
```yaml
apiVersion: agents.x-k8s.io/v1beta1
kind: Sandbox
metadata:
name: billing-agent-42
spec:
operatingMode: "Suspended"
suspensionStrategy:
type: "Hibernate"
hibernate:
snapshotProviderRef:
name: "s3-fast"
```
> **Proposed schema:** the following fields are not available in the current `Sandbox` v1beta1 CRD.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md` around lines
210 - 223, The Sandbox YAML example is presented as if it were valid current
v1beta1 API, but it uses draft fields like spec.suspensionStrategy and
hibernate.snapshotProviderRef that are not in the CRD. Update the example text
around the Sandbox manifest to clearly label it as proposed/draft future schema,
and apply the same disclaimer to the later Option B examples so readers do not
copy invalid current-api YAML. Use the surrounding Sandbox and Option B example
sections to make the schema status explicit without changing the illustrative
structure.

Source: Path instructions

Comment on lines +236 to +240
User->>KubeAPIServer: Update Sandbox (operatingMode = Suspended, providerRef = s3-fast)
Controller->>KubeAPIServer: Watch Event: Sandbox updated
Controller->>KubeAPIServer: Fetch SnapshotProvider 's3-fast'
KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3Config)
Controller->>Driver: Route checkpoint request with S3 configurations

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Use the field name from the schema.

SnapshotProviderSpec defines s3, not s3Config, so this diagram currently points at a property that does not exist.

💡 Proposed fix
-    KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3Config)
+    KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
User->>KubeAPIServer: Update Sandbox (operatingMode = Suspended, providerRef = s3-fast)
Controller->>KubeAPIServer: Watch Event: Sandbox updated
Controller->>KubeAPIServer: Fetch SnapshotProvider 's3-fast'
KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3Config)
Controller->>Driver: Route checkpoint request with S3 configurations
User->>KubeAPIServer: Update Sandbox (operatingMode = Suspended, providerRef = s3-fast)
Controller->>KubeAPIServer: Watch Event: Sandbox updated
Controller->>KubeAPIServer: Fetch SnapshotProvider 's3-fast'
KubeAPIServer-->>Controller: Return SnapshotProvider CR (with strongly-typed spec.s3)
Controller->>Driver: Route checkpoint request with S3 configurations
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md` around lines
236 - 240, The diagram text uses a non-existent SnapshotProviderSpec property
name; update the controller response description to match the schema field.
Replace the reference to spec.s3Config with SnapshotProviderSpec.s3 in the
relevant sequence step so the docs align with the actual API model.

Source: Path instructions

Controller->>KubeAPIServer: Update Sandbox Status (Condition Suspended = True)
```

> A `SnapshotClaim` referenced by `snapshotClaimName` is **shared**: every Sandbox that names it points at the same claim and the same storage target. A `SnapshotClaimTemplate` referenced by `snapshotClaimTemplateName` is **generative**: the controller stamps out a distinct `SnapshotClaim` per Sandbox by copying the template's `spec` verbatim, sets the Sandbox as its owner (so it is garbage-collected with the Sandbox), and lets the driver derive a per-Sandbox storage key from the Sandbox identity. Either way the checkpoint path only ever reads a resolved `SnapshotClaim`, so the class-resolution flow above is identical — the only difference is whether that claim is shared or generated. This mirrors DRA's `resourceClaimName` (shared) vs `resourceClaimTemplateName` (one generated `ResourceClaim` per Pod) distinction.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Keep Option A and Option B resolution paths distinct.

This note says every checkpoint path reads a resolved SnapshotClaim, but Option A resolves SnapshotProvider directly and never introduces a claim. As written, the architecture flow collapses two different backend-selection paths into one.

🛠 Proposed fix
- Either way the checkpoint path only ever reads a resolved SnapshotClaim, so the class-resolution flow above is identical — the only difference is whether that claim is shared or generated.
+ Either way the checkpoint path first resolves the backend configuration object, then performs checkpointing. Option A resolves a SnapshotProvider directly; Option B resolves a SnapshotClaim (shared or generated) and then its SnapshotClass.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
> A `SnapshotClaim` referenced by `snapshotClaimName` is **shared**: every Sandbox that names it points at the same claim and the same storage target. A `SnapshotClaimTemplate` referenced by `snapshotClaimTemplateName` is **generative**: the controller stamps out a distinct `SnapshotClaim` per Sandbox by copying the template's `spec` verbatim, sets the Sandbox as its owner (so it is garbage-collected with the Sandbox), and lets the driver derive a per-Sandbox storage key from the Sandbox identity. Either way the checkpoint path only ever reads a resolved `SnapshotClaim`, so the class-resolution flow above is identical — the only difference is whether that claim is shared or generated. This mirrors DRA's `resourceClaimName` (shared) vs `resourceClaimTemplateName` (one generated `ResourceClaim` per Pod) distinction.
> A `SnapshotClaim` referenced by `snapshotClaimName` is **shared**: every Sandbox that names it points at the same claim and the same storage target. A `SnapshotClaimTemplate` referenced by `snapshotClaimTemplateName` is **generative**: the controller stamps out a distinct `SnapshotClaim` per Sandbox by copying the template's `spec` verbatim, sets the Sandbox as its owner (so it is garbage-collected with the Sandbox), and lets the driver derive a per-Sandbox storage key from the Sandbox identity. Either way the checkpoint path first resolves the backend configuration object, then performs checkpointing. Option A resolves a `SnapshotProvider` directly; Option B resolves a `SnapshotClaim` (shared or generated) and then its `SnapshotClass`. This mirrors DRA's `resourceClaimName` (shared) vs `resourceClaimTemplateName` (one generated `ResourceClaim` per Pod) distinction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md` at line 566,
The current wording incorrectly merges the two backend-selection flows: Option A
resolves a SnapshotProvider directly, while Option B goes through a resolved
SnapshotClaim. Update the README section around the
SnapshotClaim/SnapshotClaimTemplate discussion to keep these paths separate,
explicitly describing Option A as provider resolution and Option B as
claim-based resolution. Reference the SnapshotClaim, SnapshotClaimTemplate, and
SnapshotProvider terms so the distinction stays clear and the architecture flow
does not imply all checkpoint paths resolve through a claim.

Source: Path instructions

Comment on lines +759 to +765
When `spec.operatingMode` is set to `Running` (or omitted), the controller executes the following logic sequence during its reconciliation loop:

- **Observe Current State:** The controller queries the API server to check for the existence of the underlying runner Pod associated with the AgentSandbox.
- **Evaluate Restore Necessity:** If no active runner Pod is found, the controller checks if a snapshot of the memory/filesystem state exists for this Sandbox.
- **Reconcile Discrepancy (Restore or Boot):**
- **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller reads the `snapshotClassName` configuration referenced in `spec.suspensionStrategy` to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell.
- **Cold Boot (Stop / Default):** If no snapshot exists, the controller invokes the Pod driver to construct a fresh runner Pod shell from scratch using the Sandbox's original `spec.podTemplate` and binds it to the existing Persistent Volume Claims (PVCs).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Resolve the backend object before restore.

The resume flow currently reads snapshotClassName from spec.suspensionStrategy, but that field is not part of the proposed strategy schema. The controller needs to resolve the selected SnapshotProvider or SnapshotClaim first, then use that object’s configuration.

🧭 Proposed fix
-  - **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller reads the `snapshotClassName` configuration referenced in `spec.suspensionStrategy` to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell.
+  - **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller resolves the selected `SnapshotProvider` (Option A) or `SnapshotClaim`/`SnapshotClass` pair (Option B) to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
When `spec.operatingMode` is set to `Running` (or omitted), the controller executes the following logic sequence during its reconciliation loop:
- **Observe Current State:** The controller queries the API server to check for the existence of the underlying runner Pod associated with the AgentSandbox.
- **Evaluate Restore Necessity:** If no active runner Pod is found, the controller checks if a snapshot of the memory/filesystem state exists for this Sandbox.
- **Reconcile Discrepancy (Restore or Boot):**
- **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller reads the `snapshotClassName` configuration referenced in `spec.suspensionStrategy` to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell.
- **Cold Boot (Stop / Default):** If no snapshot exists, the controller invokes the Pod driver to construct a fresh runner Pod shell from scratch using the Sandbox's original `spec.podTemplate` and binds it to the existing Persistent Volume Claims (PVCs).
When `spec.operatingMode` is set to `Running` (or omitted), the controller executes the following logic sequence during its reconciliation loop:
- **Observe Current State:** The controller queries the API server to check for the existence of the underlying runner Pod associated with the AgentSandbox.
- **Evaluate Restore Necessity:** If no active runner Pod is found, the controller checks if a snapshot of the memory/filesystem state exists for this Sandbox.
- **Reconcile Discrepancy (Restore or Boot):**
- **Stateful Restore (Hibernate/Freeze):** If a snapshot exists, the controller resolves the selected `SnapshotProvider` (Option A) or `SnapshotClaim`/`SnapshotClass` pair (Option B) to retrieve storage backend parameters. It coordinates with the pluggable snapshot driver to restore the live process memory and filesystem state into a newly provisioned runner Pod shell.
- **Cold Boot (Stop / Default):** If no snapshot exists, the controller invokes the Pod driver to construct a fresh runner Pod shell from scratch using the Sandbox's original `spec.podTemplate` and binds it to the existing Persistent Volume Claims (PVCs).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/keps/694-suspend-and-resume-plus-snapshot-api/README.md` around lines
759 - 765, The resume flow in the Running reconciliation logic currently pulls
snapshot configuration directly from `spec.suspensionStrategy`, but the strategy
schema does not define `snapshotClassName`. Update the restore path in the
controller logic to first resolve the selected backend object, either the
relevant `SnapshotProvider` or `SnapshotClaim`, and then read that object’s
configuration for restore parameters. Make this change in the reconciliation
sequence that handles snapshot restore versus cold boot so the restore step uses
the backend object instead of an unsupported strategy field.

Source: Path instructions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

action-required: resolve-copilot-comments cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants