[bug] SandboxClaim can adopt stale pod version during SandboxWarmPool Recreate on in-place Template Update #764

Description

@vicentefb

When a SandboxTemplate is updated in-place (i.e., the spec is changed but the metadata.name remains the same), and a SandboxWarmPool referencing this template uses the Recreate update strategy, there is a window during the update where a new SandboxClaim can be incorrectly bound to a Sandbox instance based on the previous version of the template.

The Recreate strategy works by first deleting all existing Sandbox instances (and their underlying Pods) managed by the pool, and only then creating new instances based on the updated SandboxTemplate.

The potential issue arises because the deletion of old pods is not instantaneous. If a new SandboxClaim is created and processed after the SandboxTemplate has been updated but before all old pods are fully terminated, the SandboxClaim controller might still select one of these old, not-yet-deleted pods. This leads to the claim being fulfilled with a Sandbox running an outdated configuration.

Steps to Reproduce:

  1. Create a SandboxTemplate (e.g., my-template v1).
  2. Create a SandboxWarmPool referencing my-template with spec.strategy.type: Recreate. Wait for pods to become ready.
  3. Update the SandboxTemplate my-template in-place with a new configuration (e.g., a different image tag or environment variable; call this v2).
  4. Immediately create a SandboxClaim referencing my-template.
  5. Observe the Sandbox instance bound to the claim. The underlying pod may reflect the v1 configuration instead of v2. The larger the SandboxWarmPool, the longer the Recreate strategy takes to delete all the old pods, which widens the window in which a stale pod can be adopted.

Expected Behavior:

A SandboxClaim created after a SandboxTemplate has been updated should only be bound to Sandbox instances that reflect the updated template definition. If no instances from the new version are available in the warm pool (which is expected during a Recreate), the claim should trigger a cold start of a new Sandbox based on the current template, or wait for the warm pool to be repopulated with updated instances. It should never adopt a pod based on a stale template spec.
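The expected decision can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: the `Sandbox` type, its `TemplateVersion` field, the `Action` values, and the `decide` function are all hypothetical names introduced here, and the sketch only models the cold-start fallback (waiting for the pool to repopulate would be an equally valid choice).

```go
package main

import "fmt"

// Action is a hypothetical enum for what the claim controller does next.
type Action string

const (
	AdoptWarm Action = "adopt-warm" // bind the claim to an existing warm instance
	ColdStart Action = "cold-start" // create a fresh Sandbox from the current template
)

// Sandbox is a hypothetical, simplified view of a warm-pool instance.
// TemplateVersion stands in for whatever marker identifies the template
// revision the instance was built from.
type Sandbox struct {
	Name            string
	TemplateVersion string
}

// decide returns AdoptWarm with the instance name only when a warm instance
// matches the current template version. Otherwise it falls back to a cold
// start, so a stale instance is never adopted.
func decide(warm []Sandbox, currentVersion string) (Action, string) {
	for _, s := range warm {
		if s.TemplateVersion == currentVersion {
			return AdoptWarm, s.Name
		}
	}
	return ColdStart, ""
}

func main() {
	// During a Recreate, only old (v1) instances may still be present.
	duringRecreate := []Sandbox{{Name: "sb-old", TemplateVersion: "v1"}}
	action, name := decide(duringRecreate, "v2")
	fmt.Println(action, name)

	// Once the pool has been repopulated, a v2 instance can be adopted.
	afterRecreate := []Sandbox{{Name: "sb-new", TemplateVersion: "v2"}}
	action, name = decide(afterRecreate, "v2")
	fmt.Println(action, name)
}
```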

Actual Behavior:

A new SandboxClaim may be bound to a Sandbox instance based on the old SandboxTemplate definition if an old pod hasn't been fully terminated yet during the Recreate process.

Impact:

This leads to version inconsistency and unexpected behavior for users, who expect the Sandbox to conform to the latest definition of the referenced SandboxTemplate.

Possible Solution:

The SandboxClaim controller's logic for selecting an available pod from the warm pool needs to ensure version consistency. Pods created by the SandboxWarmPool are labeled with agents.x-k8s.io/sandbox-pod-template-hash, which is derived from the SandboxTemplate's podTemplate spec.

The SandboxClaim controller can:

  1. When reconciling, fetch the current version of the referenced SandboxTemplate.
  2. Calculate the expected pod template hash from this current template's spec.
  3. When querying for available pods from the warm pool, filter not only by the warm pool labels but also ensure that the pod's agents.x-k8s.io/sandbox-pod-template-hash label value matches the hash calculated in step 2.
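The steps above could look something like the sketch below. It uses only the Go standard library so it runs on its own. The label key is the one named in this issue; everything else is an assumption for illustration: `PodTemplate` and `Pod` are simplified stand-ins for the real API types, and `templateHash` uses FNV-1a over the JSON encoding, which is not necessarily how the project derives the hash. A real implementation would have to reuse the exact function the SandboxWarmPool controller uses when it applies the label.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// hashLabel is the label key described in this issue.
const hashLabel = "agents.x-k8s.io/sandbox-pod-template-hash"

// PodTemplate is a hypothetical, simplified stand-in for the
// SandboxTemplate's podTemplate spec.
type PodTemplate struct {
	Image string            `json:"image"`
	Env   map[string]string `json:"env,omitempty"`
}

// Pod is a hypothetical stand-in for a warm-pool pod; only labels matter here.
type Pod struct {
	Name   string
	Labels map[string]string
}

// templateHash is an ASSUMED hashing scheme (FNV-1a over the JSON encoding).
// encoding/json sorts map keys, so the encoding is deterministic.
func templateHash(t PodTemplate) (string, error) {
	b, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(b) // hash.Hash's Write never returns an error
	return fmt.Sprintf("%x", h.Sum32()), nil
}

// filterByHash keeps only pods whose hash label equals the expected hash.
// Pods with a missing label or a different hash are dropped.
func filterByHash(pods []Pod, want string) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.Labels[hashLabel] == want {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	v1 := PodTemplate{Image: "example/app:v1"}
	v2 := PodTemplate{Image: "example/app:v2"} // template after the in-place update

	h1, _ := templateHash(v1)
	h2, _ := templateHash(v2) // step 2: hash of the current template

	pods := []Pod{
		{Name: "old-0", Labels: map[string]string{hashLabel: h1}}, // still terminating
		{Name: "new-0", Labels: map[string]string{hashLabel: h2}},
	}

	// Step 3: only pods matching the current hash remain candidates.
	for _, p := range filterByHash(pods, h2) {
		fmt.Println("candidate:", p.Name)
	}
}
```

In a real controller the label comparison would more likely be pushed into the list call as a label selector, so stale pods are never returned in the first place. The in-memory filter here only illustrates the matching rule.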

These stricter selection criteria prevent the SandboxClaim controller from adopting pods that were created from a previous version of the SandboxTemplate, which closes the race condition described above: only instances matching the current template spec are considered. Old pods, even if still terminating, carry a different hash and are ignored.

Components:

  • SandboxClaim controller: Needs to implement or verify the hash-based selection logic.
  • SandboxWarmPool controller: Ensures the hash label is correctly applied to pods it creates.

Metadata

Assignees

No one assigned

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
