When a SandboxTemplate is updated in-place (i.e., the spec is changed but the metadata.name remains the same), and a SandboxWarmPool referencing this template uses the Recreate update strategy, there is a window during the update where a new SandboxClaim can be incorrectly bound to a Sandbox instance based on the previous version of the template.
The Recreate strategy works by first deleting all existing Sandbox instances (and their underlying Pods) managed by the pool, and only then creating new instances based on the updated SandboxTemplate.
The potential issue arises because the deletion of old pods is not instantaneous. If a new SandboxClaim is created and processed after the SandboxTemplate has been updated but before all old pods are fully terminated, the SandboxClaim controller might still select one of these old, not-yet-deleted pods. This leads to the claim being fulfilled with a Sandbox running an outdated configuration.
Steps to Reproduce:
- Create a SandboxTemplate (e.g., my-template v1).
- Create a SandboxWarmPool referencing my-template with spec.strategy.type: Recreate. Wait for pods to become ready.
- Update the SandboxTemplate my-template in-place with a new configuration (e.g., a different image tag or environment variable, v2).
- Immediately create a SandboxClaim referencing my-template.
- Observe the Sandbox instance bound to the claim. There is a chance the underlying pod reflects the v1 configuration, not v2.
The larger the SandboxWarmPool, the longer the Recreate strategy takes to delete all the old pods, which widens the window of vulnerability.
Expected Behavior:
A SandboxClaim created after a SandboxTemplate has been updated should only be bound to Sandbox instances that reflect the updated template definition. If no instances from the new version are available in the warm pool (which is expected during a Recreate), the claim should trigger a cold start of a new Sandbox based on the current template, or wait for the warm pool to be repopulated with updated instances. It should never adopt a pod based on a stale template spec.
Actual Behavior:
A new SandboxClaim may be bound to a Sandbox instance based on the old SandboxTemplate definition if an old pod hasn't been fully terminated yet during the Recreate process.
Impact:
This leads to version inconsistency and unexpected behavior for users, as they would expect the Sandbox to conform to the latest definition of the referenced SandboxTemplate.
Possible Solution:
The SandboxClaim controller's logic for selecting an available pod from the warm pool needs to ensure version consistency. Pods created by the SandboxWarmPool are labeled with agents.x-k8s.io/sandbox-pod-template-hash, which is derived from the SandboxTemplate's podTemplate spec.
The SandboxClaim controller can:
1. When reconciling, fetch the current version of the referenced SandboxTemplate.
2. Calculate the expected pod template hash from this current template's spec.
3. When querying for available pods from the warm pool, filter not only by the warm pool labels but also ensure that the pod's agents.x-k8s.io/sandbox-pod-template-hash label value matches the hash calculated in step 2.
These stricter selection criteria close the race: the SandboxClaim controller can no longer adopt pods that were created from a previous version of the SandboxTemplate, because only instances matching the current template spec are considered. Old pods, even if still terminating, will have a different hash and will be ignored.
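The selection logic described above could look roughly like the following Go sketch. It is illustrative only and does not reflect the actual controller code: the Pod and PodTemplateSpec types are simplified stand-ins for the real API types, the warmPoolLabel key is a hypothetical placeholder, and the FNV-based TemplateHash is an assumption (the real hash function may differ; what matters is that both controllers derive the label the same way). Skipping terminating pods is an extra guard that the proposal above does not strictly require.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

const (
	// Label named in this issue.
	templateHashLabel = "agents.x-k8s.io/sandbox-pod-template-hash"
	// Hypothetical label tying a pod to its warm pool (assumption).
	warmPoolLabel = "example.invalid/warm-pool"
)

// PodTemplateSpec is a simplified stand-in for the template's podTemplate spec.
type PodTemplateSpec struct {
	Image string            `json:"image"`
	Env   map[string]string `json:"env,omitempty"`
}

// Pod is a simplified stand-in for a warm pool pod.
type Pod struct {
	Name        string
	Labels      map[string]string
	Terminating bool // stands in for a non-nil metadata.deletionTimestamp
}

// TemplateHash derives a deterministic hash from the pod template spec.
// Assumption: FNV-1a over the JSON encoding; the real controllers may differ.
func TemplateHash(spec PodTemplateSpec) (string, error) {
	raw, err := json.Marshal(spec) // map keys are sorted, so the output is stable
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(raw)
	return fmt.Sprintf("%x", h.Sum32()), nil
}

// SelectWarmPod returns the first pod in the given pool whose template hash
// label matches wantHash. It returns false when no such pod exists, in which
// case the caller would cold start a Sandbox or wait for the pool to refill.
func SelectWarmPod(pods []Pod, pool, wantHash string) (Pod, bool) {
	for _, p := range pods {
		if p.Terminating {
			continue // extra guard: never adopt a pod that is being deleted
		}
		if p.Labels[warmPoolLabel] != pool {
			continue // belongs to a different warm pool
		}
		if p.Labels[templateHashLabel] != wantHash {
			continue // created from a stale template version
		}
		return p, true
	}
	return Pod{}, false
}
```

In a reconcile loop, the claim controller would call TemplateHash on the freshly fetched template, pass the result to SelectWarmPod, and treat a false result as "no warm instance available for the current version" rather than falling back to an old pod.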
Components:
SandboxClaim controller: Needs to implement or verify the hash-based selection logic.
SandboxWarmPool controller: Ensures the hash label is correctly applied to pods it creates.