Commit 520159c
[RayService][Kueue] Support top-level Spec.Suspend for zero-downtime upgrade (#4841)
* [RayService] Apply Suspend to RayCluster only at creation time
Propagate RayService.Spec.RayClusterSpec.Suspend onto the RayCluster only
when the RayCluster is first created. After the RayCluster exists, its
Suspend is delegated to Kueue: hash comparisons ignore Suspend, and
modifyRayCluster preserves the existing cluster's Suspend instead of
overwriting it with the RayService spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Add top-level Spec.Suspend to tear down owned resources
When Spec.Suspend is true, the RayService controller deletes every
Kubernetes resource it owns (RayClusters, head/serve Services, and
Gateway/HTTPRoute when the RayServiceIncrementalUpgrade gate is on) and
reports the lifecycle through two new conditions, Suspending and
Suspended. The transition is atomic: the first reconcile commits
Suspending=True together with the reset of ActiveServiceStatus,
PendingServiceStatus, NumServeEndpoints, and ServiceStatus in a single
Status update; deletion runs on the next reconcile once that commit is
durable, so an errored or interrupted attempt is always resumed.
Flipping Spec.Suspend back to false removes the Suspended condition and
the regular reconcile recreates the resources.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Remove suspend handler unit tests
The suspend behavior is exercised end-to-end on a kind cluster, so the
handler-level unit tests are removed for now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Sync top-level Spec.Suspend into Helm chart CRD
Run `make helm` so the helm-chart/kuberay-operator/crds copy of the
RayService CRD matches config/crd/bases after adding Spec.Suspend.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Drop create-only Suspend propagation unit tests
These tests reached into private helpers and duplicated coverage already
exercised by the zero-downtime upgrade + Suspend e2e walkthrough; drop
them to reduce coupling to internal structure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Add e2e tests for Spec.Suspend lifecycle
Covers four scenarios:
- Suspend a Running service then resume it.
- Atomic suspend: deletion completes even if Spec.Suspend is flipped
back to false mid-suspend; service then exits Suspended and serves
traffic again.
- Service created with Spec.Suspend=true: never spins up resources,
reaches Suspended directly, comes up normally on resume.
- Suspend during a zero-downtime upgrade: both active and pending
clusters are deleted; resuming applies the upgraded spec.
The atomic case surfaced a bug where the controller would stay in
Suspended forever if Spec.Suspend had been flipped to false before the
transition landed: after persisting Suspended=True we returned
ctrl.Result{} with no requeue, and the status-only update did not
re-trigger the watch predicate. Fix by requeuing after the transition so
the next reconcile observes Spec.Suspend and exits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Fix lint: trailing newline and import grouping
- Remove trailing blank line at end of rayservice_controller_unit_test.go.
- Group corev1 with the other k8s.io imports in
rayservice_suspend_test.go so goimports leaves it alone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Use ExecPodCmdWithError directly in suspend e2e
Drop the curlRayServiceFruitWithError wrapper; it duplicated what
ExecPodCmdWithError already does. Build the curl command inline at the
single call site.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Harden TestRayServiceSuspendAtomic against transient state
Stop asserting that Suspended=True is observed during the atomic flow.
That condition is only persisted for ~2s (one requeue interval) after
deletion completes if Spec.Suspend has already been flipped to false,
which made the test vulnerable to scheduling jitter under load.
Instead record the original RayCluster name before suspending and
assert (1) that cluster is deleted and (2) the eventually-Ready
RayService is backed by a different RayCluster. This proves atomic
completion more directly: the underlying cluster was actually torn down
and recreated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Mirror RayJob's atomic-suspend test in TestRayServiceSuspendAtomic
Pin the underlying RayCluster with a synthetic finalizer so deletion
cannot complete, then flip Spec.Suspend back and forth while asserting
via Consistently that Suspending stays True. This matches the
"RayJob suspend operation shoud be atomic" pattern in
rayjob_controller_suspended_test.go and exercises the atomicity
property directly instead of inferring it from cluster recreation.
After removing the finalizer the test still verifies that the original
RayCluster is deleted and a different RayCluster eventually backs the
Ready RayService.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Drop redundant comment above handleSuspend call
The function name and its own doc comment already convey what the call
site does; the inline comment was duplicative.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Drop redundant create-time Suspend comment
The comment block above `clusterSpec := rayService.Spec.RayClusterSpec.DeepCopy()`
in constructRayClusterForRayService restated what is already documented on
rayClusterSpecForHashing and modifyRayCluster.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Split suspend e2e into its own target + fix curl hang
The full ./test/e2erayservice suite was nearly exhausting the 30m
Go-test timeout, and TestRayServiceSuspendResume was failing because a
single curl attempt hung on TCP retransmits past TestTimeoutShort —
Eventually couldn't retry. Two fixes:
- Split the suspend tests into their own Make target
(test-e2e-rayservice-suspend) and Buildkite job; the existing
test-e2e-rayservice / rayservice job runs with `-skip Suspend`. Each
job gets its own 30m budget.
- Add --connect-timeout 3 --max-time 5 to the resumed-service curl so
each attempt fails fast and Eventually can actually iterate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Drop Gateway/HTTPRoute deletion from suspend handler
Suspend tear-down should only touch the resources that exist outside
the incremental-upgrade feature. Remove the gated Gateway/HTTPRoute
branch in deleteRayServiceOwnedResources, the new FailedToDeleteGateway
and FailedToDeleteHTTPRoute event types, and drop Gateway/HTTPRoute
mentions from condition messages and the Spec.Suspend doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* sync
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
* [RayService] Retain suspend conditions and switch to suspendIsOperative
Two related polish-ups to handleSuspend:
1. Replace meta.RemoveStatusCondition calls for RayServiceSuspending and
RayServiceSuspended with setCondition(False, ...). Per the K8s API
convention, absent conditions are interpreted as Unknown — but the
controller does know these are False at each former-removal site
(reaching Suspended, exiting Suspended). Setting False preserves
lastTransitionTime, reason, and message, and keeps
kube_*_status_condition timeseries from gapping. Add a new
RayServiceResumed reason constant for the Suspended=False transition
when Spec.Suspend is flipped back to false.
2. Drop the handled bool return from handleSuspend. Reconcile now
short-circuits via a new suspendIsOperative helper that reads the
Suspending / Suspended conditions handleSuspend just staged. Those
two conditions are already the source of truth for the suspend state
machine, so the bool was a redundant cache. Reduces the signature
from (bool, ctrl.Result, error) to (ctrl.Result, error).
Add two e2e assertions in TestRayServiceSuspendResume that catch the
condition-removal regression directly: after suspend completes,
Suspending must remain as False/SuspendComplete; after resume, Suspended
must remain as False/RayServiceResumed. The existing
IsRayServiceSuspended-based assertions cannot distinguish absent from
False, so this is the regression guard.
Verified end-to-end against kind-kueue-rayservice with all four suspend
e2e tests (SuspendResume, SuspendAtomic, CreatedSuspended,
SuspendDuringUpgrade) passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [RayService] Retain Upgrade/Rollback conditions through suspend lifecycle
Extend the prior commit's "retain conditions as False instead of
removing" treatment to UpgradeInProgress and RollbackInProgress, which
were still meta.RemoveStatusCondition'd when entering Suspending.
Two changes in handleSuspend:
1. When entering Suspending, set Upgrade/RollbackInProgress to False
with reason SuspendInProgress and a tense-neutral message ("No upgrade
in progress.") instead of removing them. Neutral wording stays
accurate at the Suspended terminal state too.
2. When transitioning to Suspended, re-stamp Upgrade/RollbackInProgress
with reason SuspendComplete to match the existing re-stamp of Ready
at that point. After this, every condition in the Suspended terminal
state consistently shows reason SuspendComplete, which reads cleanly
in kubectl describe.
Extend the existing Suspended-state e2e assertion in
TestRayServiceSuspendResume to loop over Suspending / UpgradeInProgress
/ RollbackInProgress so the new contract is regression-guarded by the
same mechanism that already guards Suspending.
Verified end-to-end against kind-kueue-rayservice; all four suspend
e2e tests pass, including TestRayServiceSuspendDuringUpgrade which
exercises the non-trivial case where UpgradeInProgress was actually
True before suspend was triggered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent 798089c commit 520159c
12 files changed
Lines changed: 630 additions & 89 deletions
File tree
- .buildkite
- docs/reference
- helm-chart/kuberay-operator/crds
- ray-operator
- apis/ray/v1
- config/crd/bases
- controllers/ray
- utils
- pkg/client/applyconfiguration/ray/v1
- test
- e2erayservice
- support
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
55 | 55 | | |
56 | 56 | | |
57 | 57 | | |
58 | | - | |
| 58 | + | |
59 | 59 | | |
60 | 60 | | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
61 | 81 | | |
62 | 82 | | |
63 | 83 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
507 | 507 | | |
508 | 508 | | |
509 | 509 | | |
| 510 | + | |
510 | 511 | | |
511 | 512 | | |
512 | 513 | | |
| |||
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
81 | 81 | | |
82 | 82 | | |
83 | 83 | | |
84 | | - | |
85 | | - | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
86 | 90 | | |
87 | 91 | | |
88 | 92 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
122 | 122 | | |
123 | 123 | | |
124 | 124 | | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
125 | 131 | | |
126 | 132 | | |
127 | 133 | | |
| |||
209 | 215 | | |
210 | 216 | | |
211 | 217 | | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
212 | 223 | | |
213 | 224 | | |
214 | 225 | | |
| |||
221 | 232 | | |
222 | 233 | | |
223 | 234 | | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
| 238 | + | |
224 | 239 | | |
225 | 240 | | |
226 | 241 | | |
| |||
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
0 commit comments