Commit 520159c

kevin85421 and claude authored
[RayService][Kueue] Support top-level Spec.Suspend for zero-downtime upgrade (#4841)
* [RayService] Apply Suspend to RayCluster only at creation time

  Propagate RayService.Spec.RayClusterSpec.Suspend onto the RayCluster only when the RayCluster is first created. After the RayCluster exists, its Suspend is delegated to Kueue: hash comparisons ignore Suspend, and modifyRayCluster preserves the existing cluster's Suspend instead of overwriting it with the RayService spec.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Add top-level Spec.Suspend to tear down owned resources

  When Spec.Suspend is true, the RayService controller deletes every Kubernetes resource it owns (RayClusters, head/serve Services, and Gateway/HTTPRoute when the RayServiceIncrementalUpgrade gate is on) and reports the lifecycle through two new conditions, Suspending and Suspended.

  The transition is atomic: the first reconcile commits Suspending=True together with the reset of ActiveServiceStatus, PendingServiceStatus, NumServeEndpoints, and ServiceStatus in a single Status update; deletion runs on the next reconcile once that commit is durable, so an errored or interrupted attempt is always resumed.

  Flipping Spec.Suspend back to false removes the Suspended condition and the regular reconcile recreates the resources.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Remove suspend handler unit tests

  The suspend behavior is exercised end-to-end on a kind cluster, so the handler-level unit tests are removed for now.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Sync top-level Spec.Suspend into Helm chart CRD

  Run `make helm` so the helm-chart/kuberay-operator/crds copy of the RayService CRD matches config/crd/bases after adding Spec.Suspend.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Drop create-only Suspend propagation unit tests

  These tests reached into private helpers and duplicated coverage already exercised by the zero-downtime upgrade + Suspend e2e walkthrough; drop them to reduce coupling to internal structure.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Add e2e tests for Spec.Suspend lifecycle

  Covers four scenarios:
  - Suspend a Running service then resume it.
  - Atomic suspend: deletion completes even if Spec.Suspend is flipped back to false mid-suspend; service then exits Suspended and serves traffic again.
  - Service created with Spec.Suspend=true: never spins up resources, reaches Suspended directly, comes up normally on resume.
  - Suspend during a zero-downtime upgrade: both active and pending clusters are deleted; resuming applies the upgraded spec.

  The atomic case surfaced a bug where the controller would stay in Suspended forever if Spec.Suspend had been flipped to false before the transition landed: after persisting Suspended=True we returned ctrl.Result{} with no requeue, and the status-only update did not re-trigger the watch predicate. Fix by requeuing after the transition so the next reconcile observes Spec.Suspend and exits.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Fix lint: trailing newline and import grouping

  - Remove trailing blank line at end of rayservice_controller_unit_test.go.
  - Group corev1 with the other k8s.io imports in rayservice_suspend_test.go so goimports leaves it alone.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Use ExecPodCmdWithError directly in suspend e2e

  Drop the curlRayServiceFruitWithError wrapper; it duplicated what ExecPodCmdWithError already does. Build the curl command inline at the single call site.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Harden TestRayServiceSuspendAtomic against transient state

  Stop asserting that Suspended=True is observed during the atomic flow. That condition is only persisted for ~2s (one requeue interval) after deletion completes if Spec.Suspend has already been flipped to false, which made the test vulnerable to scheduling jitter under load.

  Instead record the original RayCluster name before suspending and assert (1) that cluster is deleted and (2) the eventually-Ready RayService is backed by a different RayCluster. This proves atomic completion more directly: the underlying cluster was actually torn down and recreated.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Mirror RayJob's atomic-suspend test in TestRayServiceSuspendAtomic

  Pin the underlying RayCluster with a synthetic finalizer so deletion cannot complete, then flip Spec.Suspend back and forth while asserting via Consistently that Suspending stays True. This matches the "RayJob suspend operation shoud be atomic" pattern in rayjob_controller_suspended_test.go and exercises the atomicity property directly instead of inferring it from cluster recreation.

  After removing the finalizer the test still verifies that the original RayCluster is deleted and a different RayCluster eventually backs the Ready RayService.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Drop redundant comment above handleSuspend call

  The function name and its own doc comment already convey what the call site does; the inline comment was duplicative.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Drop redundant create-time Suspend comment

  The comment block above `clusterSpec := rayService.Spec.RayClusterSpec.DeepCopy()` in constructRayClusterForRayService restated what is already documented on rayClusterSpecForHashing and modifyRayCluster.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Split suspend e2e into its own target + fix curl hang

  The full ./test/e2erayservice suite was nearly exhausting the 30m Go-test timeout, and TestRayServiceSuspendResume was failing because a single curl attempt hung on TCP retransmits past TestTimeoutShort, so Eventually couldn't retry. Two fixes:
  - Split the suspend tests into their own Make target (test-e2e-rayservice-suspend) and Buildkite job; the existing test-e2e-rayservice / rayservice job runs with `-skip Suspend`. Each job gets its own 30m budget.
  - Add --connect-timeout 3 --max-time 5 to the resumed-service curl so each attempt fails fast and Eventually can actually iterate.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Drop Gateway/HTTPRoute deletion from suspend handler

  Suspend tear-down should only touch the resources that exist outside the incremental-upgrade feature. Remove the gated Gateway/HTTPRoute branch in deleteRayServiceOwnedResources, the new FailedToDeleteGateway and FailedToDeleteHTTPRoute event types, and drop Gateway/HTTPRoute mentions from condition messages and the Spec.Suspend doc.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* sync

  Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>

* [RayService] Retain suspend conditions and switch to suspendIsOperative

  Two related polish-ups to handleSuspend:

  1. Replace meta.RemoveStatusCondition calls for RayServiceSuspending and RayServiceSuspended with setCondition(False, ...). Per the K8s API convention, absent conditions are interpreted as Unknown, but the controller does know these are False at each former-removal site (reaching Suspended, exiting Suspended). Setting False preserves lastTransitionTime, reason, and message, and keeps kube_*_status_condition timeseries from gapping. Add a new RayServiceResumed reason constant for the Suspended=False transition when Spec.Suspend is flipped back to false.

  2. Drop the handled bool return from handleSuspend. Reconcile now short-circuits via a new suspendIsOperative helper that reads the Suspending / Suspended conditions handleSuspend just staged. Those two conditions are already the source of truth for the suspend state machine, so the bool was a redundant cache. Reduces the signature from (bool, ctrl.Result, error) to (ctrl.Result, error).

  Add two e2e assertions in TestRayServiceSuspendResume that catch the condition-removal regression directly: after suspend completes, Suspending must remain as False/SuspendComplete; after resume, Suspended must remain as False/RayServiceResumed. The existing IsRayServiceSuspended-based assertions cannot distinguish absent from False, so this is the regression guard.

  Verified end-to-end against kind-kueue-rayservice with all four suspend e2e tests (SuspendResume, SuspendAtomic, CreatedSuspended, SuspendDuringUpgrade) passing.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [RayService] Retain Upgrade/Rollback conditions through suspend lifecycle

  Extend the prior commit's "retain conditions as False instead of removing" treatment to UpgradeInProgress and RollbackInProgress, which were still meta.RemoveStatusCondition'd when entering Suspending. Two changes in handleSuspend:

  1. When entering Suspending, set Upgrade/RollbackInProgress to False with reason SuspendInProgress and a tense-neutral message ("No upgrade in progress.") instead of removing them. Neutral wording stays accurate at the Suspended terminal state too.

  2. When transitioning to Suspended, re-stamp Upgrade/RollbackInProgress with reason SuspendComplete to match the existing re-stamp of Ready at that point. After this, every condition in the Suspended terminal state consistently shows reason SuspendComplete, which reads cleanly in kubectl describe.

  Extend the existing Suspended-state e2e assertion in TestRayServiceSuspendResume to loop over Suspending / UpgradeInProgress / RollbackInProgress so the new contract is regression-guarded by the same mechanism that already guards Suspending.

  Verified end-to-end against kind-kueue-rayservice; all four suspend e2e tests pass, including TestRayServiceSuspendDuringUpgrade which exercises the non-trivial case where UpgradeInProgress was actually True before suspend was triggered.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
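The suspend lifecycle described in the commit message can be pictured as a small state machine. The following is a minimal sketch, not the controller's code: `Service`, `Condition`, `Owned`, and `Pinned` are hypothetical in-memory stand-ins, the reason used for Suspending=True (`SuspendRequested`) is an assumption, and deletion plus the transition to Suspended are collapsed into one step for brevity. Only the names `handleSuspend` and `suspendIsOperative` and the condition types and other reasons come from the commit.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition; only the fields
// this sketch needs are modeled.
type Condition struct {
	Status bool
	Reason string
}

// Service is a hypothetical in-memory model of a RayService.
type Service struct {
	SpecSuspend bool                 // mirrors Spec.Suspend
	Owned       int                  // owned resources still present
	Pinned      bool                 // true while a finalizer blocks deletion
	Conds       map[string]Condition // keyed by condition type
}

func (s *Service) isTrue(condType string) bool { return s.Conds[condType].Status }

// suspendIsOperative reports whether the regular reconcile should be skipped.
func suspendIsOperative(s *Service) bool {
	return s.isTrue("Suspending") || s.isTrue("Suspended")
}

// handleSuspend advances the suspend state machine by one reconcile.
func handleSuspend(s *Service) {
	switch {
	case s.isTrue("Suspending"):
		// Atomic phase: Spec.Suspend is deliberately not consulted here.
		if s.Owned > 0 && !s.Pinned {
			s.Owned = 0 // stands in for deleting RayClusters and Services
		}
		if s.Owned == 0 {
			s.Conds["Suspending"] = Condition{Status: false, Reason: "SuspendComplete"}
			s.Conds["Suspended"] = Condition{Status: true, Reason: "SuspendComplete"}
		}
	case s.isTrue("Suspended"):
		if !s.SpecSuspend {
			// Set False rather than removing, so the condition stays visible.
			s.Conds["Suspended"] = Condition{Status: false, Reason: "RayServiceResumed"}
		}
	case s.SpecSuspend:
		// The first reconcile only commits the state; deletion starts on the next one.
		// Assumption: the reason for Suspending=True is SuspendRequested.
		s.Conds["Suspending"] = Condition{Status: true, Reason: "SuspendRequested"}
	}
}

func main() {
	s := &Service{SpecSuspend: true, Owned: 3, Pinned: true, Conds: map[string]Condition{}}
	handleSuspend(s)      // commits Suspending=True
	s.SpecSuspend = false // the flag is flipped back mid-suspend
	handleSuspend(s)      // deletion is blocked by the finalizer
	fmt.Println(s.isTrue("Suspending"), s.Owned) // true 3
	s.Pinned = false
	handleSuspend(s) // deletion completes, Suspended=True
	fmt.Println(s.isTrue("Suspended"), suspendIsOperative(s)) // true true
	handleSuspend(s) // observes Spec.Suspend=false and exits Suspended
	fmt.Println(s.Conds["Suspended"].Reason, suspendIsOperative(s)) // RayServiceResumed false
}
```

The `main` walk-through mirrors the finalizer-pinned scenario from TestRayServiceSuspendAtomic: while deletion cannot complete, flipping the flag has no effect, and the resume is only observed on the reconcile after Suspended is reached.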
1 parent 798089c commit 520159c

12 files changed

Lines changed: 630 additions & 89 deletions

‎.buildkite/test-e2e.yml‎

Lines changed: 21 additions & 1 deletion
@@ -55,9 +55,29 @@
     - set -o pipefail
     - mkdir -p "$(pwd)/tmp" && export KUBERAY_TEST_OUTPUT_DIR=$(pwd)/tmp
     - echo "KUBERAY_TEST_OUTPUT_DIR=$$KUBERAY_TEST_OUTPUT_DIR"
-    - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 30m -v ./test/e2erayservice 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-rayservice-log.tar -T - && exit 1)
+    - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 30m -v -skip Suspend ./test/e2erayservice 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-rayservice-log.tar -T - && exit 1)
     - echo "--- END:e2e rayservice (nightly operator) tests finished"
 
+- label: 'Test E2E rayservice suspend (nightly operator)'
+  instance_size: large
+  image: golang:1.26-bookworm
+  commands:
+    - source .buildkite/setup-env.sh
+    - kind create cluster --wait 900s --config ./ci/kind-config-buildkite.yml
+    - kubectl config set clusters.kind-kind.server https://docker:6443
+    # Build nightly KubeRay operator image
+    - pushd ray-operator
+    - source ../.buildkite/build-start-operator.sh
+    - kubectl wait --timeout=90s --for=condition=Available=true deployment kuberay-operator
+    # Run suspend e2e tests and print KubeRay operator logs if tests fail
+    - echo "--- START:Running e2e rayservice suspend (nightly operator) tests"
+    - if [ -n "$${KUBERAY_TEST_RAY_IMAGE}" ]; then echo "Using Ray Image $${KUBERAY_TEST_RAY_IMAGE}"; fi
+    - set -o pipefail
+    - mkdir -p "$(pwd)/tmp" && export KUBERAY_TEST_OUTPUT_DIR=$(pwd)/tmp
+    - echo "KUBERAY_TEST_OUTPUT_DIR=$$KUBERAY_TEST_OUTPUT_DIR"
+    - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 30m -v -run Suspend ./test/e2erayservice 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-rayservice-suspend-log.tar -T - && exit 1)
+    - echo "--- END:e2e rayservice suspend (nightly operator) tests finished"
+
 - label: 'Test RayService Incremental Upgrade E2E (nightly operator)'
   instance_size: large
   image: golang:1.26-bookworm

‎docs/reference/api.md‎

Lines changed: 1 addition & 0 deletions
@@ -507,6 +507,7 @@ _Appears in:_
 | `serveConfigV2` _string_ | Important: Run "make" to regenerate code after modifying this file<br />Defines the applications and deployments to deploy, should be a YAML multi-line scalar string. | | |
 | `rayClusterConfig` _[RayClusterSpec](#rayclusterspec)_ | | | |
 | `excludeHeadPodFromServeSvc` _boolean_ | If the field is set to true, the value of the label `ray.io/serve` on the head Pod should always be false.<br />Therefore, the head Pod's endpoint will not be added to the Kubernetes Serve service. | | |
+| `suspend` _boolean_ | Suspend indicates whether the RayService should suspend its execution. When set to true,<br />all Kubernetes resources owned by the RayService controller (RayClusters and Kubernetes<br />Services) will be deleted. Setting it back to false will allow the RayService controller<br />to recreate the resources. | | |
 
 
 

‎helm-chart/kuberay-operator/crds/ray.io_rayservices.yaml‎

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default.

‎ray-operator/Makefile‎

Lines changed: 6 additions & 2 deletions
@@ -81,8 +81,12 @@ test-e2e-autoscaler: manifests fmt vet ## Run e2e autoscaler tests.
 	go test -timeout 30m -v $(WHAT)
 
 test-e2e-rayservice: WHAT ?= ./test/e2erayservice
-test-e2e-rayservice: manifests fmt vet ## Run e2e RayService tests.
-	go test -timeout 30m -v $(WHAT)
+test-e2e-rayservice: manifests fmt vet ## Run e2e RayService tests (excluding suspend tests, which have their own target).
+	go test -timeout 30m -v -skip Suspend $(WHAT)
+
+test-e2e-rayservice-suspend: WHAT ?= ./test/e2erayservice
+test-e2e-rayservice-suspend: manifests fmt vet ## Run e2e RayService suspend tests.
+	go test -timeout 30m -v -run Suspend $(WHAT)
 
 test-e2e-upgrade: WHAT ?= ./test/e2eupgrade
 test-e2e-upgrade: manifests fmt vet ## Run e2e operator upgrade tests.
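The split between the two targets relies on `-run Suspend` and `-skip Suspend` being unanchored regular expressions over test names, so any test whose name contains `Suspend` lands in the suspend job and every other test stays in the original job. A rough illustration using Go's `regexp` package follows; `selects` is a hypothetical helper, and the last two test names are assumed for the example (only the first three appear verbatim in the commit message).

```go
package main

import (
	"fmt"
	"regexp"
)

// selects reports whether an unanchored pattern matches a top-level test
// name, approximating how a single-element -run or -skip pattern is applied.
func selects(pattern, name string) bool {
	return regexp.MustCompile(pattern).MatchString(name)
}

func main() {
	names := []string{
		"TestRayServiceSuspendResume",
		"TestRayServiceSuspendAtomic",
		"TestRayServiceSuspendDuringUpgrade",
		"TestRayServiceCreatedSuspended", // assumed full name; contains "Suspend"
		"TestRayServiceInPlaceUpdate",    // assumed name of a non-suspend test
	}
	for _, n := range names {
		target := "test-e2e-rayservice"
		if selects("Suspend", n) {
			target = "test-e2e-rayservice-suspend"
		}
		fmt.Println(n, "->", target)
	}
}
```

One consequence of the substring match is that a future test with `Suspend` anywhere in its name moves to the suspend job automatically, whether or not that is intended.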

‎ray-operator/apis/ray/v1/rayservice_types.go‎

Lines changed: 15 additions & 0 deletions
@@ -122,6 +122,12 @@ type RayServiceSpec struct {
 	// Therefore, the head Pod's endpoint will not be added to the Kubernetes Serve service.
 	// +optional
 	ExcludeHeadPodFromServeSvc bool `json:"excludeHeadPodFromServeSvc,omitempty"`
+	// Suspend indicates whether the RayService should suspend its execution. When set to true,
+	// all Kubernetes resources owned by the RayService controller (RayClusters and Kubernetes
+	// Services) will be deleted. Setting it back to false will allow the RayService controller
+	// to recreate the resources.
+	// +optional
+	Suspend bool `json:"suspend,omitempty"`
 }
 
 // RayServiceStatuses defines the observed state of RayService
@@ -209,6 +215,11 @@ const (
 	UpgradeInProgress RayServiceConditionType = "UpgradeInProgress"
 	// RollbackInProgress means the RayService is currently rolling back an in-progress upgrade to the original cluster state.
 	RollbackInProgress RayServiceConditionType = "RollbackInProgress"
+	// RayServiceSuspending means the RayService is in the middle of deleting its owned resources in response to Spec.Suspend.
+	// Once entered, the suspend operation completes atomically regardless of later changes to Spec.Suspend.
+	RayServiceSuspending RayServiceConditionType = "Suspending"
+	// RayServiceSuspended means all resources owned by the RayService controller have been deleted and the RayService is suspended.
+	RayServiceSuspended RayServiceConditionType = "Suspended"
 )
 
 const (
@@ -221,6 +232,10 @@ const (
 	NoActiveCluster RayServiceConditionReason = "NoActiveCluster"
 	RayServiceValidationFailed RayServiceConditionReason = "ValidationFailed"
 	TargetClusterChanged RayServiceConditionReason = "TargetClusterChanged"
+	SuspendRequested RayServiceConditionReason = "SuspendRequested"
+	SuspendInProgress RayServiceConditionReason = "SuspendInProgress"
+	SuspendComplete RayServiceConditionReason = "SuspendComplete"
+	RayServiceResumed RayServiceConditionReason = "RayServiceResumed"
 )
 
 // +kubebuilder:object:root=true

‎ray-operator/config/crd/bases/ray.io_rayservices.yaml‎

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default.
