[Autoscaler] Add noDriverTimeoutSeconds for cluster termination with KubeRay#63465
…ation Signed-off-by: win5923 <ken89@kimo.com>
…utSeconds Signed-off-by: win5923 <ken89@kimo.com>
… clock in reconciler Signed-off-by: win5923 <ken89@kimo.com>
E2E testing:
Apply testing yaml:
RayCluster idle termination without an active driver
If no active Ray driver is detected for more than 30 seconds, the cluster will be marked as idle and deleted by KubeRay.
Active driver prevents cluster termination

cc @rueian to take a look if you have time :)
Nice idea, I’ve moved the implementation into the KubeRay provider and simplified the overall logic.
def _refresh_idle_termination_seconds(self) -> None:
    """Reads idleTerminationSeconds from the cached RayCluster spec.

    Trusts the value as accepted by KubeRay's admission webhook
    """
Here Ray unconditionally trusts the value from KubeRay; I think validation only needs to happen in KubeRay.
if node.status == NodeStatus.DEAD:
    continue
if is_head_node(node):
    continue
What if a detached actor is running on the head?
I skip the head node because Ray's internal detached actors are pinned there and hold its idle_duration_ms at 0 permanently. If we didn't skip the head, the cluster-idle predicate would never fire and the feature would be a no-op.
The reason to exclude at the node level rather than filter the actors is that the raylet cannot distinguish a Ray-internal detached actor from a user actor; both are just leased workers. Filtering specific internal actors out of the busy calculation would be fragile and would have to enumerate every head-pinned internal actor (_StatsActor, Serve controller/proxy, JobSupervisor, dashboard API actors, the autoscaler itself, …).
Excluding the head node from the idle check is simpler and correct: head idleness is not a meaningful scale-down signal in the first place, since the head is never a terminable worker. Pending resource demand and per-worker idle already cover the cases we care about.
Please let me know if I've said anything incorrect.
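A minimal sketch of the head-excluded idle predicate described in this comment, assuming a simplified node record. `Node`, its fields, and `cluster_is_idle` are hypothetical names for illustration and are not the autoscaler's actual types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class NodeStatus(Enum):
    RUNNING = "RUNNING"
    IDLE = "IDLE"
    DEAD = "DEAD"


@dataclass
class Node:
    # Hypothetical, simplified stand-in for the autoscaler's node state.
    node_id: str
    status: NodeStatus
    is_head: bool
    idle_duration_ms: int


def cluster_is_idle(nodes: Iterable[Node], threshold_ms: int) -> bool:
    """Return True when every live worker has been idle for at least threshold_ms."""
    for node in nodes:
        if node.status == NodeStatus.DEAD:
            continue  # dead nodes carry no idle signal
        if node.is_head:
            continue  # head idle time is held at 0 by internal detached actors
        if node.idle_duration_ms < threshold_ms:
            return False  # at least one live worker is busy or recently busy
    return True
```

Note that in this sketch a cluster with no live workers counts as idle, which is exactly why a user actor pinned to the head is a blind spot for this approach.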
But when a user's detached actor is running on the head, you shouldn't terminate the cluster, right?
I checked the codebase and think detecting a user actor on the head would mean telling user and internal actors apart, and Ray has no reliable signal for that. There's no is_internal flag, and ray_namespace/name are user-controllable with no shared convention for internal actors.
A namespace denylist fails both ways:
- it misses internal actors we didn't enumerate
- it can misclassify a user actor that shares a namespace
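Both failure modes can be illustrated with a toy example. Everything here is hypothetical: the namespace names, the `ActorInfo` record, and the `is_internal` ground-truth field exist only in this sketch (Ray exposes no such flag, which is the point).

```python
from dataclasses import dataclass

# Hypothetical denylist; these namespace names are illustrative only.
INTERNAL_NAMESPACES = {"internal-a", "internal-b"}


@dataclass
class ActorInfo:
    name: str
    ray_namespace: str
    is_internal: bool  # ground truth known only to this example, not to Ray


def looks_internal(actor: ActorInfo) -> bool:
    """Classify an actor as internal purely from its namespace."""
    return actor.ray_namespace in INTERNAL_NAMESPACES


# Failure 1: an internal actor whose namespace was never enumerated.
missed = ActorInfo("new_internal_actor", "internal-c", is_internal=True)
# Failure 2: a user actor that happens to share a denylisted namespace.
collided = ActorInfo("user_actor", "internal-a", is_internal=False)

misclassified = [a.name for a in (missed, collided) if looks_internal(a) != a.is_internal]
```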
Note we don't distinguish internal vs user on workers either; the worker idle check just treats any leased worker as busy. That works on workers because they have no permanently-resident internal actors, so user work there is caught naturally.
For example, Ray Data's _StatsActor is pinned to the driver's node, which is the head in KubeRay:
ray/python/ray/data/_internal/stats.py
Lines 861 to 887 in 750ef4e
Trade-off: keep the head excluded and rely on the worker idle signal. The only blind spot is a user actor pinned to the head. I think this needs to be stated explicitly in the Ray docs so users know not to pin detached actors to the head when cluster-level idle termination is enabled.
Can we rename this feature to something like noDriverTerminationTimeout, and we only check if there is a driver running?
For sure! I've updated the implementation and description. d5837c6
This is a great suggestion. It avoids having to reason about head vs. worker nodes at all, and it matches what we actually care about: whether the cluster has any running Ray jobs.
…KubeRay Signed-off-by: win5923 <ken89@kimo.com>
@edoakes, please review and merge this when you have a chance. 🙏
… KubeRay (ray-project#63465)

## Description

Terminate a cluster managed by the V2 autoscaler when no user driver is attached. Related to ray-project/kuberay#4815

When `autoscalerOptions.noDriverTimeoutSeconds` is set, the V2 autoscaler evaluates a no-driver predicate every reconcile loop and, when it fires, patches a single annotation on the RayCluster CR:

```yaml
metadata:
  annotations:
    ray.io/no-driver-ttl-expired: "true"
```

The KubeRay operator observes the condition and decides the terminal action. (delete RayCluster)

A cluster is eligible for termination only when both of the following hold, and only when they have held continuously for at least noDriverTimeoutSeconds:

1. No active user driver is attached.
2. Condition 1 has held for at least noDriverTimeoutSeconds.

Note that **detached actors do not count as a driver, a cluster running only detached actors is still eligible for termination.**

## Changes

This PR adds `autoscalerOptions.noDriverTimeoutSeconds`. The decision lives on `KubeRayProvider`: it tracks how long the cluster has had no driver attached and, once the timeout is exceeded, dispatches an annotation for KubeRay to terminate the cluster, freeing the head pod and any reserved capacity that would otherwise linger.

1. New autoscalerOptions.noDriverTimeoutSeconds field, V2 + KubeRay only
   - Existing CRs and existing V1 / non-KubeRay deployments see no behavior change.
   - The field is read only by KubeRayProvider; unset disables the feature.
2. No-driver decision lives on KubeRayProvider
   - Evaluated against `gcs_client.get_all_job_info(...)`, filtering out Ray dashboard jobs. Fails closed.
   - The provider records when the cluster was first seen with no driver, and dispatches once that has held for `noDriverTimeoutSeconds`. The timer resets if a driver reappears.
3. Dispatch: single annotation on the RayCluster CR
   The reconciler calls `evaluate_no_driver_termination`, which patches the RayCluster CR with `ray.io/no-driver-ttl-expired: "true"`.

The KubeRay Operator implementation is covered in ray-project/kuberay#4815.

## Related issues

Closes ray-project#63452

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: win5923 <ken89@kimo.com>
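The timer behavior described above (record when no driver was first seen, dispatch once the timeout has held, reset when a driver reappears) can be sketched roughly as follows. This is not the PR's code: `JobInfo`, its `is_dashboard_job` field, and `NoDriverTracker` are hypothetical, and "fails closed" is read here as "if the job list is unavailable, assume a driver may be attached and do nothing". Only the annotation key and value come from the description.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

NO_DRIVER_ANNOTATION = "ray.io/no-driver-ttl-expired"


@dataclass
class JobInfo:
    # Hypothetical, simplified view of a job record.
    job_id: str
    is_dead: bool
    is_dashboard_job: bool  # assumption: dashboard jobs can be recognized somehow


class NoDriverTracker:
    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._no_driver_since: Optional[float] = None

    def evaluate(self, jobs: Optional[Iterable[JobInfo]]) -> Optional[dict]:
        """Return an annotation patch body once the timeout has elapsed, else None."""
        if jobs is None:
            # Fail closed (assumed meaning): job list unavailable, so do not terminate.
            self._no_driver_since = None
            return None
        has_driver = any(not j.is_dead and not j.is_dashboard_job for j in jobs)
        if has_driver:
            self._no_driver_since = None  # the timer resets when a driver reappears
            return None
        now = self._clock()
        if self._no_driver_since is None:
            self._no_driver_since = now  # first reconcile loop with no driver
        if now - self._no_driver_since < self._timeout_s:
            return None
        return {"metadata": {"annotations": {NO_DRIVER_ANNOTATION: "true"}}}
```

Injecting the clock keeps the timer testable without sleeping; a reconcile loop would call `evaluate` once per iteration and apply the returned body to the RayCluster CR when it is not `None`.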
