[Autoscaler] Add noDriverTimeoutSeconds for cluster termination with KubeRay by win5923 · Pull Request #63465 · ray-project/ray

win5923 · 2026-05-18T18:50:40Z

Description

Terminate a cluster managed by the V2 autoscaler when no user driver is attached. Related to ray-project/kuberay#4815

When autoscalerOptions.noDriverTimeoutSeconds is set, the V2 autoscaler evaluates a no-driver predicate every reconcile loop and, when it fires, patches a single annotation on the RayCluster CR:

metadata:
  annotations:
    ray.io/no-driver-ttl-expired: "true"

The KubeRay operator observes the condition and decides the terminal action (deleting the RayCluster).

A cluster is eligible for termination only when both of the following hold:

1. No active user driver is attached.
2. Condition 1 has held continuously for at least noDriverTimeoutSeconds.

Note that detached actors do not count as a driver: a cluster running only detached actors is still eligible for termination.
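As a rough illustration of condition 1, the check below treats a cluster as driver-free when every job record is either dead or belongs to an internal namespace. `JobRecord`, `INTERNAL_NAMESPACE_PREFIX`, and `has_active_user_driver` are hypothetical names for this sketch, not the PR's code; the actual implementation reads job info from the GCS and may filter differently.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumption for this sketch: internal drivers (such as dashboard jobs) live in
# namespaces that share a reserved prefix. The real filter may differ.
INTERNAL_NAMESPACE_PREFIX = "_ray_internal_"


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical, simplified view of one job entry reported by the GCS."""

    job_id: str
    namespace: str
    is_dead: bool


def has_active_user_driver(jobs: Iterable[JobRecord]) -> bool:
    """Return True if at least one live, non-internal driver is attached."""
    return any(
        not job.is_dead and not job.namespace.startswith(INTERNAL_NAMESPACE_PREFIX)
        for job in jobs
    )


if __name__ == "__main__":
    jobs = [
        JobRecord("01000000", "_ray_internal_dashboard", is_dead=False),
        JobRecord("02000000", "default_namespace", is_dead=True),
    ]
    # Only an internal driver and a finished driver remain, so the cluster
    # counts as having no driver, even if detached actors are still running.
    print(has_active_user_driver(jobs))
```

Detached actors never appear in this check at all, which is why a cluster running only detached actors remains eligible.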

Changes

This PR adds autoscalerOptions.noDriverTimeoutSeconds. The decision lives on KubeRayProvider: it tracks how long the cluster has had no driver attached and, once the timeout is exceeded, dispatches an annotation for KubeRay to terminate the cluster, freeing the head pod and any reserved capacity that would otherwise linger.

1. New autoscalerOptions.noDriverTimeoutSeconds field, V2 + KubeRay only
   - Existing CRs and existing V1 / non-KubeRay deployments see no behavior change.
   - Only KubeRayProvider reads the field; leaving it unset disables the feature.

2. No-driver decision lives on KubeRayProvider
   - Evaluated against gcs_client.get_all_job_info(...), filtering out Ray dashboard jobs. Fails closed.
   - The provider records when the cluster was first seen with no driver, and dispatches once that state has held for noDriverTimeoutSeconds. The timer resets if a driver reappears.

3. Dispatch: single annotation on the RayCluster CR
   - The reconciler calls evaluate_no_driver_termination, which patches the RayCluster CR with ray.io/no-driver-ttl-expired: "true". The KubeRay Operator implementation is covered in ray-project/kuberay#4815.
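The timer and dispatch steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in KubeRayProvider: `NoDriverTimer`, `should_terminate`, and `build_annotation_patch` are hypothetical names, and the merge-patch shape of the body is assumed. Only the annotation key and value come from the description.

```python
from typing import Optional

# Annotation key taken from the PR description.
NO_DRIVER_ANNOTATION = "ray.io/no-driver-ttl-expired"


class NoDriverTimer:
    """Tracks how long the cluster has continuously had no driver (sketch)."""

    def __init__(self, timeout_s: Optional[float]) -> None:
        # None models an unset noDriverTimeoutSeconds: the feature is disabled.
        self._timeout_s = timeout_s
        self._no_driver_since: Optional[float] = None

    def should_terminate(self, has_driver: bool, now: float) -> bool:
        """Call once per reconcile loop; `now` is a timestamp in seconds."""
        if self._timeout_s is None:
            return False
        if has_driver:
            # A driver reappeared, so the timer resets.
            self._no_driver_since = None
            return False
        if self._no_driver_since is None:
            # First loop in which no driver was seen.
            self._no_driver_since = now
        return now - self._no_driver_since >= self._timeout_s


def build_annotation_patch() -> dict:
    """Assumed merge-patch body that sets the single annotation on the CR."""
    return {"metadata": {"annotations": {NO_DRIVER_ANNOTATION: "true"}}}


if __name__ == "__main__":
    timer = NoDriverTimer(timeout_s=30)
    # (timestamp, has_driver) pairs; the driver seen at t=20 resets the timer.
    for t, has_driver in [(0, False), (20, True), (25, False), (50, False), (55, False)]:
        if timer.should_terminate(has_driver, now=t):
            print(f"t={t}s: dispatch {build_annotation_patch()}")
```

Keeping the timer state as a single "first seen with no driver" timestamp means an intermittent driver restarts the countdown from zero, which matches the reset behavior described in item 2.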

Related issues

Closes #63452

win5923 · 2026-05-25T17:42:19Z

E2E testing:

/home/ubuntu/ray
$ ./build-image.sh  ray
$ kind create cluster
$ kind load docker-image cr.ray.io/rayproject/ray:nightly

/home/ubuntu/kuberay
$ make -C ray-operator install
$ make -C ray-operator build
$ ./ray-operator/bin/manager -leader-election-namespace default -use-kubernetes-proxy

Apply testing yaml:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  # Use the Ray nightly or Ray version >= 2.10.0 and KubeRay 1.1.0 or later for autoscaler v2.
  rayVersion: "2.52.0"
  enableInTreeAutoscaling: true
  autoscalerOptions:
    version: v2
    upscalingMode: Default
    idleTimeoutSeconds: 10
    noDriverTimeoutSeconds: 30
    image: cr.ray.io/rayproject/ray:nightly
    imagePullPolicy: IfNotPresent
    # Optionally specify the Autoscaler container's securityContext.
    securityContext: {}
    env: []
    # AUTOSCALER_UPDATE_INTERVAL_S is used to control how often the Autoscaler container checks the cluster state
    # and decides whether to request scaling the cluster up or down. The default value is 5 seconds.
    # - name: AUTOSCALER_UPDATE_INTERVAL_S
    #   value: "5"
    # RAY_LOGGER_LEVEL is used to control autoscaler and ray logging verbosity (info|debug|warning|error|critical). The default value is "info".
    # - name: RAY_LOGGER_LEVEL
    #   value: "debug"
    envFrom: []
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  # Ray head pod template
  headGroupSpec:
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
    # Pod template
    template:
      spec:
        containers:
        # The Ray head container
        - name: ray-head
          image: cr.ray.io/rayproject/ray:nightly
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "1"
              memory: "5Gi"
            requests:
              cpu: "1"
              memory: "2Gi"
          volumeMounts:
          - mountPath: /home/ray/samples
            name: ray-example-configmap
        volumes:
        - name: ray-example-configmap
          configMap:
            name: ray-example
            defaultMode: 0777
            items:
            - key: detached_actor.py
              path: detached_actor.py
            - key: terminate_detached_actor.py
              path: terminate_detached_actor.py
  workerGroupSpecs:
  # the Pod replicas in this group typed worker
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    groupName: small-group
    rayStartParams: {}
    # Pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: cr.ray.io/rayproject/ray:nightly
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray
    import sys

    @ray.remote(num_cpus=1)
    class Actor:
      pass

    ray.init(namespace="default_namespace")
    Actor.options(name=sys.argv[1], lifetime="detached").remote()

  terminate_detached_actor.py: |
    import ray
    import sys

    ray.init(namespace="default_namespace")
    detached_actor = ray.get_actor(sys.argv[1])
    ray.kill(detached_actor)

RayCluster idle termination without an active driver

If no active Ray driver is detected for more than 30 seconds, the cluster will be marked as idle and deleted by KubeRay.

$ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
$ kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor1

# After 30 seconds, the RayCluster is deleted
$ kubectl get raycluster
NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE

Active driver prevents cluster termination

$ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
$ kubectl exec -it $HEAD_POD -- python -c "import ray; import time; ray.init(); print(ray.cluster_resources()); time.sleep(30)"

$ kubectl get raycluster
NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-autoscaler                                         1      2Gi      0      ready    63s


# After the driver exits, wait for the configured timeout (30 seconds). The RayCluster will then be deleted.
$ kubectl get raycluster

win5923 · 2026-05-25T17:46:12Z

cc @rueian to take a look if you have time :)

win5923 · 2026-05-27T16:39:52Z

And can we put the entire implementation in the kuberay provider? i.e. try not to make changes to other files.

Nice idea, I’ve moved the implementation into the KubeRay provider and simplified the overall logic.

win5923 · 2026-05-27T18:27:35Z

+    def _refresh_idle_termination_seconds(self) -> None:
+        """Reads idleTerminationSeconds from the cached RayCluster spec.
+
+        Trusts the value as accepted by KubeRay's admission webhook
+        """


Here Ray unconditionally trusts the value from KubeRay; I think we only need to do the validation in KubeRay.
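A minimal sketch of that trust model, assuming the provider holds the RayCluster CR as a plain dict: the value is read from spec.autoscalerOptions as-is, and a missing field simply means the feature is off. `read_autoscaler_option` is a hypothetical helper for illustration, not a function from the PR.

```python
from typing import Any, Mapping, Optional


def read_autoscaler_option(ray_cluster: Mapping[str, Any], name: str) -> Optional[Any]:
    """Return spec.autoscalerOptions[name] from a RayCluster CR dict, or None if unset.

    No range or type validation happens here: the value is taken as already
    validated on the KubeRay side.
    """
    spec = ray_cluster.get("spec") or {}
    options = spec.get("autoscalerOptions") or {}
    return options.get(name)


if __name__ == "__main__":
    cr = {"spec": {"autoscalerOptions": {"version": "v2", "noDriverTimeoutSeconds": 30}}}
    print(read_autoscaler_option(cr, "noDriverTimeoutSeconds"))
    # An unset field yields None, which a caller can treat as "feature disabled".
    print(read_autoscaler_option({"spec": {}}, "noDriverTimeoutSeconds"))
```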

rueian · 2026-05-28T22:17:12Z

+            if node.status == NodeStatus.DEAD:
+                continue
+            if is_head_node(node):
+                continue


What if a detached actor is running on the head?

win5923 · 2026-05-29

I skip the head node because Ray's internal detached actors are pinned there and hold its idle_duration_ms at 0 permanently. If we didn't skip the head, the cluster-idle predicate would never fire and the feature would be a no-op.

We exclude at the node level rather than filtering actors because the raylet cannot distinguish a Ray-internal detached actor from a user actor; both are just leased workers. Filtering specific internal actors out of the busy calculation would be fragile, since it would have to enumerate every head-pinned internal actor (_StatsActor, Serve controller/proxy, JobSupervisor, dashboard API actors, the autoscaler itself, …).

Excluding the head node from the idle check is simpler and correct: head idleness is not a meaningful scale-down signal in the first place, since the head is never a terminable worker. Pending resource demand and per-worker idle already cover the cases we care about.

Please let me know if I've said anything incorrect.

rueian · 2026-05-29

But when a user's detached actor is running on the head, you shouldn't terminate the cluster, right?

win5923 · 2026-05-31

I checked the codebase, and I think detecting a user actor on the head would mean telling user and internal actors apart, and Ray has no reliable signal for that. There's no is_internal flag, and ray_namespace/name are user-controllable with no shared convention for internal actors.

A namespace denylist fails both ways:

it misses internal actors we didn't enumerate

it can misclassify a user actor that shares a namespace
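The two failure modes listed above can be made concrete with a toy classifier. Everything here is hypothetical: the denylist contents, the `ActorRecord` type, and the `is_internal` ground-truth field (which, as noted, Ray does not expose) exist only to show where a namespace denylist goes wrong.

```python
from dataclasses import dataclass

# Hypothetical denylist of namespaces assumed to hold only internal actors.
INTERNAL_NAMESPACES = frozenset({"_ray_internal_dashboard", "serve"})


@dataclass(frozen=True)
class ActorRecord:
    """Toy actor entry; `is_internal` is ground truth that Ray does not provide."""

    name: str
    namespace: str
    is_internal: bool


def classified_as_internal(actor: ActorRecord) -> bool:
    """Denylist-based guess: internal if and only if the namespace is listed."""
    return actor.namespace in INTERNAL_NAMESPACES


if __name__ == "__main__":
    # Failure 1: an internal actor whose namespace was never enumerated.
    missed = ActorRecord("some_internal_actor", "_unlisted_internal_ns", is_internal=True)
    # Failure 2: a user actor that happens to share a listed namespace.
    shadowed = ActorRecord("my_model", "serve", is_internal=False)
    for actor in (missed, shadowed):
        guess = classified_as_internal(actor)
        print(f"{actor.name}: guessed internal={guess}, actually internal={actor.is_internal}")
```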

Note that we don't distinguish internal vs. user actors on workers either: the worker idle check just treats any leased worker as busy. That works on workers because they have no permanently resident internal actors, so user work there is caught naturally.

For example, Ray Data's _StatsActor is pinned to the driver's node, which is the head in KubeRay:

ray/python/ray/data/_internal/stats.py

Lines 861 to 887 in 750ef4e

def get_or_create_stats_actor() -> ActorHandle[_StatsActor]:
    """Each cluster will contain exactly 1 _StatsActor. This function
    returns the current _StatsActor handle, or create a new one if one
    does not exist in the connected cluster. The _StatsActor is pinned on
    on driver process' node.
    """
    if ray._private.worker._global_node is None:
        raise RuntimeError(
            "Global node is not initialized. Driver might be not connected to Ray."
        )

    current_cluster_id = ray._private.worker._global_node.cluster_id

    logger.debug(f"Stats Actor located on cluster_id={current_cluster_id}")

    # so it fate-shares with the driver.
    label_selector = {
        ray._raylet.RAY_NODE_ID_KEY: ray.get_runtime_context().get_node_id()
    }

    return _StatsActor.options(
        name=STATS_ACTOR_NAME,
        namespace=STATS_ACTOR_NAMESPACE,
        get_if_exists=True,
        lifetime="detached",
        label_selector=label_selector,
    ).remote()

Trade-off: keep the head excluded and rely on the worker idle signal. The only blind spot is a user actor pinned to the head. I think this needs to be stated explicitly in the Ray docs so users know not to pin detached actors to the head when cluster-level idle termination is enabled.

rueian · 2026-06-02

Can we rename this feature to something like noDriverTerminationTimeout, and only check whether there is a driver running?

win5923 · 2026-06-02

For sure! I've updated the implementation and description. d5837c6

This is a great suggestion. It avoids having to reason about head vs. worker nodes at all, and it matches what we actually care about: whether the cluster has any running Ray jobs.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 96cc40b.

rueian

Thanks @win5923! LGTM!

rueian · 2026-06-05T20:09:54Z

@edoakes, please review and merge this when you have a chance. 🙏

[Autoscaler] Add idleTerminationSeconds for cluster-level idle termination

3298172

Signed-off-by: win5923 <ken89@kimo.com>

win5923 force-pushed the autoscaler-terminate-idle branch from cad0af4 to 3298172 Compare May 19, 2026 15:47

Skip head from cluster-idle raylet check; filter Ray dashboard drivers

7f1847d

Signed-off-by: win5923 <ken89@kimo.com>

win5923 force-pushed the autoscaler-terminate-idle branch from 9bd2e92 to 7f1847d Compare May 20, 2026 17:59

win5923 added 2 commits May 20, 2026 18:07

Allow idleTerminationSeconds to be greater than or equal to idleTimeoutSeconds

f6e0b4d

Signed-off-by: win5923 <ken89@kimo.com>

Merge branch 'master' into autoscaler-terminate-idle

4cf4b5e

win5923 marked this pull request as ready for review May 24, 2026 16:39

win5923 requested a review from a team as a code owner May 24, 2026 16:39

cursor Bot reviewed May 24, 2026

Comment thread python/ray/autoscaler/v2/scheduler.py Outdated

[Autoscaler] Drop scheduler-side threshold; gate cluster-idle by wall clock in reconciler

fe5961f

Signed-off-by: win5923 <ken89@kimo.com>

cursor Bot reviewed May 24, 2026

Comment thread python/ray/autoscaler/v2/scheduler.py Outdated

ray-gardener Bot added the core, community-contribution, and kubernetes labels May 24, 2026

Merge branch 'master' into autoscaler-terminate-idle

393e2d3

cursor Bot reviewed May 25, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/reconciler.py Outdated

rueian reviewed May 26, 2026

Comment thread python/ray/autoscaler/v2/scheduler.py Outdated

win5923 force-pushed the autoscaler-terminate-idle branch 3 times, most recently from 208310e to 318cd5e Compare May 27, 2026 16:26

win5923 force-pushed the autoscaler-terminate-idle branch from 318cd5e to 2f35617 Compare May 27, 2026 16:57

cursor Bot reviewed May 27, 2026

Comment thread python/ray/autoscaler/_private/kuberay/autoscaling_config.py Outdated

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py

Move cluster-idle decision into KubeRayProvider

6eefbb3

Signed-off-by: win5923 <ken89@kimo.com>

win5923 force-pushed the autoscaler-terminate-idle branch from 2f35617 to 6eefbb3 Compare May 27, 2026 17:20

rueian reviewed May 27, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py Outdated

cursor Bot reviewed May 27, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py Outdated

rueian reviewed May 27, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/reconciler.py Outdated

rueian reviewed May 27, 2026

Comment thread python/ray/autoscaler/ray-schema.json Outdated

Remove idle_termination_seconds in ray-schema

6057d22

Signed-off-by: win5923 <ken89@kimo.com>

win5923 commented May 27, 2026

cursor Bot reviewed May 27, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py Outdated

[Autoscaler] Privatize cluster-idle logic to KubeRayProvider

4d44f81

Signed-off-by: win5923 <ken89@kimo.com>

cursor Bot reviewed May 27, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py Outdated

Reuse ray_state for cluster-idle eval

e1f48e6

Signed-off-by: win5923 <ken89@kimo.com>

rueian reviewed May 28, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py Outdated

Remove check worker node count

8001531

Signed-off-by: win5923 <ken89@kimo.com>

cursor Bot reviewed May 29, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py Outdated

Reslove comments

3af39b1

Signed-off-by: win5923 <ken89@kimo.com>

win5923 changed the title ~~[Autoscaler] Add idleTerminationSeconds for cluster-level idle termination~~ May 29, 2026

win5923 changed the title ~~[Autoscaler] Add idleTerminationSeconds for cluster-level idle termination with KubeRay~~ Jun 2, 2026

[Autoscaler] Add noDriverTimeoutSeconds for cluster termination with KubeRay

d5837c6

Signed-off-by: win5923 <ken89@kimo.com>

win5923 force-pushed the autoscaler-terminate-idle branch from f476aec to d5837c6 Compare June 2, 2026 12:21

rueian reviewed Jun 3, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py Outdated

Filter internal drivers by namespace

96cc40b

Signed-off-by: win5923 <ken89@kimo.com>

cursor Bot reviewed Jun 4, 2026

Comment thread python/ray/autoscaler/v2/instance_manager/cloud_providers/kuberay/cloud_provider.py

Reset no-driver timer on intermittent drivers

b7f173c

Signed-off-by: win5923 <ken89@kimo.com>

rueian approved these changes Jun 5, 2026

rueian added the go label (add ONLY when ready to merge, run all tests) Jun 5, 2026

edoakes merged commit a92b357 into ray-project:master Jun 5, 2026
7 of 8 checks passed

win5923 deleted the autoscaler-terminate-idle branch June 6, 2026 00:59

win5923 mentioned this pull request Jun 19, 2026

[ray-operator] Add noDriverTimeoutSeconds for idle RayCluster termination ray-project/kuberay#4932

Open

4 tasks
