
[Autoscaler] Add noDriverTimeoutSeconds for cluster termination with KubeRay#63465

Merged
edoakes merged 15 commits into
ray-project:masterfrom
win5923:autoscaler-terminate-idle
Jun 5, 2026

Conversation

@win5923

@win5923 win5923 commented May 18, 2026

Member

Description

Terminate a cluster managed by the V2 autoscaler when no user driver is attached. Related to ray-project/kuberay#4815

When autoscalerOptions.noDriverTimeoutSeconds is set, the V2 autoscaler evaluates a no-driver predicate every reconcile loop and, when it fires, patches a single annotation on the RayCluster CR:

metadata:
  annotations:
    ray.io/no-driver-ttl-expired: "true"

The KubeRay operator observes this annotation and decides the terminal action (deleting the RayCluster).
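
The PR description does not show the patch body itself. As a minimal sketch, assuming the annotation is written with a JSON merge patch, the body only needs to carry the one key, so other annotations and the spec stay untouched. The helper names `build_no_driver_patch` and `apply_merge_patch` are hypothetical and exist only to illustrate the effect locally; the real provider sends its patch through the Kubernetes API.

```python
import json

# Annotation key and value are taken from the PR description.
NO_DRIVER_ANNOTATION = "ray.io/no-driver-ttl-expired"


def build_no_driver_patch() -> dict:
    """Return a merge-patch body that sets only the no-driver annotation."""
    return {"metadata": {"annotations": {NO_DRIVER_ANNOTATION: "true"}}}


def apply_merge_patch(resource: dict, patch: dict) -> dict:
    """Apply a merge patch to a nested dict without mutating the input.

    Simplified merge-patch semantics: dicts merge recursively, None deletes
    a key, and any other value replaces the existing one.
    """
    result = dict(resource)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        elif isinstance(value, dict):
            base = result.get(key) if isinstance(result.get(key), dict) else {}
            result[key] = apply_merge_patch(base, value)
        else:
            result[key] = value
    return result


# A trimmed-down RayCluster object with an unrelated, pre-existing annotation.
cluster = {
    "metadata": {"name": "raycluster-autoscaler", "annotations": {"team": "ml"}},
    "spec": {"rayVersion": "2.52.0"},
}
patched = apply_merge_patch(cluster, build_no_driver_patch())
print(json.dumps(patched["metadata"]["annotations"], sort_keys=True))
```

Keeping the patch to a single annotation means the autoscaler never has to decide what "terminate" means; that choice stays with the operator.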

A cluster is eligible for termination only when both of the following hold:

  1. No active user driver is attached.
  2. Condition 1 has held continuously for at least noDriverTimeoutSeconds.

Note that detached actors do not count as a driver; a cluster running only detached actors is still eligible for termination.
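
The two conditions above amount to a resettable timer. The following is a rough sketch of that logic, not the provider's actual code: `NoDriverTimer` and `observe` are hypothetical names, and the clock value is passed in explicitly so the behavior is easy to follow.

```python
from typing import Optional


class NoDriverTimer:
    """Tracks how long the cluster has continuously had no driver (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        # Time at which the cluster was first seen without a driver, if any.
        self._no_driver_since: Optional[float] = None

    def observe(self, has_driver: bool, now: float) -> bool:
        """Record one reconcile-loop observation; return True once the timeout has elapsed."""
        if has_driver:
            # A driver reappeared, so the continuous no-driver window starts over.
            self._no_driver_since = None
            return False
        if self._no_driver_since is None:
            self._no_driver_since = now
        return now - self._no_driver_since >= self.timeout_s


timer = NoDriverTimer(timeout_s=30)
# (time in seconds, whether a user driver is attached)
observations = [(0, False), (20, False), (25, True), (30, False), (55, False), (60, False)]
results = [timer.observe(has_driver, now) for now, has_driver in observations]
print(results)  # [False, False, False, False, False, True]
```

In this trace the driver seen at t=25 resets the window, so the predicate fires at t=60 rather than t=30. A real loop would likely feed `now` from a monotonic clock so that wall-clock adjustments cannot shorten or extend the window.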

Changes

This PR adds autoscalerOptions.noDriverTimeoutSeconds. The decision lives on KubeRayProvider: it tracks how long the cluster has had no driver attached and, once the timeout is exceeded, dispatches an annotation for KubeRay to terminate the cluster, freeing the head pod and any reserved capacity that would otherwise linger.

  1. New autoscalerOptions.noDriverTimeoutSeconds field, V2 + KubeRay only
     • Existing CRs and existing V1 / non-KubeRay deployments see no behavior change.
     • The field is read only by KubeRayProvider; leaving it unset disables the feature.
  2. No-driver decision lives on KubeRayProvider
     • Evaluated against gcs_client.get_all_job_info(...), filtering out Ray dashboard jobs. Fails closed.
     • The provider records when the cluster was first seen with no driver, and dispatches once that has held for noDriverTimeoutSeconds. The timer resets if a driver reappears.
  3. Dispatch: single annotation on the RayCluster CR

The reconciler calls evaluate_no_driver_termination, which patches the RayCluster CR with ray.io/no-driver-ttl-expired: "true". The KubeRay Operator implementation is covered in ray-project/kuberay#4815.
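
The no-driver check described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the merged code: `JobRecord` is a hypothetical stand-in for the entries returned by the GCS job query, dashboard jobs are assumed to be recognizable by an internal namespace prefix, and "fails closed" is read here as "if the query fails, do not report the cluster as driverless".

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumption: dashboard-internal jobs can be recognized by a namespace prefix.
INTERNAL_NAMESPACE_PREFIX = "_ray_internal_"


@dataclass
class JobRecord:
    """Hypothetical stand-in for one entry returned by the GCS job query."""

    namespace: str
    is_dead: bool


def has_active_user_driver(jobs: Iterable[JobRecord]) -> bool:
    """True if any live job belongs to a user rather than to the dashboard."""
    return any(
        not job.is_dead and not job.namespace.startswith(INTERNAL_NAMESPACE_PREFIX)
        for job in jobs
    )


def no_driver_attached(fetch_jobs: Callable[[], Iterable[JobRecord]]) -> bool:
    """Return True only when the query succeeds and shows no live user driver."""
    try:
        jobs = list(fetch_jobs())
    except Exception:
        # Assumed meaning of "fails closed": on error, never report "no driver",
        # so a flaky query cannot lead to a cluster being terminated.
        return False
    return not has_active_user_driver(jobs)


only_dashboard = [JobRecord("_ray_internal_dashboard", is_dead=False)]
with_user_job = only_dashboard + [JobRecord("default_namespace", is_dead=False)]
print(no_driver_attached(lambda: only_dashboard))  # True
print(no_driver_attached(lambda: with_user_job))   # False
```

The result of `no_driver_attached` would then feed a timer, so a single driverless observation never triggers termination on its own.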

Related issues

Closes #63452

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

…ation

Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 force-pushed the autoscaler-terminate-idle branch from cad0af4 to 3298172 Compare May 19, 2026 15:47
@win5923 win5923 force-pushed the autoscaler-terminate-idle branch from 9bd2e92 to 7f1847d Compare May 20, 2026 17:59
@win5923 win5923 marked this pull request as ready for review May 24, 2026 16:39
@win5923 win5923 requested a review from a team as a code owner May 24, 2026 16:39
Comment thread python/ray/autoscaler/v2/scheduler.py Outdated
… clock in reconciler

Signed-off-by: win5923 <ken89@kimo.com>
Comment thread python/ray/autoscaler/v2/scheduler.py Outdated
@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community kubernetes labels May 24, 2026
Comment thread python/ray/autoscaler/v2/instance_manager/reconciler.py Outdated
@win5923

win5923 commented May 25, 2026

Member Author

E2E testing:

/home/ubuntu/ray
$ ./build-image.sh  ray
$ kind create cluster
$ kind load docker-image cr.ray.io/rayproject/ray:nightly

/home/ubuntu/kuberay
$ make -C ray-operator install
$ make -C ray-operator build
$ ./ray-operator/bin/manager -leader-election-namespace default -use-kubernetes-proxy

Apply testing yaml:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  # Use the Ray nightly or Ray version >= 2.10.0 and KubeRay 1.1.0 or later for autoscaler v2.
  rayVersion: "2.52.0"
  enableInTreeAutoscaling: true
  autoscalerOptions:
    version: v2
    upscalingMode: Default
    idleTimeoutSeconds: 10
    noDriverTimeoutSeconds: 30
    image: cr.ray.io/rayproject/ray:nightly
    imagePullPolicy: IfNotPresent
    # Optionally specify the Autoscaler container's securityContext.
    securityContext: {}
    env: []
    # AUTOSCALER_UPDATE_INTERVAL_S is used to control how often the Autoscaler container checks the cluster state
    # and decides whether to request scaling the cluster up or down. The default value is 5 seconds.
    # - name: AUTOSCALER_UPDATE_INTERVAL_S
    #   value: "5"
    # RAY_LOGGER_LEVEL is used to control autoscaler and ray logging verbosity (info|debug|warning|error|critical). The default value is "info".
    # - name: RAY_LOGGER_LEVEL
    #   value: "debug"
    envFrom: []
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  # Ray head pod template
  headGroupSpec:
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
    # Pod template
    template:
      spec:
        containers:
        # The Ray head container
        - name: ray-head
          image: cr.ray.io/rayproject/ray:nightly
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "1"
              memory: "5Gi"
            requests:
              cpu: "1"
              memory: "2Gi"
          volumeMounts:
          - mountPath: /home/ray/samples
            name: ray-example-configmap
        volumes:
        - name: ray-example-configmap
          configMap:
            name: ray-example
            defaultMode: 0777
            items:
            - key: detached_actor.py
              path: detached_actor.py
            - key: terminate_detached_actor.py
              path: terminate_detached_actor.py
  workerGroupSpecs:
  # the Pod replicas in this group typed worker
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    groupName: small-group
    rayStartParams: {}
    # Pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: cr.ray.io/rayproject/ray:nightly
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray
    import sys

    @ray.remote(num_cpus=1)
    class Actor:
      pass

    ray.init(namespace="default_namespace")
    Actor.options(name=sys.argv[1], lifetime="detached").remote()

  terminate_detached_actor.py: |
    import ray
    import sys

    ray.init(namespace="default_namespace")
    detached_actor = ray.get_actor(sys.argv[1])
    ray.kill(detached_actor)

RayCluster idle termination without an active driver

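If no active Ray driver is detected for 30 seconds (the configured noDriverTimeoutSeconds), the cluster is marked as expired and deleted by KubeRay. The detached actor created below does not count as a driver, so it does not keep the cluster alive.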

$ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
$ kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor1

# After 30 seconds, the RayCluster is deleted
$ kubectl get raycluster
NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE

Active driver prevents cluster termination

$ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
$ kubectl exec -it $HEAD_POD -- python -c "import ray; import time; ray.init(); print(ray.cluster_resources()); time.sleep(30)"

$ kubectl get raycluster
NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-autoscaler                                         1      2Gi      0      ready    63s


# After the driver exits, wait for the configured timeout (30 seconds). The RayCluster will then be deleted
$ kubectl get raycluster
@win5923

win5923 commented May 25, 2026

Member Author

cc @rueian to take a look if you have time :)

Comment thread python/ray/autoscaler/v2/scheduler.py Outdated
@win5923 win5923 force-pushed the autoscaler-terminate-idle branch 3 times, most recently from 208310e to 318cd5e Compare May 27, 2026 16:26
@win5923

win5923 commented May 27, 2026

Member Author

> And can we put the entire implementation in the kuberay provider? i.e. try not to make changes to other files.

Nice idea, I’ve moved the implementation into the KubeRay provider and simplified the overall logic.

@win5923 win5923 force-pushed the autoscaler-terminate-idle branch from 318cd5e to 2f35617 Compare May 27, 2026 16:57
Comment thread python/ray/autoscaler/_private/kuberay/autoscaling_config.py Outdated
Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 force-pushed the autoscaler-terminate-idle branch from 2f35617 to 6eefbb3 Compare May 27, 2026 17:20
Comment thread python/ray/autoscaler/v2/instance_manager/reconciler.py Outdated
Comment thread python/ray/autoscaler/ray-schema.json Outdated
Signed-off-by: win5923 <ken89@kimo.com>
Comment on lines +487 to +491
def _refresh_idle_termination_seconds(self) -> None:
    """Reads idleTerminationSeconds from the cached RayCluster spec.

    Trusts the value as accepted by KubeRay's admission webhook
    """

@win5923 win5923 May 27, 2026

Member Author

Here Ray unconditionally trusts the value from KubeRay; I think validation only needs to happen in KubeRay.
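
A minimal sketch of what "trusting the value" can look like on the Ray side, using the field's final name, noDriverTimeoutSeconds. The function name `read_no_driver_timeout` is hypothetical, and the RayCluster object is represented as a plain dict; no range or type checks are performed because validation is assumed to have happened in KubeRay.

```python
from typing import Any, Mapping, Optional


def read_no_driver_timeout(raycluster: Mapping[str, Any]) -> Optional[int]:
    """Return noDriverTimeoutSeconds from a RayCluster dict, or None if unset.

    None means the feature is disabled. The value is returned as-is, without
    validation, on the assumption that KubeRay already accepted it.
    """
    # `or {}` also covers an explicit `autoscalerOptions: null` in the spec.
    options = raycluster.get("spec", {}).get("autoscalerOptions") or {}
    return options.get("noDriverTimeoutSeconds")


enabled = {"spec": {"autoscalerOptions": {"version": "v2", "noDriverTimeoutSeconds": 30}}}
disabled = {"spec": {"autoscalerOptions": {"version": "v2"}}}
print(read_no_driver_timeout(enabled))   # 30
print(read_no_driver_timeout(disabled))  # None
```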

Signed-off-by: win5923 <ken89@kimo.com>
Signed-off-by: win5923 <ken89@kimo.com>
if node.status == NodeStatus.DEAD:
    continue
if is_head_node(node):
    continue

Contributor

What if a detached actor is running on the head?

Member Author

I skip the head node because Ray's internal detached actors are pinned there and hold its idle_duration_ms at 0 permanently. If we didn't skip the head, the cluster-idle predicate would never fire and the feature would be a no-op.

The reason to exclude at the node level rather than filter actors is that the raylet cannot distinguish a Ray-internal detached actor from a user actor; both are just leased workers. Filtering specific internal actors out of the busy calculation would be fragile, since it would have to enumerate every head-pinned internal actor (_StatsActor, Serve controller/proxy, JobSupervisor, dashboard API actors, the autoscaler itself, …).

Excluding the head node from the idle check is simpler and correct: head idleness is not a meaningful scale-down signal in the first place, since the head is never a terminable worker. Pending resource demand and per-worker idle already cover the cases we care about.

Please let me know if I've said anything incorrect.

Contributor

But when a user's detached actor is running on the head, you shouldn't terminate the cluster, right?

@win5923 win5923 May 31, 2026

Member Author

I checked the codebase and think detecting a user actor on the head would mean telling user and internal actors apart, and Ray has no reliable signal for that. There's no is_internal flag, and ray_namespace/name are user-controllable with no shared convention for internal actors.

A namespace denylist fails both ways:

  • it misses internal actors we didn't enumerate
  • it can misclassify a user actor that shares a namespace
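
Both failure modes can be shown with a toy example. Everything here is hypothetical: the `ActorInfo` record, the denylist, and the namespaces are made up for illustration, and the `is_internal` field is ground truth that only this example knows (Ray exposes no such flag, as noted above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorInfo:
    name: str
    namespace: str
    is_internal: bool  # ground truth, known only to this example


# Hypothetical denylist; neither it nor the namespaces below come from Ray.
DENYLIST = {"internal_ns"}


def classified_as_internal(actor: ActorInfo) -> bool:
    """Namespace-denylist heuristic: internal if and only if the namespace is listed."""
    return actor.namespace in DENYLIST


actors = [
    ActorInfo("listed_internal", "internal_ns", is_internal=True),
    ActorInfo("unlisted_internal", "other_internal_ns", is_internal=True),
    ActorInfo("user_in_shared_ns", "internal_ns", is_internal=False),
    ActorInfo("user_actor", "default_namespace", is_internal=False),
]

# Actors for which the heuristic disagrees with the ground truth.
misclassified = [a.name for a in actors if classified_as_internal(a) != a.is_internal]
print(misclassified)  # ['unlisted_internal', 'user_in_shared_ns']
```

The first miss would keep the cluster alive forever; the second would let it be terminated while user work is still running.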

Note that we don't distinguish internal from user actors on workers either; the worker idle check just treats any leased worker as busy. That works on workers because they have no permanently resident internal actors, so user work there is caught naturally.

For example, Ray Data's _StatsActor is pinned to the driver's node, which is the head in KubeRay:

def get_or_create_stats_actor() -> ActorHandle[_StatsActor]:
    """Each cluster will contain exactly 1 _StatsActor. This function
    returns the current _StatsActor handle, or create a new one if one
    does not exist in the connected cluster. The _StatsActor is pinned
    on driver process' node.
    """
    if ray._private.worker._global_node is None:
        raise RuntimeError(
            "Global node is not initialized. Driver might be not connected to Ray."
        )
    current_cluster_id = ray._private.worker._global_node.cluster_id
    logger.debug(f"Stats Actor located on cluster_id={current_cluster_id}")
    # so it fate-shares with the driver.
    label_selector = {
        ray._raylet.RAY_NODE_ID_KEY: ray.get_runtime_context().get_node_id()
    }
    return _StatsActor.options(
        name=STATS_ACTOR_NAME,
        namespace=STATS_ACTOR_NAMESPACE,
        get_if_exists=True,
        lifetime="detached",
        label_selector=label_selector,
    ).remote()

Trade-off: keep the head excluded and rely on the worker idle signal. The only blind spot is a user actor pinned to the head. I think this needs to be called out explicitly in the Ray docs so users know not to pin detached actors to the head when cluster-level idle termination is enabled.

Contributor

Can we rename this feature to something like noDriverTerminationTimeout, and we only check if there is a driver running?

@win5923 win5923 Jun 2, 2026

Member Author

For sure! I've updated the implementation and description. d5837c6

This is a great suggestion. It avoids having to reason about head vs. worker nodes at all, and it matches what we actually care about: whether the cluster has any running Ray jobs.

Signed-off-by: win5923 <ken89@kimo.com>
Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 changed the title [Autoscaler] Add idleTerminationSeconds for cluster-level idle termination May 29, 2026
@win5923 win5923 changed the title [Autoscaler] Add idleTerminationSeconds for cluster-level idle termination with KubeRay Jun 2, 2026
…KubeRay

Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 force-pushed the autoscaler-terminate-idle branch from f476aec to d5837c6 Compare June 2, 2026 12:21
Signed-off-by: win5923 <ken89@kimo.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 96cc40b.

Signed-off-by: win5923 <ken89@kimo.com>

@rueian rueian left a comment

Contributor

Thanks @win5923! LGTM!

@rueian

rueian commented Jun 5, 2026

Contributor

@edoakes, please review and merge this when you have a chance. 🙏

@rueian rueian added the go add ONLY when ready to merge, run all tests label Jun 5, 2026
@edoakes edoakes merged commit a92b357 into ray-project:master Jun 5, 2026
7 of 8 checks passed
@win5923 win5923 deleted the autoscaler-terminate-idle branch June 6, 2026 00:59
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
… KubeRay (ray-project#63465)


Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests kubernetes

3 participants