
[Bug] SidecarMode RayJob stuck in Initializing forever when the submitter fails before the RayCluster is Ready #4945

Description

@tmrtmrtmrtmr

Search before asking

  • I searched the issues and found no similar issue.

KubeRay Component

ray-operator

What happened + What you expected to happen

When a RayJob uses submissionMode: SidecarMode, the ray-job-submitter runs as a regular container inside the head Pod, and the head Pod's RestartPolicy is forced to Never.

If the submitter container exits non-zero before the RayCluster reaches the Ready state, the RayJob becomes permanently wedged in jobDeploymentStatus: Initializing and never transitions to Failed. This happens, for example, when the Ray job fails while worker Pods are still being scheduled (we hit this with a gang scheduler holding workers Pending). Consequently shutdownAfterJobFinishes never runs, the RayCluster is never torn down, and the autoscaler keeps trying to bring up workers.

Root cause (reproduced on master and v1.6.x):

  1. A terminated submitter container (head Pod RestartPolicy=Never) keeps the head Pod Ready=False ("containers with unready status: [ray-job-submitter]"), because head-pod readiness considers every container in the head Pod.
  2. So RayCluster.Status.State never becomes Ready.
  3. The RayJob Initializing -> Running transition is hard-gated on rayClusterInstance.Status.State == rayv1.Ready (the JobDeploymentStatusInitializing case logs "Wait for the RayCluster.Status.State to be ready before submitting the job." and breaks every reconcile).
  4. The only submitter-failure detector, checkSubmitterAndUpdateStatusIfNeeded, runs only in the Running case — unreachable while the cluster is not Ready.

So the submitter's failure is fully observable during Initializing (the container is Terminated with a non-zero exit code), but the controller never inspects it there, and the RayJob deadlocks.
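The deadlock can be illustrated with a small model of the gating described above. This is only a sketch: the type `observed` and the function `reconcileOnce` are hypothetical simplifications written for this issue, not the operator's actual code, and the status constants are plain strings rather than the rayv1 types.

```go
package main

// JobDeploymentStatus is a simplified stand-in for the RayJob status field.
type JobDeploymentStatus string

const (
	Initializing JobDeploymentStatus = "Initializing"
	Running      JobDeploymentStatus = "Running"
	Failed       JobDeploymentStatus = "Failed"
)

// observed is a hypothetical snapshot of what a single reconcile can see.
type observed struct {
	clusterReady    bool // models RayCluster.Status.State == Ready
	submitterFailed bool // models a Terminated submitter with non-zero exit code
}

// reconcileOnce models the current gating: the submitter is inspected only
// in the Running case, and Running is reachable only once the cluster is Ready.
func reconcileOnce(status JobDeploymentStatus, o observed) JobDeploymentStatus {
	switch status {
	case Initializing:
		if !o.clusterReady {
			// Corresponds to "Wait for the RayCluster.Status.State to be
			// ready before submitting the job." followed by a break.
			return Initializing
		}
		return Running
	case Running:
		if o.submitterFailed {
			return Failed
		}
		return Running
	}
	return status
}
```

Because a failed submitter keeps the head Pod unready, `clusterReady` stays false whenever `submitterFailed` is true before the cluster comes up, so repeated calls starting from `Initializing` never leave that state in this model.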

Expected: a SidecarMode submitter that fails terminally should drive the RayJob to Failed even if the RayCluster never reached Ready, so cleanup / shutdownAfterJobFinishes / backoffLimit proceed normally.
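A minimal sketch of the expected behavior, assuming the check can be made from the head Pod's container statuses. The local types below only mirror the shape of the corresponding Kubernetes fields (`ContainerStatus.State.Terminated.ExitCode`) so that the example is self-contained; `submitterFailed` and `initializingStep` are hypothetical names, not existing KubeRay functions, and this is not necessarily how the eventual PR will implement it.

```go
package main

// Local stand-ins mirroring the shape of the Kubernetes container status
// fields; a real implementation would read corev1.ContainerStatus instead.
type terminatedState struct{ ExitCode int32 }

type containerState struct{ Terminated *terminatedState }

type containerStatus struct {
	Name  string
	State containerState
}

// Container name taken from the head Pod's unready message in this issue.
const submitterContainerName = "ray-job-submitter"

// submitterFailed reports whether the submitter container has terminated
// with a non-zero exit code, and returns that exit code.
func submitterFailed(statuses []containerStatus) (int32, bool) {
	for _, cs := range statuses {
		if cs.Name != submitterContainerName {
			continue
		}
		if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
			return t.ExitCode, true
		}
	}
	return 0, false
}

// initializingStep sketches the expected Initializing handling: check the
// submitter first, then fall back to the existing readiness gate.
func initializingStep(clusterReady bool, statuses []containerStatus) string {
	if _, failed := submitterFailed(statuses); failed {
		return "Failed"
	}
	if !clusterReady {
		return "Initializing"
	}
	return "Running"
}
```

Checking the submitter before the readiness gate is the design choice that matters here: in this sketch a terminal submitter failure yields `Failed` even when `clusterReady` is false, while a submitter that is still running or exited with code 0 leaves the existing gate unchanged.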

Operator log (repeats forever):

controllers.RayJob  Wait for the RayCluster.Status.State to be ready before submitting the job.  ...  State:""

with RayCluster conditions HeadPodReady=False (reason Error, "containers with unready status: [ray-job-submitter]") and RayClusterProvisioned=False.

Reproduction script

  1. Submit a RayJob with submissionMode: SidecarMode whose entrypoint exits non-zero, on a cluster where worker Pods are slow to schedule (or use a gang scheduler that holds workers Pending) so the RayCluster does not reach Ready before the submitter exits.
  2. kubectl get rayjob shows the RayJob staying in Initializing indefinitely; the head Pod reports READY 2/3 with STATUS Error; the RayCluster never becomes Ready; no teardown happens.

Anything else

Unfixed on master (verified) and v1.6.x. Related to #4637 (fail-fast on deterministic pre-Running failures). The opt-in, time-based spec.preRunningDeadlineSeconds is only a partial mitigation: it reacts after a wall-clock timeout and reports PreRunningDeadlineExceeded rather than the real submitter failure.

I have a fix and will open a PR that detects the submitter failure during Initializing.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
