[Bug] SidecarMode RayJob stuck in Initializing forever when the submitter fails before the RayCluster is Ready #4945

Search before asking

I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When a RayJob uses submissionMode: SidecarMode, the ray-job-submitter runs as a regular container inside the head Pod, and the head Pod's RestartPolicy is forced to Never.

If the submitter container exits non-zero before the RayCluster reaches the Ready state — e.g. the Ray job fails while worker Pods are still being scheduled (we hit this with a gang scheduler holding workers Pending) — the RayJob becomes permanently wedged in jobDeploymentStatus: Initializing and never transitions to Failed. Consequently shutdownAfterJobFinishes never runs, the RayCluster is never torn down, and the autoscaler keeps trying to bring up workers.

Root cause (reproduced on master and v1.6.x):

1. A terminated submitter container (head Pod RestartPolicy=Never) keeps the head Pod Ready=False ("containers with unready status: [ray-job-submitter]"), because head-pod readiness considers every container in the head Pod.
2. So RayCluster.Status.State never becomes Ready.
3. The RayJob Initializing -> Running transition is hard-gated on rayClusterInstance.Status.State == rayv1.Ready (the JobDeploymentStatusInitializing case logs "Wait for the RayCluster.Status.State to be ready before submitting the job." and breaks on every reconcile).
4. The only submitter-failure detector, checkSubmitterAndUpdateStatusIfNeeded, runs only in the Running case, which is unreachable while the cluster is not Ready.
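The chain above can be modeled in a few lines of Go. This is a simplified sketch, not the operator's code: the types, the reconcile function, and the container names other than ray-job-submitter are illustrative assumptions, and worker readiness is ignored.

```go
package main

import "fmt"

// ContainerStatus is a simplified, hypothetical stand-in for the real
// Kubernetes container status type.
type ContainerStatus struct {
	Name       string
	Ready      bool
	Terminated bool
	ExitCode   int32
}

type JobDeploymentStatus string

const (
	Initializing JobDeploymentStatus = "Initializing"
	Running      JobDeploymentStatus = "Running"
	Failed       JobDeploymentStatus = "Failed"
)

// headPodReady models the rule that a Pod is Ready only when every
// container in it is ready.
func headPodReady(containers []ContainerStatus) bool {
	for _, c := range containers {
		if !c.Ready {
			return false
		}
	}
	return true
}

// submitterFailed reports whether the submitter container exited non-zero.
func submitterFailed(containers []ContainerStatus) bool {
	for _, c := range containers {
		if c.Name == "ray-job-submitter" && c.Terminated && c.ExitCode != 0 {
			return true
		}
	}
	return false
}

// reconcile models one pass of the state machine as described in this issue:
// the failure check exists only in the Running case.
func reconcile(status JobDeploymentStatus, containers []ContainerStatus) JobDeploymentStatus {
	clusterReady := headPodReady(containers) // workers ignored for brevity
	switch status {
	case Initializing:
		if !clusterReady {
			return Initializing // "Wait for the RayCluster.Status.State to be ready..."
		}
		return Running
	case Running:
		if submitterFailed(containers) {
			return Failed
		}
	}
	return status
}

func main() {
	// Container names other than ray-job-submitter are assumptions.
	head := []ContainerStatus{
		{Name: "ray-head", Ready: true},
		{Name: "autoscaler", Ready: true},
		{Name: "ray-job-submitter", Terminated: true, ExitCode: 1},
	}
	status := Initializing
	for i := 0; i < 5; i++ {
		status = reconcile(status, head)
	}
	fmt.Println(status) // Initializing
}
```

No matter how many reconcile passes run, the terminated submitter keeps the head Pod unready, so the Running case and its failure check are never reached.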

So the submitter's failure is fully observable during Initializing (the container is Terminated with a non-zero exit code), but the controller never inspects it there, and the RayJob deadlocks.

Expected: a SidecarMode submitter that fails terminally should drive the RayJob to Failed even if the RayCluster never reached Ready, so cleanup / shutdownAfterJobFinishes / backoffLimit proceed normally.
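One possible shape for the expected behavior, as a hedged sketch only: inspect the submitter container before applying the readiness gate. The types and function names below are hypothetical, this is not the fix in the upcoming PR, and a real change would also need to account for backoffLimit retries.

```go
package main

import "fmt"

// containerState is a hypothetical, minimal view of a container's status.
type containerState struct {
	name       string
	terminated bool
	exitCode   int32
}

// submitterExit returns the exit code of the submitter container if it has
// terminated, and false otherwise.
func submitterExit(containers []containerState) (int32, bool) {
	for _, c := range containers {
		if c.name == "ray-job-submitter" && c.terminated {
			return c.exitCode, true
		}
	}
	return 0, false
}

// nextFromInitializing decides the next status for a RayJob that is in
// Initializing. The submitter check runs first, so a terminal failure is
// reported even if the RayCluster never becomes Ready.
func nextFromInitializing(clusterReady bool, containers []containerState) (status, message string) {
	if code, done := submitterExit(containers); done && code != 0 {
		return "Failed", fmt.Sprintf("submitter exited with code %d before the RayCluster was Ready", code)
	}
	if clusterReady {
		return "Running", ""
	}
	return "Initializing", ""
}

func main() {
	head := []containerState{
		{name: "ray-job-submitter", terminated: true, exitCode: 1},
	}
	status, msg := nextFromInitializing(false, head)
	fmt.Println(status, "-", msg)
}
```

Checking the submitter ahead of the readiness gate means the reported reason is the real exit code rather than a generic timeout.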

Operator log (repeats forever):

controllers.RayJob  Wait for the RayCluster.Status.State to be ready before submitting the job.  ...  State:""

with RayCluster conditions HeadPodReady=False (reason Error, "containers with unready status: [ray-job-submitter]") and RayClusterProvisioned=False.
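For anyone triaging a stuck RayJob, the condition message itself names the offending container. The helper below is a small illustrative sketch (the function name is hypothetical) that extracts the container names from a message of the form shown above, assuming the names are space-separated inside the brackets.

```go
package main

import (
	"fmt"
	"strings"
)

// unreadyContainers extracts container names from a condition message of the
// form "containers with unready status: [a b]". It returns nil if the
// message does not match that shape.
func unreadyContainers(message string) []string {
	const prefix = "containers with unready status: ["
	start := strings.Index(message, prefix)
	if start < 0 {
		return nil
	}
	rest := message[start+len(prefix):]
	end := strings.Index(rest, "]")
	if end < 0 {
		return nil
	}
	return strings.Fields(rest[:end])
}

func main() {
	msg := "containers with unready status: [ray-job-submitter]"
	fmt.Println(unreadyContainers(msg)) // [ray-job-submitter]
}
```

If ray-job-submitter is the only name in the list while the RayJob sits in Initializing, the deadlock described in this issue is a likely cause.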

Reproduction script

Submit a RayJob with submissionMode: SidecarMode whose entrypoint exits non-zero, on a cluster where worker Pods are slow to schedule (or use a gang scheduler that holds workers Pending) so the RayCluster does not reach Ready before the submitter exits.
kubectl get rayjob stays in Initializing indefinitely; the head Pod shows 2/3 containers ready with status Error; the RayCluster never becomes Ready; no teardown happens.

Anything else

Unfixed on master (verified) and v1.6.x. Related to #4637 (fail-fast on deterministic pre-Running failures). The opt-in, time-based spec.preRunningDeadlineSeconds is only a partial mitigation: it reacts after a wall-clock timeout and reports PreRunningDeadlineExceeded rather than the real submitter failure.

I have a fix and will open a PR that detects the submitter failure during Initializing.

Are you willing to submit a PR?

Yes I am willing to submit a PR!
