Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When a RayJob uses submissionMode: SidecarMode, the ray-job-submitter runs as a regular container inside the head Pod, and the head Pod's RestartPolicy is forced to Never.
If the submitter container exits non-zero before the RayCluster reaches the Ready state — e.g. the Ray job fails while worker Pods are still being scheduled (we hit this with a gang scheduler holding workers Pending) — the RayJob becomes permanently wedged in jobDeploymentStatus: Initializing and never transitions to Failed. Consequently shutdownAfterJobFinishes never runs, the RayCluster is never torn down, and the autoscaler keeps trying to bring up workers.
Root cause (reproduced on master and v1.6.x):
- A terminated submitter container (head Pod RestartPolicy=Never) keeps the head Pod Ready=False ("containers with unready status: [ray-job-submitter]"), because head-pod readiness considers every container in the head Pod.
- So RayCluster.Status.State never becomes Ready.
- The RayJob Initializing -> Running transition is hard-gated on rayClusterInstance.Status.State == rayv1.Ready (the JobDeploymentStatusInitializing case logs "Wait for the RayCluster.Status.State to be ready before submitting the job." and breaks every reconcile).
- The only submitter-failure detector, checkSubmitterAndUpdateStatusIfNeeded, runs only in the Running case, which is unreachable while the cluster is not Ready.
So the submitter's failure is fully observable during Initializing (the container is Terminated with a non-zero exit code), but the controller never inspects it there, and the RayJob deadlocks.
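To illustrate how observable the failure is, here is a minimal sketch of the check on the head Pod's container statuses. It is not the operator's code: the local types are stand-ins that mirror the field names of corev1.ContainerStatus so the example runs without Kubernetes dependencies, submitterFailedTerminally is a hypothetical helper, and the head container name "head" is a placeholder.

```go
package main

import "fmt"

// Local stand-ins mirroring the corev1 field names (assumption: a real
// implementation would use k8s.io/api/core/v1 directly).
type ContainerStateTerminated struct {
	ExitCode int32
	Reason   string
}

type ContainerState struct {
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

// Sidecar container name as reported in the head Pod's unready message.
const submitterContainerName = "ray-job-submitter"

// submitterFailedTerminally is a hypothetical helper: it reports whether the
// submitter container has terminated with a non-zero exit code, and returns
// that exit code.
func submitterFailedTerminally(statuses []ContainerStatus) (bool, int32) {
	for _, cs := range statuses {
		if cs.Name != submitterContainerName {
			continue
		}
		if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
			return true, t.ExitCode
		}
		// Submitter found but still running, or exited with code 0.
		return false, 0
	}
	// No submitter container in this Pod.
	return false, 0
}

func main() {
	headPodStatuses := []ContainerStatus{
		{Name: "head"}, // placeholder name for the Ray head container
		{
			Name: submitterContainerName,
			State: ContainerState{
				Terminated: &ContainerStateTerminated{ExitCode: 1, Reason: "Error"},
			},
		},
	}
	failed, code := submitterFailedTerminally(headPodStatuses)
	fmt.Println(failed, code) // prints "true 1"
}
```

Because RestartPolicy is Never, a terminated submitter will not be restarted, so a non-zero exit code here can be treated as terminal.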
Expected: a SidecarMode submitter that fails terminally should drive the RayJob to Failed even if the RayCluster never reached Ready, so cleanup / shutdownAfterJobFinishes / backoffLimit proceed normally.
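The expected behavior can be sketched as a small transition function. This is a deliberately simplified model, not the controller's reconcile loop: the JobDeploymentStatus type and nextStatus function are hypothetical, and the real Running case performs more checks than shown here.

```go
package main

import "fmt"

// Hypothetical, simplified model of the RayJob deployment status.
type JobDeploymentStatus string

const (
	Initializing JobDeploymentStatus = "Initializing"
	Running      JobDeploymentStatus = "Running"
	Failed       JobDeploymentStatus = "Failed"
)

// nextStatus is a hypothetical sketch of the expected transitions. The key
// difference from the current behavior is that a terminal submitter failure
// is checked in the Initializing case, before the cluster-readiness gate.
func nextStatus(current JobDeploymentStatus, clusterReady, submitterFailed bool) JobDeploymentStatus {
	switch current {
	case Initializing:
		if submitterFailed {
			// Fail fast even though the RayCluster never reached Ready.
			return Failed
		}
		if clusterReady {
			return Running
		}
		return Initializing
	case Running:
		if submitterFailed {
			return Failed
		}
		return Running
	}
	return current
}

func main() {
	// The scenario from this issue: cluster not Ready, submitter already failed.
	fmt.Println(nextStatus(Initializing, false, true)) // prints "Failed"
}
```

With the failure check placed ahead of the readiness gate, the wedged state (Initializing, cluster not Ready, submitter failed) resolves to Failed instead of looping.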
Operator log (repeats forever):
controllers.RayJob Wait for the RayCluster.Status.State to be ready before submitting the job. ... State:""
with RayCluster conditions HeadPodReady=False (reason Error, "containers with unready status: [ray-job-submitter]") and RayClusterProvisioned=False.
Reproduction script
- Submit a RayJob with submissionMode: SidecarMode whose entrypoint exits non-zero, on a cluster where worker Pods are slow to schedule (or use a gang scheduler that holds workers Pending) so the RayCluster does not reach Ready before the submitter exits.
- kubectl get rayjob stays in Initializing indefinitely; the head Pod is 2/3 / Error; the RayCluster never becomes ready; no teardown happens.
Anything else
Unfixed on master (verified) and v1.6.x. Related to #4637 (fail-fast on deterministic pre-Running failures). The opt-in, time-based spec.preRunningDeadlineSeconds is only a partial mitigation: it reacts after a wall-clock timeout and reports PreRunningDeadlineExceeded rather than the real submitter failure.
I have a fix and will open a PR that detects the submitter failure during Initializing.
Are you willing to submit a PR?