Problem
BuildJobSubmitCommand in SidecarMode waits only for api/gcs_healthz before running ray job submit (ray-operator/controllers/ray/common/job.go:127-141, v1.6.0). gcs_healthz reflects GCS-process liveness (Ray reporter_head.py → check_gcs_liveness → async_check_alive), not whether the head node is schedulable.
When a RayJob's entrypoint runs as the default head-pinned driver (Ray pins the JobSupervisor actor with NodeAffinitySchedulingStrategy(head, soft=False) when no entrypoint resources are set), there is a bootstrap window where GCS is alive but the head node is not yet in the GCS actor-scheduler resource view. During that window the supervisor actor is transiently infeasible and the job fails fast: with soft=False there is no fallback to another node, and the submission is not retried. The sidecar submits during that window because gcs_healthz is already green.
Dependency
Requires a new Ray endpoint proposed in ray-project/ray#64285 (api/node_schedulable_healthz), which surfaces resource-view membership (gcs_resource_manager.cc:130) — the same view the hard-affinity scheduler reads (node_affinity_scheduling_policy.cc:25), inserted before scheduling (gcs_server.cc:736-738). No reachable HTTP route / GcsClient method exposes this today, so this can't be fixed in KubeRay alone.
Change
- Add RayNodeSchedulableHealthPath = "api/node_schedulable_healthz" to ray-operator/controllers/ray/utils/constant.go (next to RayDashboardGCSHealthPath).
- In BuildJobSubmitCommand, after the existing GCS waitLoop, append a second until-loop on the new path using utils.BasePythonHealthCommand (SidecarMode-only, self-contained to this function).
- Gate it on Ray version (or a RayJob opt-in field): the route 404s on older Ray and the probe greps the response body for "success", so an unconditional loop would hang forever against an old Ray.
Tests
- Update ray-operator/controllers/ray/common/job_test.go:
  - TestBuildJobSubmitCommandWithSidecarMode (exact-slice assert.Equal): insert the new loop into the expected slice after the GCS loop.
  - TestBuildJobSubmitCommandWithSidecarModeCustomDashboardPort: parallel assertion for the schedulable path.
  - TestBuildJobSubmitCommandWithK8sJobModeNoSidecarHealthWaitLoop: add NotContains for the new path.
- Extend e2e test/e2erayjob/rayjob_sidecar_mode_test.go to assert the entrypoint actor lands on the head.
Note for maintainers
master has refactored BuildJobSubmitCommand (the GCS loop moved out of the SidecarMode block; BasePythonHealthCommand takes (url, timeout); a features.SidecarSubmitterRestart gate exists). A master-targeted change should append the second loop after cmd = append(cmd, waitLoop...), SidecarMode-only, reusing the healthURL form.
This is an alternative to the two upstream-recommended remedies that both move the driver off the head (RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES=1 or entrypoint_num_cpus) — it keeps the driver on the head (the Ray Jobs default) and just gates submission on real placeability.
Filed with assistance from automated source analysis (Claude Code); citations from KubeRay v1.6.0.