[RayJob] Sidecar submitter: wait for head-node schedulability before submitting (close hard-affinity placement race) #4943

Description

@tmrtmrtmrtmr

Problem

BuildJobSubmitCommand in SidecarMode waits only for api/gcs_healthz before running ray job submit (ray-operator/controllers/ray/common/job.go:127-141, v1.6.0). gcs_healthz reflects GCS-process liveness (Ray reporter_head.py → check_gcs_liveness → async_check_alive), not whether the head node is schedulable.

When a RayJob's entrypoint runs as the default head-pinned driver (Ray pins the JobSupervisor actor with NodeAffinitySchedulingStrategy(head, soft=False) when no entrypoint resources are set), there is a bootstrap window in which GCS is alive but the head node is not yet in the GCS actor-scheduler resource view. During that window the supervisor actor is transiently infeasible and the job fails fast: with soft=False there is no fallback to another node, and there is no retry. The sidecar submits during that window because gcs_healthz is already green.

Dependency

Requires a new Ray endpoint proposed in ray-project/ray#64285 (api/node_schedulable_healthz), which surfaces resource-view membership (gcs_resource_manager.cc:130) — the same view the hard-affinity scheduler reads (node_affinity_scheduling_policy.cc:25), inserted before scheduling (gcs_server.cc:736-738). No reachable HTTP route / GcsClient method exposes this today, so this can't be fixed in KubeRay alone.

Change

  • Add RayNodeSchedulableHealthPath = "api/node_schedulable_healthz" to ray-operator/controllers/ray/utils/constant.go (next to RayDashboardGCSHealthPath).
  • In BuildJobSubmitCommand, after the existing GCS waitLoop, append a second until-loop on the new path using utils.BasePythonHealthCommand (SidecarMode-only, self-contained to this function).
  • Gate it on Ray version (or a RayJob opt-in field): the route returns 404 on older Ray and the probe succeeds only when the response body contains success, so an unconditional loop would hang forever against an old Ray.

Tests

Update ray-operator/controllers/ray/common/job_test.go:

  • TestBuildJobSubmitCommandWithSidecarMode (exact-slice assert.Equal) — insert the new loop into the expected slice after the GCS loop.
  • TestBuildJobSubmitCommandWithSidecarModeCustomDashboardPort — parallel assertion for the schedulable path.
  • TestBuildJobSubmitCommandWithK8sJobModeNoSidecarHealthWaitLoop — add NotContains for the new path.
  • Extend e2e test/e2erayjob/rayjob_sidecar_mode_test.go to assert the entrypoint actor lands on the head.
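The property the unit tests above care about can be sketched as a small helper: in SidecarMode the schedulable probe appears after the GCS probe, and in K8sJobMode it is absent. This is an illustrative sketch only; the helper names are hypothetical, and the real tests use exact-slice assert.Equal and NotContains as listed.

```go
package main

import "strings"

// indexOfContaining returns the index of the first token that contains substr,
// or -1 if no token does.
func indexOfContaining(cmd []string, substr string) int {
	for i, tok := range cmd {
		if strings.Contains(tok, substr) {
			return i
		}
	}
	return -1
}

// schedulableLoopFollowsGCS reports whether both probes are present and the
// GCS probe comes first in the submit command.
func schedulableLoopFollowsGCS(cmd []string) bool {
	gcs := indexOfContaining(cmd, "api/gcs_healthz")
	sched := indexOfContaining(cmd, "api/node_schedulable_healthz")
	return gcs >= 0 && sched > gcs
}
```

For the K8sJobMode case, indexOfContaining(cmd, "api/node_schedulable_healthz") returning -1 is the equivalent of the NotContains assertion.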

Note for maintainers

master has refactored BuildJobSubmitCommand (the GCS loop moved out of the SidecarMode block; BasePythonHealthCommand takes (url, timeout); a features.SidecarSubmitterRestart gate exists). A master-targeted change should append the second loop after cmd = append(cmd, waitLoop...), SidecarMode-only, reusing the healthURL form.

This is an alternative to the two upstream-recommended remedies that both move the driver off the head (RAY_JOB_ALLOW_DRIVER_ON_WORKER_NODES=1 or entrypoint_num_cpus) — it keeps the driver on the head (the Ray Jobs default) and just gates submission on real placeability.


Filed with assistance from automated source analysis (Claude Code); citations from KubeRay v1.6.0.
