[serve] Add ControllerOptions for configurable controller runtime_env #63352

Merged: kouroshHakha merged 4 commits into ray-project:master from kouroshHakha:kh/serve-controller-options on May 15, 2026

@kouroshHakha kouroshHakha commented May 14, 2026

Summary

  • Add new public ControllerOptions config object (alpha), symmetric with HTTPOptions / gRPCOptions, that lets serve.start() / serve.run() / serve run foo.yaml pass a strictly-validated runtime_env into the Serve controller actor.
  • v0 scope: runtime_env only, and within runtime_env only the env_vars key is accepted. Other keys (pip, working_dir, py_modules, container, ...) would mutate Serve's own dependencies on a detached, long-lived controller actor and are rejected with a message pointing operators at deployment-level runtime_env.

Motivation

Today, env variables that need to reach the HAProxy layer (e.g. RAY_SERVE_HAPROXY_TCP_NODELAY) have to be set at cluster level. It would be much more convenient for tuning to do something like RAY_SERVE_HAPROXY_TCP_NODELAY=1 serve run foo.yaml, or to set the env var in a runtime_env in the YAML.

API

# Python
from ray import serve
from ray.serve.config import ControllerOptions

serve.start(
    controller_options=ControllerOptions(
        runtime_env={"env_vars": {"RAY_SERVE_HAPROXY_NBTHREAD": "16"}},
    ),
)

serve.run(
    app,  # a Serve application defined elsewhere
    controller_options={"runtime_env": {"env_vars": {"FOO": "bar"}}},  # dict also accepted
)

# serve run foo.yaml
controller_options:
  runtime_env:
    env_vars:
      RAY_SERVE_HAPROXY_TCP_NODELAY: "1"
      RAY_SERVE_HAPROXY_NBTHREAD: "16"

http_options:
  host: 0.0.0.0
  port: 8000

applications:
  - ...

Validator catches typos and disallowed fields at parse time:

>>> ControllerOptions(runtime_env={"pip": ["numpy"]})
ValidationError: ControllerOptions.runtime_env only supports ['env_vars'] in this
version; got disallowed keys ['pip']. Per-replica runtime_env belongs on the
deployment (serve.deployment(runtime_env=...)), not on the controller actor.

>>> ControllerOptions(runtime_env={"env_vars": {"X": 1}})
ValidationError: env_vars['X'] must be str (got int); coerce explicitly.
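
For readers who want to see the accept/reject rules in one place, here is a minimal stdlib-only sketch of the checks described above. It is not the code in python/ray/serve/config.py: the function name `validate_controller_runtime_env`, the constant `ALLOWED_RUNTIME_ENV_KEYS`, and the use of `ValueError` instead of a pydantic `ValidationError` are all assumptions made for illustration.

```python
from typing import Any, Optional

# v0 scope as described in the summary: only env_vars is accepted.
ALLOWED_RUNTIME_ENV_KEYS = {"env_vars"}


def validate_controller_runtime_env(runtime_env: Optional[Any]) -> Optional[dict]:
    """Hypothetical stand-in for the ControllerOptions runtime_env validator."""
    if runtime_env is None:
        return None  # default: no controller runtime_env requested
    if not isinstance(runtime_env, dict):
        raise ValueError(
            f"runtime_env must be a dict (got {type(runtime_env).__name__})"
        )

    # Reject pip, working_dir, py_modules, container, ... in one pass.
    disallowed = [k for k in runtime_env if k not in ALLOWED_RUNTIME_ENV_KEYS]
    if disallowed:
        raise ValueError(
            f"runtime_env only supports {sorted(ALLOWED_RUNTIME_ENV_KEYS)}; "
            f"got disallowed keys {disallowed}"
        )

    # Check key presence with `in` so an explicit None is rejected as well.
    if "env_vars" in runtime_env:
        env_vars = runtime_env["env_vars"]
        if not isinstance(env_vars, dict):
            raise ValueError(
                f"env_vars must be a dict (got {type(env_vars).__name__})"
            )
        for key, value in env_vars.items():
            if not isinstance(key, str) or not key:
                raise ValueError(f"env var name must be a non-empty str (got {key!r})")
            if not isinstance(value, str):
                raise ValueError(
                    f"env_vars[{key!r}] must be str "
                    f"(got {type(value).__name__}); coerce explicitly."
                )
    return runtime_env
```

Under these assumptions, `validate_controller_runtime_env({"env_vars": {"X": 1}})` raises rather than silently stringifying the value, which mirrors the "coerce explicitly" message in the second example.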

Test plan

  • python/ray/serve/tests/unit/test_config.py::TestControllerOptions — 12 methods (parametrized) covering accept/reject paths: default None, dict coerce via model_validate, valid env_vars, empty env_vars, every non-env_vars runtime_env key (pip, working_dir, py_modules, conda, container, nsight), non-dict runtime_env, non-dict env_vars, non-str values across types, empty/non-string env-var keys, extra top-level fields, mixed allowed-and-disallowed keys.
  • python/ray/serve/tests/unit/test_config.py::TestGetControllerImpl — 3 cases asserting runtime_env correctly threaded into the actor class's _default_options (or omitted when not requested).
  • python/ray/serve/tests/unit/test_schema.py::TestServeDeploySchema — 4 cases on the YAML schema: default-None, valid passthrough, rejects disallowed runtime_env keys, rejects non-str env values.
  • python/ray/serve/tests/test_standalone.py::test_serve_start_controller_options — parametrized e2e (model + dict input) that asserts the requested env vars actually land on the live controller actor's os.environ.
  • python/ray/serve/tests/test_standalone.py::test_serve_start_controller_options_rejects_disallowed_runtime_env — verifies bad runtime_env raises ValidationError at the caller, not from a Ray task.

🤖 Generated with Claude Code

Today the only way to influence the Serve controller actor's environment
is to set Ray cluster env vars at start time and hope they're on the
Anyscale runtime-env hook's propagation allowlist. Knobs like
RAY_SERVE_HAPROXY_NBTHREAD and RAY_SERVE_HAPROXY_TCP_NODELAY were
silently dropped, blocking experiments and operator overrides.

Add ControllerOptions, a public config object symmetric with HTTPOptions
and gRPCOptions, that carries a strictly-validated runtime_env for the
controller actor. v0 scope is intentionally narrow: only the env_vars
key under runtime_env is accepted. Other keys (pip, working_dir,
py_modules, container, ...) would mutate the detached, long-lived
controller's dependencies and are rejected with a message pointing
operators at deployment-level runtime_env.

Plumbed through:
- serve.start(controller_options=...)
- serve.run(..., controller_options=...) (and _run / _run_many / run_many)
- serve run foo.yaml via ServeDeploySchema.controller_options
- get_controller_impl() applies it to the controller actor's runtime_env

Reuses Anyscale's env_hook merge semantics: explicit runtime_env.env_vars
land additively on top of the hook's auto-injected set.
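
One way to read "additively on top" is a plain dict merge, sketched below. The helper name `merge_env_vars` is hypothetical, and the choice that an explicit value wins when both sides set the same key is an assumption; the commit message does not spell out conflict handling.

```python
def merge_env_vars(hook_env_vars: dict, explicit_env_vars: dict) -> dict:
    """Hypothetical merge: keep the hook's env vars, then layer explicit ones."""
    merged = dict(hook_env_vars)  # copy so the hook's dict is not mutated
    # Assumption: on a key collision the explicitly requested value wins.
    merged.update(explicit_env_vars)
    return merged


hook = {"INJECTED_BY_HOOK": "1"}
explicit = {"RAY_SERVE_HAPROXY_TCP_NODELAY": "1"}
merged = merge_env_vars(hook, explicit)
```

Both the hook-injected key and the explicit key are present in `merged`, and neither input dict is modified.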

Same lifecycle as HTTPOptions: only applied on first controller creation;
ignored with a log warning if a controller is already running.

Tests:
- TestControllerOptions in tests/unit/test_config.py (12 methods,
  parametrized -- 24 cases total) for the validator
- TestGetControllerImpl in the same file (3 cases) for white-box
  wiring into the actor class
- 4 cases on TestServeDeploySchema in tests/unit/test_schema.py for
  YAML-schema integration
- test_serve_start_controller_options (parametrized over model and
  dict input) and test_serve_start_controller_options_rejects_disallowed_runtime_env
  in tests/test_standalone.py for live env-propagation end-to-end

Verified end-to-end on a Ray Serve LLM + HAProxy stack: serve.start with
ControllerOptions(runtime_env={"env_vars": {"RAY_SERVE_HAPROXY_TCP_NODELAY": "1"}})
lands the env var on the controller's /proc/<pid>/environ and the
rendered haproxy.cfg picks up `option http-no-delay`. That flip cut
c=64 streaming TTFT mean from 201 ms to 98 ms, matching vllm-router
(104 ms) on the same workload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kouroshHakha kouroshHakha requested a review from a team as a code owner May 14, 2026 22:05

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces ControllerOptions to allow passing configuration, specifically runtime_env.env_vars, to the Ray Serve controller actor during initialization. This new configuration is integrated into serve.start, serve.run, the YAML deployment schema, and the CLI. Feedback includes a suggestion to use the built-in dict type for consistency in type hints and a recommendation to improve the validation logic for env_vars to correctly handle cases where the key might be explicitly set to null.

Comment thread python/ray/serve/_private/api.py Outdated
Comment thread python/ray/serve/config.py Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 010a8c6.

Comment thread python/ray/serve/_private/api.py
@kouroshHakha kouroshHakha marked this pull request as draft May 14, 2026 22:34
- api.py: use ``dict`` instead of ``Dict`` in ``_coerce_controller_options``
  signature to match the lowercase-``dict`` style used by neighboring
  Union annotations.
- config.py: reject explicit ``env_vars: None`` (e.g., from YAML ``null``)
  by checking key presence with ``in`` instead of ``dict.get``, so a bad
  config fails locally with a ValidationError rather than crashing later
  in the Ray runtime_env layer. Added a regression test.
- serve_head.py: forward ``config.controller_options`` from
  ``put_all_applications`` to ``serve_start_async`` -- previously the
  REST API path silently dropped controller_options from the request
  body even though schema validation accepted them.
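
The `env_vars: None` fix in the config.py item comes down to the difference between `dict.get` and a membership test. The sketch below illustrates that difference in isolation; the two helper names are hypothetical and are not taken from config.py.

```python
def env_vars_present_via_get(runtime_env: dict) -> bool:
    # .get() returns None for a missing key AND for an explicit None value,
    # so a YAML `env_vars: null` looks the same as "not set" and is skipped.
    return runtime_env.get("env_vars") is not None


def env_vars_present_via_in(runtime_env: dict) -> bool:
    # Membership only asks whether the key exists, so an explicit None
    # still reaches the type check and can be rejected there.
    return "env_vars" in runtime_env


explicit_null = {"env_vars": None}  # what YAML `env_vars: null` parses to
absent = {}
```

With `explicit_null`, the `get`-based check reports the key as absent while the `in`-based check reports it as present, which is what lets a validator raise locally instead of passing the `None` along.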

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kouroshHakha kouroshHakha added the "go" label (add ONLY when ready to merge, run all tests) May 15, 2026
@kouroshHakha kouroshHakha marked this pull request as ready for review May 15, 2026 06:37
@ray-gardener ray-gardener Bot added the "serve" label (Ray Serve Related Issue) May 15, 2026
Fixes ci/lint/lint.sh api_policy_check: every @publicapi in ray.serve
must appear in doc/source/serve/api/index.md. ControllerOptions was
added alongside HTTPOptions/gRPCOptions in the new commit but missed
from the docs index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread python/ray/serve/_private/default_impl.py Outdated
Addresses review feedback: move the ControllerOptions import from the
TYPE_CHECKING block at the bottom of the file to the regular imports
at the top. ray.serve.config has no import dependency on
ray.serve._private.default_impl, so a runtime import is safe and lets
us also drop the string annotation on ``get_controller_impl``.
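
The trade-off between a TYPE_CHECKING-only import and a runtime import can be shown with stdlib classes standing in for ControllerOptions. This is a generic illustration, not the contents of default_impl.py; `Decimal`, `Fraction`, and both function names are placeholders.

```python
from fractions import Fraction  # runtime import: the name is bound when the module runs
from typing import TYPE_CHECKING, get_type_hints

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime.
    from decimal import Decimal


def with_type_checking_import(value: "Decimal") -> None:
    """The annotation must be a string because Decimal is not bound at runtime."""


def with_runtime_import(value: Fraction) -> None:
    """The annotation can name the class directly, with no string needed."""


# Resolving hints at runtime works only for the runtime-imported class.
runtime_hints = get_type_hints(with_runtime_import)
```

Calling `get_type_hints(with_type_checking_import)` raises `NameError` here, since `Decimal` was never imported at runtime. A TYPE_CHECKING import is mainly useful for breaking import cycles, so when no cycle exists the plain import is the simpler choice.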

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kouroshHakha kouroshHakha enabled auto-merge (squash) May 15, 2026 17:26
@kouroshHakha kouroshHakha merged commit db33c04 into ray-project:master May 15, 2026
7 checks passed
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…ray-project#63352)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>

Labels

go (add ONLY when ready to merge, run all tests), serve (Ray Serve Related Issue)

2 participants