[Core] Fix ray stop failing to terminate dashboard and runtime_env agents on Windows#62428
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses an issue on Windows where setproctitle fails to update process names, preventing the autoscaler from correctly identifying dashboard and runtime environment agents. The changes introduce fallback entries in RAY_PROCESSES that match the actual script paths on Windows and add a corresponding unit test. The reviewer suggested that substring matching might be fragile due to path separator inconsistencies and recommended using a try...finally block or a pytest fixture for the module reloading logic in the test to ensure global state restoration.
| import importlib | ||
| import ray.autoscaler._private.constants as constants_mod | ||
|
|
||
| # Reload with sys.platform patched to "win32" so that the conditional | ||
| # fallback entries are included in RAY_PROCESSES. | ||
| with mock.patch("sys.platform", "win32"): | ||
| importlib.reload(constants_mod) | ||
| win32_ray_processes = list(constants_mod.RAY_PROCESSES) | ||
|
|
||
| # Restore the module to its original state. | ||
| importlib.reload(constants_mod) |
There was a problem hiding this comment.
The use of importlib.reload on a central module like constants can have side effects if other modules have already imported RAY_PROCESSES (as they will retain a reference to the old list object). While this is acceptable for an isolated unit test, it's worth noting that the second reload (line 282) is crucial to restore the global state for any subsequent tests in the same process. To make this more robust against test failures, consider wrapping the reload logic in a try...finally block or using a pytest fixture with a teardown.
…agents on Windows (ray-project#61452) On Windows, `setproctitle` does not change the process name or command line visible to psutil — it only creates a named Windows event object. This means the existing `RAY_PROCESSES` entries using "ray::DashboardAgent" and "ray::RuntimeEnvAgent" keywords never match any process, leaving these agents orphaned after `ray stop`. Add Windows-only fallback entries that match the actual script paths ("dashboard/agent.py" and "runtime_env/agent/main.py") in the command line. These are guarded by `sys.platform == "win32"` to avoid double-matching on Linux/macOS where setproctitle works correctly. Signed-off-by: jamesrayammons <jamesrayammons@outlook.com>
Signed-off-by: jamesrayammons <jamesrayammons@outlook.com>
| assert len(ready) == 1000, len(ready) | ||
|
|
||
|
|
||
| def test_ray_processes_can_match_agents_on_windows(): |
There was a problem hiding this comment.
this test is not really that useful. instead, could we add an integration test that runs on windows that actually runs ray start and ray stop and ensures that these processes exit? it could live in test_cli.py and you can mark it to only run on windows using pytest.mark.skipif(sys.platform != "win32") (see other examples in the repo)
Move the agent-matching test from test_basic_4.py to test_cli.py as a proper integration test that runs `ray start` / `ray stop` and verifies the dashboard and runtime_env agent processes actually exit. The test is marked with skipif(sys.platform != "win32") so it only runs on Windows. Signed-off-by: jamesrayammons <jamesrayammons@outlook.com>
|
Hi @edoakes thanks for your feedback. I've solved, please check it. |
edoakes
left a comment
There was a problem hiding this comment.
Triggered the windows CI tests, will merge if/when those pass
|
Thanks for the contribution @jamesrayammons |
…agents on Windows (ray-project#62428) Fix `ray stop` on Windows failing to terminate dashboard agent (`ray/dashboard/agent.py`) and runtime_env agent (`ray/_private/runtime_env/agent/main.py`) processes. On Windows, `setproctitle` does not change the process name or command line visible to `psutil` — it only creates a named Windows event object (see `spt_status.c`). This means the existing `RAY_PROCESSES` entries using `"ray::DashboardAgent"` and `"ray::RuntimeEnvAgent"` keywords never match any process during `ray stop`, leaving these agents orphaned. The command exits with code 0 and reports "Stopped all N Ray processes" while agents continue running in the background, holding file handles and blocking cleanup operations. This PR adds Windows-only fallback entries to `RAY_PROCESSES` that match the actual script paths (`"dashboard/agent.py"` and `"runtime_env/agent/main.py"`) in the process command line. These entries are guarded by `sys.platform == "win32"` to avoid double-matching on Linux/macOS where `setproctitle` works correctly. A unit test is included that verifies the matching logic works for simulated Windows process command lines. ### Changes - **`python/ray/autoscaler/_private/constants.py`**: Add two conditional entries to `RAY_PROCESSES` for Windows that match agents by script path instead of proctitle. - **`python/ray/tests/test_basic_4.py`**: Add `test_ray_processes_can_match_agents_on_windows` to verify the fallback entries are present on Windows and correctly match simulated agent command lines. ## Related issues Fixes ray-project#61452 ## Additional information ### Root cause The process-matching logic in `scripts.py` `kill_procs()` searches for agent processes by checking if keywords like `"ray::DashboardAgent"` appear in `subprocess.list2cmdline(proc.cmdline())`. On Linux/macOS, `setproctitle()` changes the process cmdline so this works. On Windows, `setproctitle()` only creates a named event object (`CreateEvent`) — `psutil.Process.cmdline()` still returns the original invocation (e.g., `python.exe .../dashboard/agent.py --port ...`), so the keyword never matches. ### Why not change the existing entries? The existing `"ray::DashboardAgent"` / `"ray::RuntimeEnvAgent"` entries work correctly on Linux and macOS. Replacing them with script-path matching would change behavior on those platforms. Adding platform-conditional fallback entries is the minimal, safe fix. ### Testing - Unit test patches `sys.platform` to `"win32"`, reloads the constants module, and verifies that simulated Windows agent command lines are matched. - Verified that `RAY_PROCESSES` invariants are preserved (raylet first, gcs_server last). - Verified no false positives on unrelated processes. - Linux/macOS behavior is unchanged (fallback entries are not included). --------- Signed-off-by: jamesrayammons <jamesrayammons@outlook.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…agents on Windows (ray-project#62428) Fix `ray stop` on Windows failing to terminate dashboard agent (`ray/dashboard/agent.py`) and runtime_env agent (`ray/_private/runtime_env/agent/main.py`) processes. On Windows, `setproctitle` does not change the process name or command line visible to `psutil` — it only creates a named Windows event object (see `spt_status.c`). This means the existing `RAY_PROCESSES` entries using `"ray::DashboardAgent"` and `"ray::RuntimeEnvAgent"` keywords never match any process during `ray stop`, leaving these agents orphaned. The command exits with code 0 and reports "Stopped all N Ray processes" while agents continue running in the background, holding file handles and blocking cleanup operations. This PR adds Windows-only fallback entries to `RAY_PROCESSES` that match the actual script paths (`"dashboard/agent.py"` and `"runtime_env/agent/main.py"`) in the process command line. These entries are guarded by `sys.platform == "win32"` to avoid double-matching on Linux/macOS where `setproctitle` works correctly. A unit test is included that verifies the matching logic works for simulated Windows process command lines. ### Changes - **`python/ray/autoscaler/_private/constants.py`**: Add two conditional entries to `RAY_PROCESSES` for Windows that match agents by script path instead of proctitle. - **`python/ray/tests/test_basic_4.py`**: Add `test_ray_processes_can_match_agents_on_windows` to verify the fallback entries are present on Windows and correctly match simulated agent command lines. ## Related issues Fixes ray-project#61452 ## Additional information ### Root cause The process-matching logic in `scripts.py` `kill_procs()` searches for agent processes by checking if keywords like `"ray::DashboardAgent"` appear in `subprocess.list2cmdline(proc.cmdline())`. On Linux/macOS, `setproctitle()` changes the process cmdline so this works. On Windows, `setproctitle()` only creates a named event object (`CreateEvent`) — `psutil.Process.cmdline()` still returns the original invocation (e.g., `python.exe .../dashboard/agent.py --port ...`), so the keyword never matches. ### Why not change the existing entries? The existing `"ray::DashboardAgent"` / `"ray::RuntimeEnvAgent"` entries work correctly on Linux and macOS. Replacing them with script-path matching would change behavior on those platforms. Adding platform-conditional fallback entries is the minimal, safe fix. ### Testing - Unit test patches `sys.platform` to `"win32"`, reloads the constants module, and verifies that simulated Windows agent command lines are matched. - Verified that `RAY_PROCESSES` invariants are preserved (raylet first, gcs_server last). - Verified no false positives on unrelated processes. - Linux/macOS behavior is unchanged (fallback entries are not included). --------- Signed-off-by: jamesrayammons <jamesrayammons@outlook.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Description
Fix
ray stopon Windows failing to terminate dashboard agent (ray/dashboard/agent.py) and runtime_env agent (ray/_private/runtime_env/agent/main.py) processes.On Windows,
setproctitledoes not change the process name or command line visible topsutil— it only creates a named Windows event object (seespt_status.c). This means the existingRAY_PROCESSESentries using"ray::DashboardAgent"and"ray::RuntimeEnvAgent"keywords never match any process duringray stop, leaving these agents orphaned. The command exits with code 0 and reports "Stopped all N Ray processes" while agents continue running in the background, holding file handles and blocking cleanup operations.This PR adds Windows-only fallback entries to
RAY_PROCESSESthat match the actual script paths ("dashboard/agent.py"and"runtime_env/agent/main.py") in the process command line. These entries are guarded bysys.platform == "win32"to avoid double-matching on Linux/macOS wheresetproctitleworks correctly. A unit test is included that verifies the matching logic works for simulated Windows process command lines.Changes
python/ray/autoscaler/_private/constants.py: Add two conditional entries toRAY_PROCESSESfor Windows that match agents by script path instead of proctitle.python/ray/tests/test_basic_4.py: Addtest_ray_processes_can_match_agents_on_windowsto verify the fallback entries are present on Windows and correctly match simulated agent command lines.Related issues
Fixes #61452
Additional information
Root cause
The process-matching logic in
scripts.pykill_procs()searches for agent processes by checking if keywords like"ray::DashboardAgent"appear insubprocess.list2cmdline(proc.cmdline()). On Linux/macOS,setproctitle()changes the process cmdline so this works. On Windows,setproctitle()only creates a named event object (CreateEvent) —psutil.Process.cmdline()still returns the original invocation (e.g.,python.exe .../dashboard/agent.py --port ...), so the keyword never matches.Why not change the existing entries?
The existing
"ray::DashboardAgent"/"ray::RuntimeEnvAgent"entries work correctly on Linux and macOS. Replacing them with script-path matching would change behavior on those platforms. Adding platform-conditional fallback entries is the minimal, safe fix.Testing
sys.platformto"win32", reloads the constants module, and verifies that simulated Windows agent command lines are matched.RAY_PROCESSESinvariants are preserved (raylet first, gcs_server last).