[Data] Fix get_or_create_stats_actor crash in Ray Client mode#63402
Conversation
Signed-off-by: Yuang Gao <yg2315@nyu.edu>
There was a problem hiding this comment.
Code Review
This pull request updates get_or_create_stats_actor in python/ray/data/_internal/stats.py to check ray.is_initialized() instead of the global node's presence, ensuring compatibility with Ray Client where the global node may be None. It also includes a new test case in python/ray/data/tests/test_stats.py to validate this fix. I have no feedback to provide.
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
|
Friendly ping — CI is green and the fix is minimal (4-line change to replace an internal |
|
@bveeramani @goutamvenkat-anyscale quick bump |
| def test_get_or_create_stats_actor_in_client_mode(monkeypatch): | ||
| """``get_or_create_stats_actor`` must not raise ``RuntimeError`` when | ||
| ``ray._private.worker._global_node`` is ``None`` while Ray itself is | ||
| initialized, which is the case for drivers connected via Ray Client. | ||
| """ | ||
| monkeypatch.setattr(ray, "is_initialized", lambda: True) | ||
| monkeypatch.setattr(ray._private.worker, "_global_node", None) | ||
|
|
||
| fake_ctx = MagicMock() | ||
| fake_ctx.get_node_id.return_value = "fake_node_id" | ||
| monkeypatch.setattr(ray, "get_runtime_context", lambda: fake_ctx) | ||
|
|
||
| fake_handle = MagicMock(name="StatsActorHandle") | ||
| fake_options = MagicMock() | ||
| fake_options.remote.return_value = fake_handle | ||
| monkeypatch.setattr(_StatsActor, "options", lambda **kwargs: fake_options) | ||
|
|
||
| assert get_or_create_stats_actor() is fake_handle | ||
|
|
||
|
|
There was a problem hiding this comment.
This is a brittle test that relies heavily on mocking internals.
Could we nuke this? Ray Data doesn't officially support Ray Client, so I don't think this test is worth the maintenance cost
There was a problem hiding this comment.
Done — removed the test
| global_node = ray._private.worker._global_node | ||
| if global_node is not None: |
There was a problem hiding this comment.
Can you add a comment explaining how global_node can be None? Don't think it'll be obvious to future readers
There was a problem hiding this comment.
Done — added a comment
Signed-off-by: Yuang Gao <yg2315@nyu.edu>
…oject#63402) ## Description In Ray Client mode, `ray._private.worker._global_node` is `None` because the client driver is not a Ray worker process, even though `ray.is_initialized()` is `True` and the cluster is connected. `get_or_create_stats_actor` used `_global_node` as a proxy for "connected to Ray" and raised `RuntimeError` whenever Ray Data tried to register or query the stats actor, causing `ds.take_batch()`, `ds.iter_batches()`, etc. to crash on materialized datasets. Use `ray.is_initialized()` for the connection check and only emit the `cluster_id` debug log when `_global_node` is available, since `cluster_id` is not exposed via `ray.get_runtime_context()`. ## Related issues Closes ray-project#61162 ## Additional information --------- Signed-off-by: Yuang Gao <yg2315@nyu.edu>
Description
In Ray Client mode,
ray._private.worker._global_nodeisNonebecause the client driver is not a Ray worker process, even thoughray.is_initialized()isTrueand the cluster is connected.get_or_create_stats_actorused_global_nodeas a proxy for "connected to Ray" and raisedRuntimeErrorwhenever Ray Data tried to register or query the stats actor, causingds.take_batch(),ds.iter_batches(), etc. to crash on materialized datasets.Use
ray.is_initialized()for the connection check and only emit thecluster_iddebug log when_global_nodeis available, sincecluster_idis not exposed viaray.get_runtime_context().Related issues
Closes #61162
Additional information