
[Data] Fix get_or_create_stats_actor crash in Ray Client mode #63402

Merged
bveeramani merged 2 commits into ray-project:master from YuangGao:fix/61162-stats-actor-client-mode
Jun 7, 2026
@YuangGao YuangGao commented May 16, 2026

Contributor

Description

In Ray Client mode, ray._private.worker._global_node is None because the client driver is not a Ray worker process, even though ray.is_initialized() is True and the cluster is connected. get_or_create_stats_actor used _global_node as a proxy for "connected to Ray" and raised RuntimeError whenever Ray Data tried to register or query the stats actor, causing ds.take_batch(), ds.iter_batches(), etc. to crash on materialized datasets.

Use ray.is_initialized() for the connection check and only emit the cluster_id debug log when _global_node is available, since cluster_id is not exposed via ray.get_runtime_context().
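The change in behavior can be sketched without a running cluster. The block below is a minimal illustration, not the actual code in python/ray/data/_internal/stats.py: the function names `old_connection_check` and `new_connection_check`, the error messages, and the `SimpleNamespace` stand-in for the global node are all hypothetical, and the real function goes on to look up or create the actor.

```python
from types import SimpleNamespace
from typing import Optional


def old_connection_check(global_node: Optional[object]) -> None:
    """Pre-fix logic (sketch): a missing global node is treated as 'not connected'."""
    if global_node is None:
        # Hypothetical message; this branch is what fired under Ray Client.
        raise RuntimeError("Global node is not initialized.")


def new_connection_check(
    is_initialized: bool, global_node: Optional[object]
) -> Optional[str]:
    """Post-fix logic (sketch): gate on initialization, log cluster_id only if known.

    Returns the debug message that would be logged, or None when the global
    node is unavailable (for example, a driver connected via Ray Client).
    """
    if not is_initialized:
        # Hypothetical message for the genuinely disconnected case.
        raise RuntimeError("Ray is not initialized.")
    if global_node is None:
        # Connected, but no local node object: skip the cluster_id log.
        return None
    return f"cluster_id={global_node.cluster_id}"


# Stand-in for a regular driver, where the global node exists.
regular_node = SimpleNamespace(cluster_id="abc123")

print(new_connection_check(True, regular_node))  # regular driver: message built
print(new_connection_check(True, None))          # Ray Client: no log, no error
```

In the real function the first argument would presumably come from ray.is_initialized() and the second from ray._private.worker._global_node; passing them in explicitly here just keeps the sketch runnable without Ray installed.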

Related issues

Closes #61162

Additional information

Signed-off-by: Yuang Gao <yg2315@nyu.edu>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates get_or_create_stats_actor in python/ray/data/_internal/stats.py to check ray.is_initialized() instead of the global node's presence, ensuring compatibility with Ray Client where the global node may be None. It also includes a new test case in python/ray/data/tests/test_stats.py to validate this fix. I have no feedback to provide.

@ray-gardener ray-gardener Bot added the data and community-contribution labels May 17, 2026
@github-actions


This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale label May 31, 2026
@YuangGao

Contributor Author

Friendly ping — CI is green and the fix is minimal (4-line change to replace an internal _global_node check with ray.is_initialized() so Ray Data works under Ray Client). Would either of you mind taking a look?
@goutamvenkat-anyscale @bveeramani

@github-actions github-actions Bot added the unstale label and removed the stale label Jun 1, 2026
@jcotant1

jcotant1 commented Jun 5, 2026

Member
Comment thread python/ray/data/tests/test_stats.py Outdated
Comment on lines +2088 to +2107
def test_get_or_create_stats_actor_in_client_mode(monkeypatch):
    """``get_or_create_stats_actor`` must not raise ``RuntimeError`` when
    ``ray._private.worker._global_node`` is ``None`` while Ray itself is
    initialized, which is the case for drivers connected via Ray Client.
    """
    monkeypatch.setattr(ray, "is_initialized", lambda: True)
    monkeypatch.setattr(ray._private.worker, "_global_node", None)

    fake_ctx = MagicMock()
    fake_ctx.get_node_id.return_value = "fake_node_id"
    monkeypatch.setattr(ray, "get_runtime_context", lambda: fake_ctx)

    fake_handle = MagicMock(name="StatsActorHandle")
    fake_options = MagicMock()
    fake_options.remote.return_value = fake_handle
    monkeypatch.setattr(_StatsActor, "options", lambda **kwargs: fake_options)

    assert get_or_create_stats_actor() is fake_handle


Member

This is a brittle test that relies heavily on mocking internals.

Could we nuke this? Ray Data doesn't officially support Ray Client, so I don't think this test is worth the maintenance cost.

Contributor Author

Done — removed the test

Comment on lines +870 to +871
global_node = ray._private.worker._global_node
if global_node is not None:

Member

Can you add a comment explaining how global_node can be None? I don't think it'll be obvious to future readers.

Contributor Author

Done — added a comment

Signed-off-by: Yuang Gao <yg2315@nyu.edu>
@bveeramani bveeramani enabled auto-merge (squash) June 7, 2026 08:10
@github-actions github-actions Bot added the go label Jun 7, 2026
@bveeramani bveeramani merged commit c0bd1e7 into ray-project:master Jun 7, 2026
8 checks passed
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…oject#63402)


Labels

community-contribution, data, go, unstale

3 participants