[data] Disable DataSourceV2 by default#63674

Merged
goutamvenkat-anyscale merged 2 commits into
ray-project:masterfrom
goutamvenkat-anyscale:goutam/data-disable-datasource-v2-default
Jun 1, 2026

@goutamvenkat-anyscale goutamvenkat-anyscale commented May 27, 2026

Contributor

Summary

Flip DEFAULT_USE_DATASOURCE_V2 from True to False. V2 currently OOMs on shuffle-heavy workloads because ReadFiles.infer_metadata() returns no size_bytes, which drops HashShufflingOperatorBase._get_default_aggregator_ray_remote_args into its 1-GiB-per-aggregator fallback. Most visibly, the TPC-H Q9 SF=100 autoscaling release test fails when a HashShuffleAggregator actor dies from host-level OOM.
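To make the failure mode concrete, here is a minimal sketch of a size-aware sizing rule with a fixed fallback. It is not the actual body of `_get_default_aggregator_ray_remote_args`; the function name `aggregator_memory_bytes`, the even-split rule, and the example numbers are all assumptions chosen for illustration. Only the 1-GiB fallback value comes from the description above.

```python
from typing import Optional

GiB = 1024 ** 3

# The 1-GiB-per-aggregator fallback named in the PR description.
FALLBACK_AGGREGATOR_MEMORY = 1 * GiB


def aggregator_memory_bytes(size_bytes: Optional[int], num_aggregators: int) -> int:
    """Hypothetical sizing rule (an assumption, not Ray's implementation).

    If the dataset size is known, split it evenly across aggregators.
    If it is unknown (size_bytes is None), use the fixed fallback.
    """
    if num_aggregators <= 0:
        raise ValueError("num_aggregators must be positive")
    if size_bytes is None:
        return FALLBACK_AGGREGATOR_MEMORY
    # Ceiling division so the per-aggregator shares cover the whole dataset.
    return -(-size_bytes // num_aggregators)


# Illustrative numbers only: a 400 GiB dataset spread over 64 aggregators.
known = aggregator_memory_bytes(400 * GiB, 64)
unknown = aggregator_memory_bytes(None, 64)
print(known / GiB, unknown / GiB)
```

Under these assumed numbers, the size-aware path reserves several GiB per aggregator while the fallback reserves only 1 GiB, so aggregators that actually hold more data than they reserved can exhaust host memory.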

The fix (a sample-extrapolated size_bytes surfaced via ReadFiles.infer_metadata) is on a separate branch. Flipping the default off keeps master green until that fix lands and the release tests pass.
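The fix itself lives on a separate branch and is not shown in this PR. As a rough sketch of what "sample-extrapolated" could mean, the helper below scales the mean size of a few sampled files up to the full file count. The name `extrapolate_size_bytes` and its signature are hypothetical, and the real change may sample or weight files differently.

```python
from typing import Optional, Sequence


def extrapolate_size_bytes(sampled_sizes: Sequence[int], total_files: int) -> Optional[int]:
    """Estimate total dataset size from a sample of per-file sizes.

    Hypothetical helper for illustration; not the code on the fix branch.
    Returns None when no estimate is possible, mirroring a missing size_bytes.
    """
    if not sampled_sizes or total_files <= 0:
        return None
    mean_size = sum(sampled_sizes) / len(sampled_sizes)
    # Scale the sample mean to the full file count.
    return int(round(mean_size * total_files))


# Three sampled files averaging 200 bytes, extrapolated to 10 files.
estimate = extrapolate_size_bytes([100, 200, 300], total_files=10)
print(estimate)
```

An estimate of this kind is only as good as the sample, so skewed file sizes would call for a larger or stratified sample.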

Users can still opt in via DataContext.use_datasource_v2 = True.
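A minimal opt-in sketch, assuming the flag is set on the current context instance obtained from `DataContext.get_current()`, which is the usual pattern for Ray Data context options. The helper name `opt_in_to_datasource_v2` is hypothetical, and the `use_datasource_v2` attribute is taken from the description above, so it only has an effect in Ray versions that ship DataSourceV2.

```python
def opt_in_to_datasource_v2() -> None:
    """Enable DataSourceV2 for the current process (illustrative helper)."""
    # Imported lazily so that defining this helper does not require Ray.
    from ray.data import DataContext

    ctx = DataContext.get_current()
    # Attribute name as given in the PR description; assumed to be read
    # when later read operations are planned.
    ctx.use_datasource_v2 = True
```

Call `opt_in_to_datasource_v2()` before creating any datasets so that subsequent reads pick up the setting.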

Test plan

  • CI is green with V1 default (no V2 regressions reintroduced).
  • Run the TPC-H Q9 autoscaling release test on master with this flip; expect it to pass.
  • Confirm read_parquet users on V1 path see unchanged behavior.

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the default configuration in python/ray/data/context.py by setting DEFAULT_USE_DATASOURCE_V2 to False instead of True. There are no review comments to address, and I have no additional feedback to provide.

@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the goutam/data-disable-datasource-v2-default branch from 457c636 to e95b6c8 Compare May 30, 2026 00:14
@goutamvenkat-anyscale goutamvenkat-anyscale added the data Ray Data-related issues label Jun 1, 2026
V2 currently OOMs on shuffle-heavy workloads because
``ReadFiles.infer_metadata()`` returns no ``size_bytes``, dropping
``HashShufflingOperatorBase._get_default_aggregator_ray_remote_args``
into its 1-GiB-per-aggregator fallback (e.g., TPC-H Q9 SF=100 autoscaling
release test). The fix is on a separate branch (sample-extrapolated
estimate surfaced via ``ReadFiles.infer_metadata``); flip the default
off until that lands and the release tests pass.

Users can still opt in with ``DataContext.use_datasource_v2 = True``.

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale force-pushed the goutam/data-disable-datasource-v2-default branch from 63dbe87 to ff7b958 Compare June 1, 2026 18:56
@goutamvenkat-anyscale goutamvenkat-anyscale marked this pull request as ready for review June 1, 2026 18:57
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner June 1, 2026 18:57
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Jun 1, 2026

@ayushk7102 ayushk7102 left a comment

Contributor

LGTM

@goutamvenkat-anyscale goutamvenkat-anyscale merged commit dfbabb9 into ray-project:master Jun 1, 2026
8 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/data-disable-datasource-v2-default branch June 1, 2026 22:15
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants