
[data] Add Dataset.mix() public API and user guide for weighted dataset mixing #63168

Merged
justinvyu merged 18 commits into
ray-project:master from
justinvyu:mix-api
May 10, 2026

@justinvyu justinvyu commented May 6, 2026

Contributor

Description

  • Adds Dataset.mix() classmethod (alpha) for streaming weighted interleaving of multiple datasets, building on the internal MixOperator from #62450 ([data] Add MixOperator for weighted dataset mixing).
  • Exports MixStoppingCondition from ray.data.
  • Adds a user guide under Ray Data docs covering per-block mixing, random mixing, stopping conditions, and limitations. Also moves the existing "scaling collation" user guide to the Ray Data docs.
  • Updates test_mix.py to use the public API instead of the internal helper.
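
For readers new to weighted interleaving, the idea can be illustrated with a minimal pure-Python sketch. This is not the MixOperator implementation and does not call Ray; the function name `weighted_interleave` and the `stop_on_first_exhausted` flag are hypothetical stand-ins for the mixing and stopping-condition concepts, and the random per-item draw is only one possible strategy.

```python
import random
from typing import Iterable, Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


def weighted_interleave(
    sources: Sequence[Iterable[T]],
    weights: Sequence[float],
    *,
    stop_on_first_exhausted: bool = True,  # hypothetical flag, not a Ray parameter
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Yield items from `sources`, choosing the next source at random by weight."""
    if len(sources) != len(weights):
        raise ValueError("need exactly one weight per source")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")

    rng = random.Random(seed)
    iterators = [iter(s) for s in sources]
    active = list(range(len(iterators)))  # indices of sources that still have items

    while active:
        # Draw one source index with probability proportional to its weight.
        idx = rng.choices(active, weights=[weights[i] for i in active], k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            if stop_on_first_exhausted:
                return  # stop as soon as any source runs dry
            active.remove(idx)  # otherwise keep draining the remaining sources
```

With weights `[3, 1]`, roughly three out of four yielded items come from the first source for as long as both sources have data. Because this is a generator, the weight checks run when iteration starts, not when the function is called.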

Testing

See additional testing and results here: https://gist.github.com/justinvyu/0b73d66397a3fa0d9286f88b5e3ec3c3
To be added as a release test.

justinvyu added 4 commits May 5, 2026 13:15
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested review from a team as code owners May 6, 2026 22:43

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the Dataset.mix public API to Ray Data, enabling weighted interleaving of multiple datasets. The changes include the implementation of the mix method, the addition of a comprehensive user guide, and the promotion of MixStoppingCondition to the public API. Feedback focuses on aligning documentation terminology with enum names, implementing validation for input weights to prevent logical errors, and ensuring consistent context usage in the execution plan.

Comment thread doc/source/train/user-guides/dataset-mixing.rst Outdated
Comment thread python/ray/data/dataset.py
Comment thread python/ray/data/dataset.py Outdated
@ray-gardener ray-gardener Bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels May 7, 2026
justinvyu added 5 commits May 7, 2026 16:47
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Comment thread python/ray/data/dataset.py Outdated
Comment on lines +2959 to +2963
stats = DatasetStats(
    metadata={"Mix": []},
    parent=[d._raw_stats() for d in datasets],
)
stats.time_total_s = time.perf_counter() - start_time

Contributor

what's up with all of this stuff?

Contributor Author

This is also done in Union. Let's keep it consistent and consider removing it later, since it only measures the small amount of time spent creating the logical operator.

Contributor

ack, i was looking at join and didn't see the equivalent

)

@classmethod
@PublicAPI(stability="alpha", api_group=SMJ_API_GROUP)

Contributor

feel like this should not be in SMJ group but rather like training ingest group or something

Contributor Author

I think it's ok for now since it falls in "merging datasets." Let's land this and reorganize APIs in a followup.

Comment thread doc/source/train/user-guides.rst
Comment thread doc/source/train/user-guides/dataset-mixing.rst Outdated
Comment thread doc/source/train/user-guides/data-loading-preprocessing.rst Outdated
justinvyu and others added 3 commits May 8, 2026 12:05
…ncy with union/zip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit aac2eab.

Comment thread python/ray/data/dataset.py Outdated
justinvyu and others added 3 commits May 8, 2026 13:46
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@richardliaw richardliaw added the go (add ONLY when ready to merge, run all tests) label May 8, 2026

@richardliaw richardliaw left a comment

Contributor

nice!

@matthewdeng matthewdeng enabled auto-merge (squash) May 8, 2026 21:52
justinvyu added 2 commits May 8, 2026 17:22
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@github-actions github-actions Bot disabled auto-merge May 9, 2026 00:24
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu enabled auto-merge (squash) May 9, 2026 00:53
@justinvyu justinvyu merged commit a1d1bde into ray-project:master May 10, 2026
7 checks passed
@justinvyu justinvyu deleted the mix-api branch May 11, 2026 16:27
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…aset mixing (ray-project#63168)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
justinvyu added a commit that referenced this pull request May 15, 2026
Adds a release test benchmark for `Dataset.mix()` (introduced in #63168)
that measures mixing throughput and ratio accuracy.

Benchmark design:
- Creates 8 datasets reading ImageNet parquet, each stamped with a
ds_index column
- Mixes with Dataset.mix(), repartitions to 4 * batch_size rows per
block
- Consumes via TorchTrainer to measure the weighting ratio actually
observed when the data is ingested as local batches split across workers.
- Tracks per-batch mixing ratios per worker, aggregates mean/std across
workers via all_reduce to get the mean and standard deviation across
**global batches.**
- Asserts ratio mean is within 0.05 of target and std < 0.1
- Runs with and without a shuffling step after mixing, using
`--num-workers=1`, to show that shuffling removes the dependence of
mixing quality on the number of workers. [See here for more
details.](https://docs.ray.io/en/master/data/mixing-data.html#random-mixing)
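
The ratio check described above can be sketched in plain Python. This is a single-process approximation under stated assumptions, not the release test itself: it skips the all_reduce aggregation across workers, the helper names `batch_ratios` and `check_mixing` are hypothetical, and population standard deviation is an assumption since the description does not say which estimator is used. Only the 0.05 and 0.1 tolerances and the `ds_index` stamping come from the description.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def batch_ratios(ds_indices: Sequence[int], num_datasets: int) -> List[float]:
    """Fraction of rows in one batch that came from each source dataset."""
    if not ds_indices:
        raise ValueError("batch must not be empty")
    counts = Counter(ds_indices)  # missing indices count as 0
    return [counts[i] / len(ds_indices) for i in range(num_datasets)]


def check_mixing(
    batches: Sequence[Sequence[int]],
    target: Sequence[float],
    mean_tol: float = 0.05,
    std_tol: float = 0.1,
) -> Tuple[List[float], List[float], bool]:
    """Return per-dataset mean ratio, std across batches, and a pass/fail flag."""
    if not batches:
        raise ValueError("need at least one batch")
    per_batch = [batch_ratios(b, len(target)) for b in batches]
    columns = list(zip(*per_batch))  # one column of ratios per dataset
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    ok = all(abs(m - t) <= mean_tol for m, t in zip(means, target)) and all(
        s < std_tol for s in stds
    )
    return means, stds, ok
```

Each inner sequence holds the `ds_index` values of one batch. A run can hit the target mean and still fail the check if the ratios swing widely from batch to batch, which is why the standard deviation is tested separately.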

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…3286)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>

Labels

data (Ray Data-related issues)
docs (An issue or change related to documentation)
go (add ONLY when ready to merge, run all tests)

3 participants