[data] Add `Dataset.mix()` public API and user guide for weighted dataset mixing by justinvyu · Pull Request #63168 · ray-project/ray

justinvyu · 2026-05-06T22:43:25Z

Description

- Adds the `Dataset.mix()` classmethod (alpha) for streaming weighted interleaving of multiple datasets, building on the internal `MixOperator` from "[data] Add MixOperator for weighted dataset mixing" (#62450).
- Exports `MixStoppingCondition` from `ray.data`.
- Adds a user guide under the Ray Data docs covering per-block mixing, random mixing, stopping conditions, and limitations. Also moves the other "scaling collation" user guide to the Ray Data docs.
- Updates `test_mix.py` to use the public API instead of the internal helper.
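
As a rough illustration of what "streaming weighted interleaving" means, here is a minimal pure-Python sketch. It is not the Ray Data implementation and does not use the `Dataset.mix()` signature, which this page does not show. The name `weighted_interleave`, its parameters, the weight validation, and the choice to stop as soon as any source runs out are all assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def _interleave(
    iterators: List[Iterator[T]], weights: Sequence[float], rng: random.Random
) -> Iterator[T]:
    indices = range(len(iterators))
    while True:
        # Pick the next source with probability proportional to its weight.
        i = rng.choices(indices, weights=weights, k=1)[0]
        try:
            item = next(iterators[i])
        except StopIteration:
            # Assumed stopping rule: end the stream once the chosen source
            # has no more items.
            return
        yield item


def weighted_interleave(
    sources: Sequence[Iterable[T]], weights: Sequence[float], seed: int = 0
) -> Iterator[T]:
    """Hypothetical helper: lazily interleave `sources` according to `weights`."""
    # Validate eagerly so that bad input fails before any item is consumed.
    if len(sources) != len(weights):
        raise ValueError("need exactly one weight per source")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    return _interleave([iter(s) for s in sources], weights, random.Random(seed))


if __name__ == "__main__":
    mixed = list(weighted_interleave([["a"] * 10_000, ["b"] * 10_000], [0.8, 0.2]))
    print(f"fraction from first source: {mixed.count('a') / len(mixed):.3f}")
```

The sketch draws one item at a time, so the observed ratio only approaches the target over many items. Mixing at a coarser granularity (for example per block) would trade ratio accuracy within a small window for lower overhead.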

Testing

See additional testing and result here: https://gist.github.com/justinvyu/0b73d66397a3fa0d9286f88b5e3ec3c3
To be added as a release test.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces the Dataset.mix public API to Ray Data, enabling weighted interleaving of multiple datasets. The changes include the implementation of the mix method, the addition of a comprehensive user guide, and the promotion of MixStoppingCondition to the public API. Feedback focuses on aligning documentation terminology with enum names, implementing validation for input weights to prevent logical errors, and ensuring consistent context usage in the execution plan.

richardliaw · 2026-05-08T01:21:53Z

+        stats = DatasetStats(
+            metadata={"Mix": []},
+            parent=[d._raw_stats() for d in datasets],
+        )
+        stats.time_total_s = time.perf_counter() - start_time


what's up with all of this stuff?

This is also done in Union. let's just keep it consistent and consider removing later since it's just measuring a small amount of logical operator creation time.

ack, i was looking at join and didn't see the equivalent

richardliaw · 2026-05-08T01:23:36Z

        )

+    @classmethod
+    @PublicAPI(stability="alpha", api_group=SMJ_API_GROUP)


feel like this should not be in SMJ group but rather like training ingest group or something

I think it's ok for now since it falls in "merging datasets." Let's land this and reorganize APIs in a followup.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit aac2eab.

richardliaw

nice!

Adds a release test benchmark for `Dataset.mix()` (introduced in #63168) that measures mixing throughput and ratio accuracy.

Benchmark design:

- Creates 8 datasets reading ImageNet parquet, each stamped with a `ds_index` column.
- Mixes with `Dataset.mix()`, then repartitions to 4 * batch_size rows per block.
- Consumes via TorchTrainer to reproduce the weighting ratio seen when ingesting multiple local batches that are split across workers.
- Tracks per-batch mixing ratios per worker, then aggregates across workers via all_reduce to get the mean and standard deviation across **global batches**.
- Asserts that the ratio mean is within 0.05 of the target and that the std is < 0.1.
- Tests with and without a shuffling step after mixing, using `--num-workers=1`, to show that shuffling removes the dependency of mixing quality on the number of workers. [See here for more details.](https://docs.ray.io/en/master/data/mixing-data.html#random-mixing)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
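
The ratio check described above can be sketched in plain Python. This is a single-process illustration only: it does not use TorchTrainer or all_reduce, and the names `batch_ratios`, `mixing_stats`, and `ratios_ok` are hypothetical helpers, not code from the release test. It assumes each batch is represented by the list of `ds_index` values of its rows, and it reuses the 0.05 and 0.1 thresholds quoted in the description.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def batch_ratios(ds_indices: Sequence[int], num_datasets: int) -> List[float]:
    """Fraction of rows in one batch that came from each source dataset."""
    if not ds_indices:
        raise ValueError("batch must not be empty")
    counts = Counter(ds_indices)
    return [counts[i] / len(ds_indices) for i in range(num_datasets)]


def mixing_stats(
    batches: Sequence[Sequence[int]], num_datasets: int
) -> List[Tuple[float, float]]:
    """Per-dataset (mean, std) of the mixing ratio across all batches."""
    per_batch = [batch_ratios(b, num_datasets) for b in batches]
    return [
        (mean(r[i] for r in per_batch), pstdev(r[i] for r in per_batch))
        for i in range(num_datasets)
    ]


def ratios_ok(
    batches: Sequence[Sequence[int]],
    targets: Sequence[float],
    mean_tol: float = 0.05,
    max_std: float = 0.1,
) -> bool:
    """True if every dataset's ratio mean is near its target and its std is small."""
    stats = mixing_stats(batches, len(targets))
    return all(
        abs(m - target) <= mean_tol and s < max_std
        for (m, s), target in zip(stats, targets)
    )
```

Checking the standard deviation in addition to the mean matters because a stream can hit the target ratio on average while individual batches are dominated by a single dataset.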

justinvyu added 4 commits May 5, 2026 13:15

add public api

58b538e

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add public api annotation

9e4ba87

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix default

1bddc3d

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add doc

aede3ad

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu requested review from a team as code owners May 6, 2026 22:43

gemini-code-assist Bot reviewed May 6, 2026

Comment thread doc/source/train/user-guides/dataset-mixing.rst Outdated

Comment thread python/ray/data/dataset.py

Comment thread python/ray/data/dataset.py Outdated

ray-gardener Bot added the `docs` (An issue or change related to documentation) and `data` (Ray Data-related issues) labels May 7, 2026

justinvyu added 5 commits May 7, 2026 16:47

add note about small block size after repartition

75d7261

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

add api ref for mix stopping condition

255e110

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix mix stopping condition name

88c9dcf

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

clarification

e27c380

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

improve the operator name in the progress bar

f3c7b49

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

richardliaw reviewed May 8, 2026

Comment thread python/ray/data/dataset.py Outdated

richardliaw reviewed May 8, 2026

Comment thread doc/source/train/user-guides.rst

richardliaw reviewed May 8, 2026

Comment thread doc/source/train/user-guides/dataset-mixing.rst Outdated

richardliaw reviewed May 8, 2026

Comment thread doc/source/train/user-guides/data-loading-preprocessing.rst Outdated

justinvyu and others added 3 commits May 8, 2026 12:05

switch Dataset.mix() from classmethod to instance method for consistency with union/zip

1553c9b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into mix-api

7efea90

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

switch doc examples to testcode with synthetic data for CI validation

aac2eab

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

cursor Bot reviewed May 8, 2026

Comment thread python/ray/data/dataset.py Outdated

justinvyu and others added 3 commits May 8, 2026 13:46

move dataset mixing and scaling collation guides from train to data docs

e1310ab

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

refactor

4cefa18

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

fix vale lint errors in mixing and scaling collation docs

b559f54

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

richardliaw added the `go` label (add ONLY when ready to merge, run all tests) May 8, 2026

richardliaw approved these changes May 8, 2026

matthewdeng enabled auto-merge (squash) May 8, 2026 21:52

matthewdeng approved these changes May 8, 2026

justinvyu added 2 commits May 8, 2026 17:22

don't run because not enough resources on ci

bb05709

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

nit

b4935e2

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

github-actions Bot disabled auto-merge May 9, 2026 00:24

extra commas

77f524e

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu enabled auto-merge (squash) May 9, 2026 00:53

justinvyu merged commit a1d1bde into ray-project:master May 10, 2026
7 checks passed

justinvyu deleted the mix-api branch May 11, 2026 16:27

justinvyu mentioned this pull request May 12, 2026

[data] Add Dataset.mix() release test microbenchmark #63286

Merged

justinvyu merged 18 commits into ray-project:master from justinvyu:mix-api
