[Doc] Add map_batches shuffle section to shuffling data guide by xinyuangui2 · Pull Request #62576 · ray-project/ray

xinyuangui2 · 2026-04-13T21:19:46Z

Add documentation for using map_batches as a distributed shuffle pipeline stage, with benchmark results comparing it to local buffer shuffle: the map_batches approach retains 80-90% of baseline throughput, versus 9-13% for local buffer shuffle.
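The per-batch function behind such a shuffle stage might look like the sketch below. It is a pure-Python stand-in for illustration, not the code from this PR: the name `shuffle_batch`, the `seed` parameter, and the dict-of-columns batch layout are all assumptions.

```python
import random
from typing import Dict, List, Optional, Sequence


def shuffle_batch(
    batch: Dict[str, Sequence], seed: Optional[int] = None
) -> Dict[str, List]:
    """Apply one random row permutation to every column of a batch.

    Hypothetical helper: assumes a column-oriented batch where every
    column has the same number of rows.
    """
    if not batch:
        return {}
    num_rows = len(next(iter(batch.values())))
    perm = list(range(num_rows))
    random.Random(seed).shuffle(perm)
    # Index every column with the same permutation so rows stay aligned.
    return {name: [column[i] for i in perm] for name, column in batch.items()}


if __name__ == "__main__":
    demo = {"id": [0, 1, 2, 3, 4], "label": ["a", "b", "c", "d", "e"]}
    print(shuffle_batch(demo, seed=0))
```

With Ray Data, a function of this shape could be passed to `Dataset.map_batches`, for example `ds.map_batches(shuffle_batch, batch_size=N)`, so the permutation runs in distributed tasks rather than in the consumer's local buffer. Ray batches are NumPy arrays by default, so a real version would more likely index each array with a NumPy permutation. Rows only mix within a batch, so the value of `N` (an assumption to tune) bounds how random the result is.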

gemini-code-assist

Code Review

This pull request introduces a benchmark script and documentation for a distributed shuffling method using map_batches in Ray Data, highlighting its performance advantages over local buffer shuffling. The review feedback suggests allocating CPU resources to the shuffle task to prevent performance issues and improving error handling in the benchmark script by logging exceptions during metric retrieval.
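The error-handling suggestion can be illustrated with a small sketch. The benchmark script is not shown on this page, so the helper `read_metric`, the logger name, and the `fetch` callable are hypothetical. The point is only the pattern: log the exception instead of silently swallowing it.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("shuffle_benchmark")  # hypothetical logger name


def read_metric(fetch: Callable[[], float], name: str) -> Optional[float]:
    """Return a metric value, or None after logging the failure."""
    try:
        return fetch()
    except Exception:
        # logger.exception records the message together with the traceback,
        # so a missing metric is visible in the benchmark output.
        logger.exception("Failed to retrieve metric %r", name)
        return None
```

For the resource suggestion, `map_batches` accepts a `num_cpus` argument, so reserving CPU for the shuffle stage would be a one-argument change. The right value depends on the workload and is not stated here.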

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit d0c219d.

Wrap map_batches in backticks to fix Vale.Spelling errors and use contraction "doesn't" per Google.Contractions style. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

richardliaw

changes to slim down the PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

…oject#62576) Add documentation for using map_batches as a distributed shuffle pipeline stage, with benchmark results comparing it to local buffer shuffle (80-90% vs 9-13% of baseline throughput). --------- Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

xinyuangui2 requested a review from a team as a code owner April 13, 2026 21:19

xinyuangui2 requested a review from richardliaw April 13, 2026 21:19

gemini-code-assist Bot reviewed Apr 13, 2026

Comment thread doc/source/data/doc_code/benchmark_local_vs_map_batches_shuffle.py Outdated

Comment thread doc/source/data/doc_code/benchmark_local_vs_map_batches_shuffle.py Outdated

cursor Bot reviewed Apr 13, 2026

Comment thread doc/source/data/doc_code/benchmark_local_vs_map_batches_shuffle.py Outdated

ray-gardener Bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels Apr 14, 2026

xinyuangui2 and others added 2 commits April 14, 2026 10:49

Merge branch 'master' into map_batch_based_shuffling

1bc719c

[Doc] Fix vale lint errors in map_batches shuffle section

4cfbedb

richardliaw reviewed Apr 14, 2026

Comment thread doc/source/data/shuffling-data.rst Outdated

richardliaw reviewed Apr 14, 2026

Comment thread doc/source/data/shuffling-data.rst Outdated

richardliaw reviewed Apr 14, 2026

Comment thread doc/source/data/shuffling-data.rst Outdated

richardliaw reviewed Apr 14, 2026

Comment thread doc/source/data/doc_code/benchmark_local_vs_map_batches_shuffle.py Outdated

richardliaw requested changes Apr 14, 2026

xinyuangui2 and others added 2 commits April 14, 2026 21:41

Merge branch 'master' into map_batch_based_shuffling

82b26d6

[Doc] Remove benchmark reproduce script from shuffling docs

dbae0c6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

xinyuangui2 requested a review from richardliaw April 15, 2026 04:55

richardliaw reviewed Apr 15, 2026

Comment thread doc/source/data/shuffling-data.rst

richardliaw added 2 commits April 15, 2026 09:09

Apply suggestions from code review

7167602

Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Update shuffling-data.rst

ec40f72

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw approved these changes Apr 15, 2026

richardliaw added the go (add ONLY when ready to merge, run all tests) label Apr 15, 2026

richardliaw merged commit 6074f4c into ray-project:master Apr 15, 2026
8 checks passed

richardliaw merged 7 commits into ray-project:master from xinyuangui2:map_batch_based_shuffling

2 participants
