
[Doc] Add map_batches shuffle section to shuffling data guide#62576

Merged
richardliaw merged 7 commits into ray-project:master from xinyuangui2:map_batch_based_shuffling on Apr 15, 2026

Conversation

@xinyuangui2

Contributor

Add documentation for using `map_batches` as a distributed shuffle pipeline stage, with benchmark results comparing it to local buffer shuffle: the `map_batches` approach reaches 80-90% of baseline throughput, versus 9-13% for local buffer shuffle.
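The documented snippet itself is not reproduced in this conversation. As a rough illustration of the idea, the sketch below assumes the shuffle stage permutes rows within each batch that `map_batches` hands to it. `shuffle_batch` and `build_pipeline` are hypothetical names, not code from this PR, and the dataset size and batch size are arbitrary.

```python
import random
from typing import Dict, List, Optional, Sequence


def shuffle_batch(
    batch: Dict[str, Sequence], seed: Optional[int] = None
) -> Dict[str, List]:
    """Shuffle the rows of one batch (hypothetical helper, not from the PR).

    The same permutation is applied to every column so rows stay aligned.
    """
    num_rows = len(next(iter(batch.values())))
    order = list(range(num_rows))
    random.Random(seed).shuffle(order)
    return {name: [column[i] for i in order] for name, column in batch.items()}


def build_pipeline():
    """Attach the shuffle as a map_batches stage (assumes Ray Data and NumPy are installed)."""
    # Imported lazily so shuffle_batch stays usable without Ray.
    import numpy as np
    import ray

    def shuffle_numpy_batch(batch):
        # Ray passes a dict of NumPy arrays by default; return the same shape of data.
        shuffled = shuffle_batch(batch)
        return {name: np.asarray(column) for name, column in shuffled.items()}

    ds = ray.data.range(1000)  # single column named "id"
    # batch_size=256 is an arbitrary choice for illustration.
    return ds.map_batches(shuffle_numpy_batch, batch_size=256)
```

Because each batch is processed by a separate task, the shuffle work runs in parallel across the cluster, but rows only mix within a batch, so the randomness is bounded by the batch size chosen here.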


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner April 13, 2026 21:19
@xinyuangui2 xinyuangui2 requested a review from richardliaw April 13, 2026 21:19

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a benchmark script and documentation for a distributed shuffling method using map_batches in Ray Data, highlighting its performance advantages over local buffer shuffling. The review feedback suggests allocating CPU resources to the shuffle task to prevent performance issues and improving error handling in the benchmark script by logging exceptions during metric retrieval.
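The benchmark script under review is not shown in this conversation, so the following is only a sketch of what the two suggestions could look like. `add_shuffle_stage` and `safe_get_metric` are hypothetical helpers; the only Ray API assumed is the `num_cpus` argument of `map_batches`.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def add_shuffle_stage(ds, shuffle_fn: Callable, num_cpus: float = 1.0):
    """Reserve CPU for the shuffle stage (hypothetical helper).

    `ds` is expected to be a Ray Dataset; `num_cpus` is forwarded to
    `map_batches` so the scheduler accounts for the work each task does.
    """
    return ds.map_batches(shuffle_fn, num_cpus=num_cpus)


def safe_get_metric(fetch: Callable[[], Any], name: str) -> Optional[Any]:
    """Fetch a metric, logging the failure instead of silently swallowing it."""
    try:
        return fetch()
    except Exception:
        # logger.exception records the traceback along with the message.
        logger.exception("Failed to retrieve metric %r", name)
        return None
```

Returning `None` on failure keeps the benchmark running while the logged traceback still explains why a metric is missing from the results.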

Comment thread doc/source/data/doc_code/benchmark_local_vs_map_batches_shuffle.py Outdated
Comment thread doc/source/data/doc_code/benchmark_local_vs_map_batches_shuffle.py Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit d0c219d.

Comment thread doc/source/data/doc_code/benchmark_local_vs_map_batches_shuffle.py Outdated
@ray-gardener ray-gardener Bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels Apr 14, 2026
xinyuangui2 and others added 2 commits April 14, 2026 10:49
Wrap map_batches in backticks to fix Vale.Spelling errors and use
contraction "doesn't" per Google.Contractions style.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread doc/source/data/shuffling-data.rst Outdated
Comment thread doc/source/data/shuffling-data.rst Outdated
Comment thread doc/source/data/shuffling-data.rst Outdated
Comment thread doc/source/data/doc_code/benchmark_local_vs_map_batches_shuffle.py Outdated

@richardliaw richardliaw left a comment

Contributor


changes to slim down the PR

@xinyuangui2 xinyuangui2 requested a review from richardliaw April 15, 2026 04:55
Comment thread doc/source/data/shuffling-data.rst
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw added the go (add ONLY when ready to merge, run all tests) label Apr 15, 2026
@richardliaw richardliaw merged commit 6074f4c into ray-project:master Apr 15, 2026
8 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…oject#62576)

Add documentation for using map_batches as a distributed shuffle
pipeline stage, with benchmark results comparing it to local buffer
shuffle (80-90% vs 9-13% of baseline throughput).

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…oject#62576)

