[data][train] Add release test for Ray Data training ingest regression testing #63775
Merged
justinvyu merged 10 commits on Jun 2, 2026
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request introduces a simplified single-file ResNet50 / ImageNet-parquet ingest and training benchmark script to measure throughput and pipeline latency. The review feedback highlights several opportunities for improvement: resolving a batch size inconsistency between the code and comments for the data_bound variant, using a deterministic hashing function like zlib.crc32 instead of hash() to ensure consistent label mapping across distributed workers, adding a fallback directory for profiling traces to prevent permission errors on local machines, and simplifying the percentile calculation using np.percentile.
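The hashing point in the review can be illustrated with a minimal sketch. Python salts the built-in `hash()` for strings per process unless `PYTHONHASHSEED` is fixed, so two workers may map the same class name to different labels, whereas `zlib.crc32` depends only on the input bytes. The names `label_for` and `NUM_CLASSES` are hypothetical and are not taken from the PR's script.

```python
import zlib

NUM_CLASSES = 1000  # assumption: an ImageNet-style label space


def label_for(class_name: str, num_classes: int = NUM_CLASSES) -> int:
    """Map a class name to a stable integer label in [0, num_classes).

    zlib.crc32 is a pure function of the bytes, so every worker process
    computes the same label for the same name.
    """
    return zlib.crc32(class_name.encode("utf-8")) % num_classes


if __name__ == "__main__":
    # Hypothetical class name, used only for illustration.
    print(label_for("n01440764"))
```

The checksum is not a cryptographic hash and collisions are possible; it is used here only because it is deterministic across processes.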
xinyuangui2 approved these changes on Jun 2, 2026
rueian pushed a commit to rueian/ray that referenced this pull request on Jun 4, 2026:

…n testing (ray-project#63775)

Adds a release test (training_ingest_regression_test) that probes the Ray Data -> Ray Train ingest pipeline end-to-end via `iter_torch_batches`. Catches two complementary regression classes:

- Peak object-store memory: back-pressured config (`--step-sleep-s=2.0` simulating a slow forward) fills consumer-side iter-batches buffers. This variant stress tests object store memory pressure and highlights the underestimation gap of the `prefetch_batches` implementation.
- Throughput: same config without the sleep, so the data pipeline is the rate-limiter and any pipeline-rate regression shows up. This ensures that the changes in following PRs are safe to land.

The goal of this test is to capture the tradeoff between reducing peak object store memory (and consequently prevent spilling) while maintaining high GPU saturation.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request on Jun 30, 2026
Description
Adds a release test (training_ingest_regression_test) that probes the Ray Data -> Ray Train ingest pipeline end-to-end via `iter_torch_batches`. Catches two complementary regression classes:

- Peak object-store memory: back-pressured config (`--step-sleep-s=2.0` simulating a slow forward) fills consumer-side iter-batches buffers. This variant stress tests object store memory pressure and highlights the underestimation gap of the `prefetch_batches` implementation.
- Throughput: same config without the sleep, so the data pipeline is the rate-limiter and any pipeline-rate regression shows up. This ensures that the changes in following PRs are safe to land.

The goal of this test is to capture the tradeoff between reducing peak object store memory (and consequently preventing spilling) and maintaining high GPU saturation.
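The two variants described above differ only in whether each training step sleeps. A minimal sketch of such a consumer loop follows; it is not the PR's script. The function name `run_ingest_loop` and its parameters are hypothetical, and plain Python lists stand in for the batches so the sketch runs without Ray or a GPU.

```python
import time
from typing import Any, Callable, Dict, Iterable


def run_ingest_loop(
    batches: Iterable[Any],
    rows_in_batch: Callable[[Any], int],
    step_sleep_s: float = 0.0,
) -> Dict[str, float]:
    """Consume batches and report throughput.

    step_sleep_s > 0 mimics a slow forward pass, so the producer runs ahead
    and upstream buffers fill (the back-pressured variant).
    step_sleep_s == 0 makes the data pipeline the rate-limiter
    (the throughput variant).
    """
    start = time.perf_counter()
    num_rows = 0
    num_batches = 0
    for batch in batches:
        num_rows += rows_in_batch(batch)
        num_batches += 1
        if step_sleep_s > 0:
            time.sleep(step_sleep_s)  # stand-in for the training step
    elapsed = time.perf_counter() - start
    return {
        "num_batches": num_batches,
        "num_rows": num_rows,
        "elapsed_s": elapsed,
        "rows_per_s": num_rows / elapsed if elapsed > 0 else 0.0,
    }


if __name__ == "__main__":
    # Assumption: in a Ray Train worker, `batches` would instead come from
    # something like
    #   ray.train.get_dataset_shard("train").iter_torch_batches(
    #       batch_size=..., prefetch_batches=...)
    # where the shard name "train" is illustrative.
    fake_batches = [[0] * 32 for _ in range(10)]
    print(run_ingest_loop(fake_batches, len, step_sleep_s=0.0))
```

Keeping the sleep as the only difference between variants means that a change in peak memory or throughput can be attributed to the ingest path rather than to the model.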
How is this different from existing training ingest tests?
`backpressure_benchmark.training_prefetch` is a related test that runs without a GPU, using a slow mocked training step on a more memory-constrained cluster. It is a cheaper test that catches major regressions in usage estimation by raising an error if it spills.

Note: as a followup, we should dedupe the 4 sets of training ingest tests we currently have. This one + `backpressure_benchmark` are the highest signal.

Results
https://buildkite.com/ray-project/release/builds/95101
The main symptom is that the actual object store memory usage exceeds the estimated usage. Ray Data thinks that it's respecting the budget-based backpressure, but there is an accounting gap between the estimated usage and the actual usage. This gap is mostly attributed to hidden buffers, which will be removed or reduced by #63660 and #63682.
This diagram shows the breakdown of object store memory usage captured by Ray Data and the usage overage.
Here's a breakdown of the hidden buffers on each worker:

Master, pf = `prefetch_batches` = 4, N = min(4, `prefetch_batches`) = 4:

[Table: per-worker hidden buffer counts, in batches: `prefetch_batches + 1`, 1, 2 × N, N, N; surviving label fragments: "… `_pipeline`", "… `torch.as_tensor`)"]

Total: (pf+1) + 1 + 2N + N + N = pf + 4N + 2

Buffers that exist but either are tracked or are not in object store memory:

- … `prefetch_batches_locally`)
- … `iter_prefetched_bytes`

So hidden obj store mem cost / worker ≈ 22 × `batch_bytes`, and cluster-wide hidden = 22 × `batch_bytes` × `num_workers`.

The release test uses `batch_bytes` = 588 MiB, `num_workers` = 4 on a single node.

[Table: results for the `peak_object_store_memory` (sleep 2s) and `throughput` (no sleep) variants]

The throughput variant has 43% realization (vs 83%) because the fast consumer drains the iter-batches queues as fast as the producer fills them, so the hidden buffers sit half-empty in steady state instead of filling up under back-pressure.
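The arithmetic behind the 22 × `batch_bytes` figure can be checked with a short sketch. It encodes only the formula stated above for master (pf + 4N + 2, with N = min(4, pf)); the function names are hypothetical, and applying the formula to values other than pf = 4 is an extrapolation that the description does not confirm.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def hidden_batches_per_worker(prefetch_batches: int) -> int:
    """Hidden batches per worker: (pf+1) + 1 + 2N + N + N = pf + 4N + 2."""
    n = min(4, prefetch_batches)
    return prefetch_batches + 4 * n + 2


def hidden_bytes(prefetch_batches: int, batch_bytes: int, num_workers: int) -> int:
    """Cluster-wide hidden object store bytes implied by the formula."""
    return hidden_batches_per_worker(prefetch_batches) * batch_bytes * num_workers


if __name__ == "__main__":
    per_worker = hidden_batches_per_worker(4)  # 22 batches
    total = hidden_bytes(4, 588 * MIB, 4)      # 22 * 588 MiB * 4 workers
    print(per_worker, total / GIB)             # 22 50.53125
```

With the release test's settings this gives about 12.6 GiB of hidden buffers per worker and about 50.5 GiB across the four workers on the node, as an upper bound when every hidden buffer is full.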
Conclusion: There's currently significant untracked object store memory usage in the Ray Data training ingest codepath, and this number scales with the number of workers per node, the batch size, and the `prefetch_batches` configuration.