Skip to content

[data][train] Add release test for Ray Data training ingest regression testing#63775

Merged
justinvyu merged 10 commits into
ray-project:masterfrom
justinvyu:justinvyu/simple-ingest-benchmark
Jun 2, 2026
Merged

[data][train] Add release test for Ray Data training ingest regression testing#63775
justinvyu merged 10 commits into
ray-project:masterfrom
justinvyu:justinvyu/simple-ingest-benchmark

Conversation

@justinvyu

@justinvyu justinvyu commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Description

Adds a release test (training_ingest_regression_test) that probes the Ray Data -> Ray Train ingest pipeline end-to-end via iter_torch_batches. Catches two complementary regression classes:

  • Peak object-store memory — back-pressured config (--step-sleep-s=2.0 simulating a slow forward) fills consumer-side iter-batches buffers. This variant stress tests object store memory pressure and highlights the underestimation gap of the prefetch_batches implementation.
  • Throughput — same config without the sleep, so the data pipeline is the rate-limiter and any pipeline-rate regression shows up. This ensures that the changes in following PRs are safe to land.

The goal of this test is to capture the tradeoff between reducing peak object store memory (and consequently prevent spilling) while maintaining high GPU saturation.

How is this different from existing training ingest tests?

  • Real H2D + real GPU steps: actually runs TorchTrainer with DDP, real forward/backward/optimizer on the GPU, and times the train loop using CUDA events. Captures interactions between the dataloader pipeline, shared memory unpinning during the device transfer step, and GPU memory / copy-engine behavior.
  • Configured to expose the consumer-side accounting gap: batch_size=1024 (≈588 MiB / batch), prefetch_batches=4, num_workers=4 per node — deliberately chosen so the consumer pipeline's hidden buffering (iter-batches outer/inner queues, format/collate workers, sliding window) holds far more object store memory than consumers report back to the resource manager. Result: measured peak object-store usage is substantially higher than the executor thinks, making the underestimation visible enough to regression-gate on.
  • backpressure_benchmark.training_prefetch is a related test which runs without GPU with a slow mocked training step on a more memory-constrained cluster. That one is a cheaper test to catch major regressions in usage estimation by raising if it spills.

Note: as a followup, we should dedupe the 4 sets of training ingest tests we currently have. This one + backpressure_benchmark are the highest signal.

Results

https://buildkite.com/ray-project/release/builds/95101

The main symptom is that the actual object store memory usage exceeds the estimated usage. Ray Data thinks that it's respecting the budget-based backpressure, but there is an accounting gap between the estimated usage and actual usage. This gap is mostly attributed to hidden buffers which will be removed/reduced by #63660, #63682).

This diagram shows the breakdown of object store memory usage captured by the Ray Data and the usage overage.

  [      usage/budget = 93.1GiB/93.1 GiB         ] [ usage overage               ]
  [   pipeline usage  ][ prefetch_blocks_locally ] [ hidden buffers on 4 workers ]

Here's a breakdown of the hidden buffers on each worker:

Master, pf=prefetch_batches=4, N=min(4, prefetch_batches)=4:

Buffer Formula At pf=4, N=4 What it holds
outer input queue prefetch_batches + 1 5 RefBundles waiting to enter pipeline; note: technically there's no 1:1 block to batch mapping, but just assume 1 block held in memory is 1 batch for simplicity.
outer in-flight 1 1 RefBundle currently driving _pipeline
collate/format input queue 2 × N 8 pa.Table slices waiting for format workers
collate/format in-flight N 4 pa.Tables being processed by N workers
collate/format output queue (zero-copy when pin=F) N 4 Post-format torch tensors (still Plasma-backed via torch.as_tensor)
Hidden batches per worker (pf+1) + 1 + 2N + N + N = pf + 4N + 2 22

Buffers that exist but either are tracked or are not in object store memory:

Slot At pf=4 Why excluded
sliding prefetch window (prefetch_batches_locally) 4 Resource manager already accounts for this via iter_prefetched_bytes
outer output queue (post-finalize) 4 GPU memory under auto-finalize (or pinned host under custom collate) — not Plasma

So hidden obj store mem cost / worker ≈ 22 × batch_bytes, and cluster-wide hidden = 22 × batch_bytes × num_workers.

The release test uses batch_bytes=588 MiB, num_workers=4 on a single node.

Variant Budget Measured peak Budget overage Predicted hidden Realization (actual/predicted)
Master, peak_object_store_memory (sleep 2s) 93.1 GiB 136.19 ± 0.32 GiB 43.09 GiB 22 slots × 4w × 588 MiB = 51.6 GiB 83%
Master, throughput (no sleep) 93.1 GiB 115.27 ± 12.44 GiB 22.17 GiB 22 slots × 4w × 588 MiB = 51.6 GiB 43%

The throughput variant has 43% realization (vs 83%) because the fast consumer drains the iter-batches queues as fast as the producer fills them, so the hidden buffers sit half-empty in steady state instead of fully under back-pressure.

Conclusion: There's currently significant untracked object store memory usage in the Ray Data training ingest codepath, and this number scales with the number of workers per node, the batch size, and the prefetch_batches configuration.

justinvyu added 3 commits May 29, 2026 15:19
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a simplified single-file ResNet50 / ImageNet-parquet ingest and training benchmark script to measure throughput and pipeline latency. The review feedback highlights several opportunities for improvement: resolving a batch size inconsistency between the code and comments for the data_bound variant, using a deterministic hashing function like zlib.crc32 instead of hash() to ensure consistent label mapping across distributed workers, adding a fallback directory for profiling traces to prevent permission errors on local machines, and simplifying the percentile calculation using np.percentile.

Comment thread release/train_tests/benchmark/simple_ingest_benchmark.py Outdated
Comment thread release/train_tests/benchmark/simple_ingest_benchmark.py Outdated
justinvyu added 6 commits June 1, 2026 17:27
…elease test

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…est_regression_test

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu changed the title [WIP[[data][train] Add simple_ingest_benchmark for Ray Data + Train ingest regression testing Jun 2, 2026
@justinvyu justinvyu marked this pull request as ready for review June 2, 2026 03:09
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu enabled auto-merge (squash) June 2, 2026 07:36
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Jun 2, 2026
@ray-gardener ray-gardener Bot added data Ray Data-related issues release-test release test labels Jun 2, 2026
@justinvyu justinvyu merged commit d1d84e2 into ray-project:master Jun 2, 2026
10 checks passed
@justinvyu justinvyu deleted the justinvyu/simple-ingest-benchmark branch June 2, 2026 22:19
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…n testing (ray-project#63775)

Adds a release test (training_ingest_regression_test) that probes the
Ray Data -> Ray Train ingest pipeline end-to-end via
`iter_torch_batches`. Catches two complementary regression classes:
- Peak object-store memory — back-pressured config (`--step-sleep-s=2.0`
simulating a slow forward) fills consumer-side iter-batches buffers.
This variant stress tests object store memory pressure and highlights
the underestimation gap of the `prefetch_batches` implementation.
- Throughput — same config without the sleep, so the data pipeline is
the rate-limiter and any pipeline-rate regression shows up. This ensures
that the changes in following PRs are safe to land.

The goal of this test is to capture the tradeoff between reducing peak
object store memory (and consequently prevent spilling) while
maintaining high GPU saturation.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…n testing (ray-project#63775)

Adds a release test (training_ingest_regression_test) that probes the
Ray Data -> Ray Train ingest pipeline end-to-end via
`iter_torch_batches`. Catches two complementary regression classes:
- Peak object-store memory — back-pressured config (`--step-sleep-s=2.0`
simulating a slow forward) fills consumer-side iter-batches buffers.
This variant stress tests object store memory pressure and highlights
the underestimation gap of the `prefetch_batches` implementation.
- Throughput — same config without the sleep, so the data pipeline is
the rate-limiter and any pipeline-rate regression shows up. This ensures
that the changes in following PRs are safe to land.

The goal of this test is to capture the tradeoff between reducing peak
object store memory (and consequently prevent spilling) while
maintaining high GPU saturation.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests release-test release test

2 participants