[data][train] Add release test for Ray Data training ingest regression testing by justinvyu · Pull Request #63775 · ray-project/ray

justinvyu · 2026-06-01T17:43:39Z

Description

Adds a release test (training_ingest_regression_test) that probes the Ray Data -> Ray Train ingest pipeline end-to-end via iter_torch_batches. Catches two complementary regression classes:

Peak object-store memory — back-pressured config (--step-sleep-s=2.0 simulating a slow forward) fills consumer-side iter-batches buffers. This variant stress tests object store memory pressure and highlights the underestimation gap of the prefetch_batches implementation.
Throughput — same config without the sleep, so the data pipeline is the rate-limiter and any pipeline-rate regression shows up. This ensures that the changes in following PRs are safe to land.

The goal of this test is to capture the tradeoff between reducing peak object store memory (and consequently prevent spilling) while maintaining high GPU saturation.

How is this different from existing training ingest tests?

Real H2D + real GPU steps: actually runs TorchTrainer with DDP, real forward/backward/optimizer on the GPU, and times the train loop using CUDA events. Captures interactions between the dataloader pipeline, shared memory unpinning during the device transfer step, and GPU memory / copy-engine behavior.
Configured to expose the consumer-side accounting gap: batch_size=1024 (≈588 MiB / batch), prefetch_batches=4, num_workers=4 per node — deliberately chosen so the consumer pipeline's hidden buffering (iter-batches outer/inner queues, format/collate workers, sliding window) holds far more object store memory than consumers report back to the resource manager. Result: measured peak object-store usage is substantially higher than the executor thinks, making the underestimation visible enough to regression-gate on.
backpressure_benchmark.training_prefetch is a related test which runs without GPU with a slow mocked training step on a more memory-constrained cluster. That one is a cheaper test to catch major regressions in usage estimation by raising if it spills.

Note: as a followup, we should dedupe the 4 sets of training ingest tests we currently have. This one + backpressure_benchmark are the highest signal.

Results

https://buildkite.com/ray-project/release/builds/95101

The main symptom is that the actual object store memory usage exceeds the estimated usage. Ray Data thinks that it's respecting the budget-based backpressure, but there is an accounting gap between the estimated usage and actual usage. This gap is mostly attributed to hidden buffers which will be removed/reduced by #63660, #63682).

This diagram shows the breakdown of object store memory usage captured by the Ray Data and the usage overage.

  [      usage/budget = 93.1GiB/93.1 GiB         ] [ usage overage               ]
  [   pipeline usage  ][ prefetch_blocks_locally ] [ hidden buffers on 4 workers ]

Here's a breakdown of the hidden buffers on each worker:

Master, pf=prefetch_batches=4, N=min(4, prefetch_batches)=4:

Buffer	Formula	At pf=4, N=4	What it holds
outer input queue	`prefetch_batches + 1`	5	RefBundles waiting to enter pipeline; note: technically there's no 1:1 block to batch mapping, but just assume 1 block held in memory is 1 batch for simplicity.
outer in-flight	`1`	1	RefBundle currently driving `_pipeline`
collate/format input queue	`2 × N`	8	pa.Table slices waiting for format workers
collate/format in-flight	`N`	4	pa.Tables being processed by N workers
collate/format output queue (zero-copy when pin=F)	`N`	4	Post-format torch tensors (still Plasma-backed via `torch.as_tensor`)
Hidden batches per worker	`(pf+1) + 1 + 2N + N + N` = `pf + 4N + 2`	22

Buffers that exist but either are tracked or are not in object store memory:

Slot	At pf=4	Why excluded
sliding prefetch window (`prefetch_batches_locally`)	4	Resource manager already accounts for this via `iter_prefetched_bytes`
outer output queue (post-finalize)	4	GPU memory under auto-finalize (or pinned host under custom collate) — not Plasma

So hidden obj store mem cost / worker ≈ 22 × batch_bytes, and cluster-wide hidden = 22 × batch_bytes × num_workers.

The release test uses batch_bytes=588 MiB, num_workers=4 on a single node.

Variant	Budget	Measured peak	Budget overage	Predicted hidden	Realization (actual/predicted)
Master, `peak_object_store_memory` (sleep 2s)	93.1 GiB	136.19 ± 0.32 GiB	43.09 GiB	22 slots × 4w × 588 MiB = 51.6 GiB	83%
Master, `throughput` (no sleep)	93.1 GiB	115.27 ± 12.44 GiB	22.17 GiB	22 slots × 4w × 588 MiB = 51.6 GiB	43%

The throughput variant has 43% realization (vs 83%) because the fast consumer drains the iter-batches queues as fast as the producer fills them, so the hidden buffers sit half-empty in steady state instead of fully under back-pressure.

Conclusion: There's currently significant untracked object store memory usage in the Ray Data training ingest codepath, and this number scales with the number of workers per node, the batch size, and the prefetch_batches configuration.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a simplified single-file ResNet50 / ImageNet-parquet ingest and training benchmark script to measure throughput and pipeline latency. The review feedback highlights several opportunities for improvement: resolving a batch size inconsistency between the code and comments for the data_bound variant, using a deterministic hashing function like zlib.crc32 instead of hash() to ensure consistent label mapping across distributed workers, adding a fallback directory for profiling traces to prevent permission errors on local machines, and simplifying the percentile calculation using np.percentile.

…elease test Signed-off-by: Justin Yu <justinvyu@anyscale.com>

…est_regression_test Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

…n testing (ray-project#63775) Adds a release test (training_ingest_regression_test) that probes the Ray Data -> Ray Train ingest pipeline end-to-end via `iter_torch_batches`. Catches two complementary regression classes: - Peak object-store memory — back-pressured config (`--step-sleep-s=2.0` simulating a slow forward) fills consumer-side iter-batches buffers. This variant stress tests object store memory pressure and highlights the underestimation gap of the `prefetch_batches` implementation. - Throughput — same config without the sleep, so the data pipeline is the rate-limiter and any pipeline-rate regression shows up. This ensures that the changes in following PRs are safe to land. The goal of this test is to capture the tradeoff between reducing peak object store memory (and consequently prevent spilling) while maintaining high GPU saturation. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu added 3 commits May 29, 2026 15:19

[Train] Add simple_ingest_benchmark for Ray Data + Train ingest probing

753afd4

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

nit

671eb8e

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

lower batch size

2eed4d3

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

justinvyu added 6 commits June 1, 2026 17:27

[Train] Rename benchmark to training_ingest_regression_test and add r…

690d1ae

…elease test Signed-off-by: Justin Yu <justinvyu@anyscale.com>

[Train] Drop --variant and --manual-device-transfer from training_ing…

047b915

…est_regression_test Signed-off-by: Justin Yu <justinvyu@anyscale.com>

update comment

f0f0811

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

move to data release test folder

6a0af15

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

update compute config

98f56d7

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

use benchmark util

d1aa15c

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu changed the title ~~[WIP[[data][train] Add simple_ingest_benchmark for Ray Data + Train ingest regression testing~~ Jun 2, 2026

justinvyu marked this pull request as ready for review June 2, 2026 03:09

bump obj store mem fraction (this test shouldn't spill)

f355f84

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu enabled auto-merge (squash) June 2, 2026 07:36

github-actions Bot added the go add ONLY when ready to merge, run all tests label Jun 2, 2026

ray-gardener Bot added data Ray Data-related issues release-test release test labels Jun 2, 2026

xinyuangui2 approved these changes Jun 2, 2026

View reviewed changes

justinvyu merged commit d1d84e2 into ray-project:master Jun 2, 2026
10 checks passed

justinvyu deleted the justinvyu/simple-ingest-benchmark branch June 2, 2026 22:19

justinvyu mentioned this pull request Jun 3, 2026

[data] Fix iter_batches spilling (2/n): Replace inner format/collate make_async_gen with iter_threaded #63682

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[data][train] Add release test for Ray Data training ingest regression testing#63775

[data][train] Add release test for Ray Data training ingest regression testing#63775
justinvyu merged 10 commits into
ray-project:masterfrom
justinvyu:justinvyu/simple-ingest-benchmark

justinvyu commented Jun 1, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Labels

2 participants

Uh oh!

Conversation

justinvyu commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

How is this different from existing training ingest tests?

Results

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Labels

2 participants

justinvyu commented Jun 1, 2026 •

edited

Loading