[Data] Expose flag to run read tasks on isolated worker processes by bveeramani · Pull Request #63490 · ray-project/ray

bveeramani · 2026-05-19T05:49:03Z

Description

PyArrow allocates a lot of memory during reads. When a read task's worker process gets reused by downstream operators, that memory isn't cleaned up. As a result, even a downstream task that needs little memory can have an RSS of many GBs, which causes unnecessary OOM kills.

To mitigate this issue, I'm adding an `isolate_read_workers` flag to `DataContext`. It sets an environment variable in the read tasks' runtime environments so that they get scheduled on different workers than the downstream operators.

I'm disabling this by default because the flag can cause performance regressions in some cases.
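A minimal sketch of opting in, assuming a Ray build that includes this PR. `DataContext.get_current()` and `ray.data.read_parquet` are existing Ray Data APIs, and `isolate_read_workers` is the attribute this PR adds. The helper name and its `path` argument are illustrative only.

```python
def read_parquet_with_isolated_workers(path: str):
    """Read Parquet files with read tasks on their own worker processes.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    # Imported inside the function so the helper can be defined
    # even where Ray is not installed.
    import ray.data

    ctx = ray.data.DataContext.get_current()
    # The flag is off by default, so opt in explicitly. Setting it
    # before the dataset is created is the conservative choice.
    ctx.isolate_read_workers = True
    return ray.data.read_parquet(path)
```

Because the flag can cause performance regressions in some cases, it is probably best reserved for pipelines where downstream tasks are being OOM-killed despite modest memory needs.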

Additional information

Documentation is out of scope for now; I'll address it in a follow-up.

Related issues

None.

When `DataContext.isolate_read_workers` is True (the default), read tasks are submitted with a per-operator `runtime_env` so they get their own worker process pool. This prevents large memory allocations by PyArrow during reads from inflating the resident memory of workers that are later reused by subsequent operators.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
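A rough sketch of the per-operator `runtime_env` idea, not the code in this PR. It tags the read operator's remote args with an environment variable whose value is unique to that operator. As far as I know, Ray only reuses a worker process for tasks whose runtime environment matches, so tagged read tasks would land on their own workers. The variable name `RAY_DATA_ISOLATED_WORKER_ID` and the helper function are made up for illustration.

```python
import copy
import uuid
from typing import Any, Dict, Optional

# Hypothetical variable name; the real one used by the PR is not shown here.
ISOLATION_ENV_VAR = "RAY_DATA_ISOLATED_WORKER_ID"


def with_isolated_runtime_env(
    ray_remote_args: Dict[str, Any], operator_id: Optional[str] = None
) -> Dict[str, Any]:
    """Return a copy of ``ray_remote_args`` whose runtime_env carries a marker.

    Existing runtime_env settings and env vars are preserved; the input
    dict is not mutated.
    """
    args = copy.deepcopy(ray_remote_args)
    runtime_env = args.setdefault("runtime_env", {})
    env_vars = runtime_env.setdefault("env_vars", {})
    # A value unique to the operator makes its runtime_env differ from
    # every other operator's, including downstream ones.
    env_vars[ISOLATION_ENV_VAR] = operator_id or uuid.uuid4().hex
    return args


# Read tasks get the marker; downstream tasks keep their original args.
read_args = with_isolated_runtime_env(
    {"num_cpus": 1, "runtime_env": {"env_vars": {"FOO": "bar"}}},
    operator_id="read-0",
)
downstream_args = {"num_cpus": 1}
```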


`isolate_workers` is now a constructor parameter on `TaskPoolMapOperator` (not the base `MapOperator`), since it only applies to task pools. `MapOperator.create()` still accepts it and logs a debug message when used with `ActorPoolStrategy`.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
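The dispatch described in this commit could look roughly like the simplified sketch below. The `Toy*` classes are stand-ins, not Ray's real operator or strategy classes, and the constructor and factory signatures are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


class ToyTaskPoolStrategy:
    """Stand-in for a task-based compute strategy."""


class ToyActorPoolStrategy:
    """Stand-in for an actor-based compute strategy."""


class ToyTaskPoolMapOperator:
    def __init__(self, name: str, isolate_workers: bool = False):
        self.name = name
        self._isolate_workers = isolate_workers

    @property
    def isolate_workers(self) -> bool:
        # Read-only: there is no setter, so assignment raises AttributeError.
        return self._isolate_workers


class ToyActorPoolMapOperator:
    def __init__(self, name: str):
        self.name = name


def create_map_operator(name, compute_strategy, isolate_workers=False):
    """Factory that forwards ``isolate_workers`` only to task pools."""
    if isinstance(compute_strategy, ToyActorPoolStrategy):
        if isolate_workers:
            # The flag only applies to task pools, so it is ignored here.
            logger.debug("isolate_workers is ignored for actor pools (%s).", name)
        return ToyActorPoolMapOperator(name)
    return ToyTaskPoolMapOperator(name, isolate_workers=isolate_workers)
```

Keeping the parameter off the base class means actor-pool operators never carry a flag they cannot honor, while callers of the factory do not need to branch on the compute strategy.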

When fusing map operators, the fused operator inherits `isolate_workers=True` if either input operator had it set. Exposes a read-only `isolate_workers` property on `TaskPoolMapOperator`.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
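The fusion rule amounts to a logical OR over the two inputs. A toy illustration, using a hypothetical `ToyTaskPoolOp` dataclass rather than Ray's operator classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen, so isolate_workers is read-only
class ToyTaskPoolOp:
    name: str
    isolate_workers: bool = False


def fuse(upstream: ToyTaskPoolOp, downstream: ToyTaskPoolOp) -> ToyTaskPoolOp:
    """Fuse two operators; isolation is kept if either side asked for it."""
    return ToyTaskPoolOp(
        name=f"{upstream.name}->{downstream.name}",
        isolate_workers=upstream.isolate_workers or downstream.isolate_workers,
    )


read = ToyTaskPoolOp("ReadParquet", isolate_workers=True)
map_op = ToyTaskPoolOp("Map")
fused = fuse(read, map_op)
```

Under this rule, a map fused into an isolated read runs in the read's isolated workers, so isolation is never silently dropped by fusion.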

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8808911.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

In #63490, I added a flag to `DataContext` called `isolate_read_workers`. In this PR, I'm adding documentation to `read_parquet` describing how to use the flag.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>



bveeramani requested a review from a team as a code owner May 19, 2026 05:49

cursor Bot reviewed May 19, 2026

Comment thread python/ray/data/_internal/execution/operators/map_operator.py Outdated

bveeramani marked this pull request as draft May 19, 2026 05:55

bveeramani added 4 commits May 18, 2026 22:57

Add test

e3e1825

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Improve formatting

5b589f7

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani marked this pull request as ready for review May 27, 2026 18:27

bveeramani assigned justinvyu May 27, 2026

bveeramani changed the title ~~[Data] Isolate read workers into their own process pool~~ [Data] Expose flag to run read tasks on separate worker processes May 27, 2026

bveeramani changed the title ~~[Data] Expose flag to run read tasks on separate worker processes~~ [Data] Expose flag to run read tasks on isolated worker processes May 27, 2026

ray-gardener Bot added the `data` (Ray Data-related issues) label May 27, 2026

ayushk7102 approved these changes May 27, 2026

Comment thread python/ray/data/context.py Outdated

Comment thread python/ray/data/_internal/execution/operators/task_pool_map_operator.py

bveeramani added 2 commits June 2, 2026 20:41

Merge branch 'master' into balaji/isolate-read-workers

6a5ab6c

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Address review comments

8808911

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

cursor Bot reviewed Jun 3, 2026

Comment thread python/ray/data/_internal/execution/operators/task_pool_map_operator.py Outdated

bveeramani enabled auto-merge (squash) June 3, 2026 04:30

github-actions Bot added the `go` (add ONLY when ready to merge, run all tests) label Jun 3, 2026

Address review comments

62caf7f

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

github-actions Bot disabled auto-merge June 3, 2026 04:37

Fix formatting

7d61c56

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani enabled auto-merge (squash) June 3, 2026 04:39

bveeramani merged commit 6058f06 into master Jun 3, 2026
7 checks passed

bveeramani deleted the balaji/isolate-read-workers branch June 3, 2026 05:33

bveeramani mentioned this pull request Jun 3, 2026

[Data][Docs] Document isolate_read_workers for read_parquet #63816

Merged

bveeramani merged 9 commits into master from balaji/isolate-read-workers


3 participants
