Skip to content

[Data] Expose flag to run read tasks on isolated worker processes#63490

Merged
bveeramani merged 9 commits into
masterfrom
balaji/isolate-read-workers
Jun 3, 2026
Merged

[Data] Expose flag to run read tasks on isolated worker processes#63490
bveeramani merged 9 commits into
masterfrom
balaji/isolate-read-workers

Conversation

@bveeramani

@bveeramani bveeramani commented May 19, 2026

Copy link
Copy Markdown
Member

Description

PyArrow allocates lots of memory during reads. When the read task worker gets reused by downstream operators, that allocation isn't cleaned up. This causes problems because even if a downstream task doesn't require much memory, its RSS can be many GBs and that causes unnecessary OOM kills.

To mitigate this issue, I'm adding an isolate_read_workers flag to DataContext. It sets an environment variable on the reads' runtime environments so that they get scheduled on different workers than the downstream operators.

I'm disabling this by default because the flag can cause performance regressions in some cases.

Additional information

Documentation out of scope for now -- will address in follow up.

Related issues

None.

When DataContext.isolate_read_workers is True (the default), read tasks
are submitted with a per-operator runtime_env so they get their own
worker process pool. This prevents large memory allocations by PyArrow
during reads from inflating the resident memory of workers that are
later reused by subsequent operators.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner May 19, 2026 05:49
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Comment thread python/ray/data/_internal/execution/operators/map_operator.py Outdated
@bveeramani bveeramani marked this pull request as draft May 19, 2026 05:55
isolate_workers is now a constructor parameter on TaskPoolMapOperator
(not the base MapOperator), since it only applies to task pools.
MapOperator.create() still accepts it and logs a debug message when
used with ActorPoolStrategy.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
When fusing map operators, the fused operator inherits
isolate_workers=True if either input operator had it set. Exposes
a read-only isolate_workers property on TaskPoolMapOperator.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani marked this pull request as ready for review May 27, 2026 18:27
@bveeramani bveeramani changed the title [Data] Isolate read workers into their own process pool May 27, 2026
@bveeramani bveeramani changed the title [Data] Expose flag to run read tasks on separate worker processes May 27, 2026
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label May 27, 2026
Comment thread python/ray/data/context.py Outdated
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8808911. Configure here.

Comment thread python/ray/data/_internal/execution/operators/task_pool_map_operator.py Outdated
@bveeramani bveeramani enabled auto-merge (squash) June 3, 2026 04:30
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Jun 3, 2026
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@github-actions github-actions Bot disabled auto-merge June 3, 2026 04:37
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) June 3, 2026 04:39
@bveeramani bveeramani merged commit 6058f06 into master Jun 3, 2026
7 checks passed
@bveeramani bveeramani deleted the balaji/isolate-read-workers branch June 3, 2026 05:33
bveeramani added a commit that referenced this pull request Jun 3, 2026
In #63490, I added a flag to
`DataContext` called `isolate_read_workers`.

In this PR, I'm adding documentation to `read_parquet` describing how to
use the flag.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…y-project#63490)

## Description

PyArrow allocates lots of memory during reads. When the read task worker
gets reused by downstream operators, that allocation isn't cleaned up.
This causes problems because even if a downstream task doesn't require
much memory, it's RSS can be many GBs and that causes unnecessary OOM
kills.

To mitigate this issue, I'm adding an `isolate_read_workers` flag to
`DataContext`. It sets an environment variable on the reads' runtime
environments so that they get scheduled on different workers than the
downstream operators.

I'm disabling this by default because the flag can cause performance
regressions in some cases.

## Additional information

Documentation out of scope for now -- will address in follow up.

## Related issues

None.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…project#63816)

In ray-project#63490, I added a flag to
`DataContext` called `isolate_read_workers`.

In this PR, I'm adding documentation to `read_parquet` describing how to
use the flag.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…y-project#63490)

## Description

PyArrow allocates lots of memory during reads. When the read task worker
gets reused by downstream operators, that allocation isn't cleaned up.
This causes problems because even if a downstream task doesn't require
much memory, it's RSS can be many GBs and that causes unnecessary OOM
kills.

To mitigate this issue, I'm adding an `isolate_read_workers` flag to
`DataContext`. It sets an environment variable on the reads' runtime
environments so that they get scheduled on different workers than the
downstream operators.

I'm disabling this by default because the flag can cause performance
regressions in some cases.

## Additional information

Documentation out of scope for now -- will address in follow up.

## Related issues

None.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…project#63816)

In ray-project#63490, I added a flag to
`DataContext` called `isolate_read_workers`.

In this PR, I'm adding documentation to `read_parquet` describing how to
use the flag.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

3 participants