[Data] Expose flag to run read tasks on isolated worker processes#63490
Merged
Conversation
When DataContext.isolate_read_workers is True (the default), read tasks are submitted with a per-operator runtime_env so they get their own worker process pool. This prevents large memory allocations by PyArrow during reads from inflating the resident memory of workers that are later reused by subsequent operators. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
isolate_workers is now a constructor parameter on TaskPoolMapOperator (not the base MapOperator), since it only applies to task pools. MapOperator.create() still accepts it and logs a debug message when used with ActorPoolStrategy. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
When fusing map operators, the fused operator inherits isolate_workers=True if either input operator had it set. Exposes a read-only isolate_workers property on TaskPoolMapOperator. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
ayushk7102
approved these changes
May 27, 2026
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 8808911. Configure here.
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
bveeramani
added a commit
that referenced
this pull request
Jun 3, 2026
In #63490, I added a flag to `DataContext` called `isolate_read_workers`. In this PR, I'm adding documentation to `read_parquet` describing how to use the flag. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
rueian
pushed a commit
to rueian/ray
that referenced
this pull request
Jun 4, 2026
…y-project#63490) ## Description PyArrow allocates lots of memory during reads. When the read task worker gets reused by downstream operators, that allocation isn't cleaned up. This causes problems because even if a downstream task doesn't require much memory, it's RSS can be many GBs and that causes unnecessary OOM kills. To mitigate this issue, I'm adding an `isolate_read_workers` flag to `DataContext`. It sets an environment variable on the reads' runtime environments so that they get scheduled on different workers than the downstream operators. I'm disabling this by default because the flag can cause performance regressions in some cases. ## Additional information Documentation out of scope for now -- will address in follow up. ## Related issues None. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
rueian
pushed a commit
to rueian/ray
that referenced
this pull request
Jun 4, 2026
…project#63816) In ray-project#63490, I added a flag to `DataContext` called `isolate_read_workers`. In this PR, I'm adding documentation to `read_parquet` describing how to use the flag. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
limarkdcunha
pushed a commit
to limarkdcunha/ray
that referenced
this pull request
Jun 30, 2026
…y-project#63490) ## Description PyArrow allocates lots of memory during reads. When the read task worker gets reused by downstream operators, that allocation isn't cleaned up. This causes problems because even if a downstream task doesn't require much memory, it's RSS can be many GBs and that causes unnecessary OOM kills. To mitigate this issue, I'm adding an `isolate_read_workers` flag to `DataContext`. It sets an environment variable on the reads' runtime environments so that they get scheduled on different workers than the downstream operators. I'm disabling this by default because the flag can cause performance regressions in some cases. ## Additional information Documentation out of scope for now -- will address in follow up. ## Related issues None. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
limarkdcunha
pushed a commit
to limarkdcunha/ray
that referenced
this pull request
Jun 30, 2026
…project#63816) In ray-project#63490, I added a flag to `DataContext` called `isolate_read_workers`. In this PR, I'm adding documentation to `read_parquet` describing how to use the flag. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Description
PyArrow allocates lots of memory during reads. When the read task worker gets reused by downstream operators, that allocation isn't cleaned up. This causes problems because even if a downstream task doesn't require much memory, its RSS can be many GBs and that causes unnecessary OOM kills.
To mitigate this issue, I'm adding an
isolate_read_workersflag toDataContext. It sets an environment variable on the reads' runtime environments so that they get scheduled on different workers than the downstream operators.I'm disabling this by default because the flag can cause performance regressions in some cases.
Additional information
Documentation out of scope for now -- will address in follow up.
Related issues
None.