Skip to content

[Data] Skip unconditional null strip in find_partition_index#62594

Merged
goutamvenkat-anyscale merged 3 commits into
ray-project:masterfrom
owenowenisme:data/skip-null-check-in-find-partition-index
Apr 14, 2026
Merged

[Data] Skip unconditional null strip in find_partition_index#62594
goutamvenkat-anyscale merged 3 commits into
ray-project:masterfrom
owenowenisme:data/skip-null-check-in-find-partition-index

Conversation

@owenowenisme

@owenowenisme owenowenisme commented Apr 14, 2026

Copy link
Copy Markdown
Member

Description

find_partition_index was changed to unconditionally run pd.isna(col_vals) + boolean indexing on every iteration to strip nulls.

This is O(n) with an array allocation on every call, and find_partition_index is called O(blocks × boundaries × columns) times during the sort-shuffle map phase. For SF100 with no nulls (the common case), this added ~10k+ unnecessary allocations per task, causing map tasks to regress from ~3-4 min to ~5-6 min (~50% slower).

Fix: use Arrow's O(1) column.null_count to skip the expensive path when there are no nulls.

Release test result

Regressed runtime: 600+ seconds

Result of case main: {'time': 380.45604542899997, 'object_store_spilled_total_gb': 0.0, 'sf': '100', 'group_by': ['column08', 'column13', 'column14'], 'shuffle_strategy': 'sort_shuffle_pull_based', 'aggregate': True, 'map_groups': False}
--
Finished benchmark, metrics exported to '/tmp/release_test_out.json':
{
"main": {
"time": 424.48728577200006,
"object_store_spilled_total_gb": 0.0,
"sf": "100",
"group_by": [
"column08",
"column13",
"column14"
],
"shuffle_strategy": "sort_shuffle_pull_based",
"aggregate": true,
"map_groups": false
}
}

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@owenowenisme owenowenisme requested a review from a team as a code owner April 14, 2026 04:28
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Comment thread python/ray/data/_internal/util.py Outdated
@owenowenisme owenowenisme force-pushed the data/skip-null-check-in-find-partition-index branch from 793d4bb to 486dfdf Compare April 14, 2026 04:41

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 486dfdfbdaca9fd15634dabfb3a7f5c7806ca706. Configure here.

Comment thread python/ray/data/_internal/util.py
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@owenowenisme owenowenisme force-pushed the data/skip-null-check-in-find-partition-index branch from 486dfdf to 7464f32 Compare April 14, 2026 04:51
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Apr 14, 2026
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@owenowenisme owenowenisme added the go add ONLY when ready to merge, run all tests label Apr 14, 2026
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit 4396278 into ray-project:master Apr 14, 2026
6 checks passed
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…roject#62594)

## Description
`find_partition_index` was changed to unconditionally run
pd.isna(col_vals) + boolean indexing on every iteration to strip nulls.

This is O(n) with an array allocation on every call, and
find_partition_index is called O(blocks × boundaries × columns) times
during the sort-shuffle map phase. For SF100 with no nulls (the common
case), this added ~10k+ unnecessary allocations per task, causing map
tasks to regress from ~3-4 min to ~5-6 min (~50% slower).

Fix: use Arrow's O(1) column.null_count to skip the expensive path when
there are no nulls.

### Release test result
Regressed runtime: 600+ seconds
```
Result of case main: {'time': 380.45604542899997, 'object_store_spilled_total_gb': 0.0, 'sf': '100', 'group_by': ['column08', 'column13', 'column14'], 'shuffle_strategy': 'sort_shuffle_pull_based', 'aggregate': True, 'map_groups': False}
--
Finished benchmark, metrics exported to '/tmp/release_test_out.json':
{
"main": {
"time": 424.48728577200006,
"object_store_spilled_total_gb": 0.0,
"sf": "100",
"group_by": [
"column08",
"column13",
"column14"
],
"shuffle_strategy": "sort_shuffle_pull_based",
"aggregate": true,
"map_groups": false
}
}

```
## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…roject#62594)

## Description
`find_partition_index` was changed to unconditionally run
pd.isna(col_vals) + boolean indexing on every iteration to strip nulls.

This is O(n) with an array allocation on every call, and
find_partition_index is called O(blocks × boundaries × columns) times
during the sort-shuffle map phase. For SF100 with no nulls (the common
case), this added ~10k+ unnecessary allocations per task, causing map
tasks to regress from ~3-4 min to ~5-6 min (~50% slower).

Fix: use Arrow's O(1) column.null_count to skip the expensive path when
there are no nulls.

### Release test result
Regressed runtime: 600+ seconds
```
Result of case main: {'time': 380.45604542899997, 'object_store_spilled_total_gb': 0.0, 'sf': '100', 'group_by': ['column08', 'column13', 'column14'], 'shuffle_strategy': 'sort_shuffle_pull_based', 'aggregate': True, 'map_groups': False}
--
Finished benchmark, metrics exported to '/tmp/release_test_out.json':
{
"main": {
"time": 424.48728577200006,
"object_store_spilled_total_gb": 0.0,
"sf": "100",
"group_by": [
"column08",
"column13",
"column14"
],
"shuffle_strategy": "sort_shuffle_pull_based",
"aggregate": true,
"map_groups": false
}
}

```
## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

2 participants