[Data] Skip unconditional null strip in `find_partition_index` by owenowenisme · Pull Request #62594 · ray-project/ray

owenowenisme · 2026-04-14T04:28:32Z

Description

`find_partition_index` was changed to unconditionally run `pd.isna(col_vals)` followed by boolean indexing on every iteration to strip nulls.

This is O(n) with an array allocation on every call, and find_partition_index is called O(blocks × boundaries × columns) times during the sort-shuffle map phase. For SF100 with no nulls (the common case), this added ~10k+ unnecessary allocations per task, causing map tasks to regress from ~3-4 min to ~5-6 min (~50% slower).

Fix: use Arrow's O(1) column.null_count to skip the expensive path when there are no nulls.

Release test result

Regressed runtime: 600+ seconds

Result of case main: {'time': 380.45604542899997, 'object_store_spilled_total_gb': 0.0, 'sf': '100', 'group_by': ['column08', 'column13', 'column14'], 'shuffle_strategy': 'sort_shuffle_pull_based', 'aggregate': True, 'map_groups': False}
--
Finished benchmark, metrics exported to '/tmp/release_test_out.json':
{
    "main": {
        "time": 424.48728577200006,
        "object_store_spilled_total_gb": 0.0,
        "sf": "100",
        "group_by": [
            "column08",
            "column13",
            "column14"
        ],
        "shuffle_strategy": "sort_shuffle_pull_based",
        "aggregate": true,
        "map_groups": false
    }
}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 486dfdfbdaca9fd15634dabfb3a7f5c7806ca706.

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

owenowenisme requested a review from a team as a code owner April 14, 2026 04:28

cursor Bot reviewed Apr 14, 2026

Comment thread python/ray/data/_internal/util.py Outdated

owenowenisme force-pushed the data/skip-null-check-in-find-partition-index branch from 793d4bb to 486dfdf Compare April 14, 2026 04:41

cursor Bot reviewed Apr 14, 2026

Comment thread python/ray/data/_internal/util.py

update

7464f32

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

owenowenisme force-pushed the data/skip-null-check-in-find-partition-index branch from 486dfdf to 7464f32 Compare April 14, 2026 04:51

ray-gardener Bot added the `data` label (Ray Data-related issues) Apr 14, 2026

update

1abde92

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

owenowenisme added the `go` label (add ONLY when ready to merge, run all tests) Apr 14, 2026

handle NaN case and add unit test

2e6691d

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

goutamvenkat-anyscale approved these changes Apr 14, 2026

goutamvenkat-anyscale merged commit 4396278 into ray-project:master Apr 14, 2026
6 checks passed

owenowenisme mentioned this pull request May 20, 2026

[Data] Strip null tail once in find_partition_index #63542

Closed

goutamvenkat-anyscale merged 3 commits into ray-project:master from owenowenisme:data/skip-null-check-in-find-partition-index
