[Data] Consider nans as nulls in preprocessor encoder by ayushk7102 · Pull Request #62618 · ray-project/ray

ayushk7102 · 2026-04-14T21:50:03Z

Description

Ensure that floating-point NaN values are considered null in encoder.py, by adding adds nan_is_null=True to all pc.is_null() calls in the Arrow path

pc.is_null(arr) on a float column returns False for NaN entries unless nan_is_null=True is passed. This means:

gen_value_index_arrow_from_arrow: A column with NaN values and drop_na_values=False would silently proceed instead of raising the expected ValueError.
_validate_arrow: Columns containing NaN would pass null validation and then produce incorrect or
unexpected encodings during transform.

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 65b7e12. Configure here.}

cursor · 2026-04-14T21:51:42Z

            values = pc.drop_null(values)
        else:
-            if pc.any(pc.is_null(values)).as_py():
+            if pc.any(pc.is_null(values, nan_is_null=True)).as_py():


drop_null doesn't remove NaN when drop_na_values=True

Medium Severity

When drop_na_values=True, pc.drop_null(values) on the float column only removes Arrow nulls—not NaN entries. Since this PR's goal is to treat NaN as null throughout the Arrow path, the else branch now correctly detects NaN via pc.is_null(values, nan_is_null=True), but the if drop_na_values branch silently leaves NaN values in the array. Those leftover NaNs then flow into sorting and encoding, producing incorrect results.

^{Reviewed by Cursor Bugbot for commit 65b7e12. Configure here.}

gemini-code-assist

Code Review

This pull request updates the encoder.py preprocessor to treat floating-point NaNs as null values during validation and value index generation using PyArrow. Specifically, it modifies gen_value_index_arrow_from_arrow and _validate_arrow to use nan_is_null=True in null checks. Feedback indicates that the logic for dropping nulls in gen_value_index_arrow_from_arrow should also be updated to include NaNs for consistency, as pc.drop_null does not automatically handle them.

gemini-code-assist · 2026-04-14T21:54:35Z

            values = pc.drop_null(values)
        else:
-            if pc.any(pc.is_null(values)).as_py():
+            if pc.any(pc.is_null(values, nan_is_null=True)).as_py():


While this change correctly identifies NaNs as nulls when drop_na_values is False, the if drop_na_values block at line 1359 still uses pc.drop_null(values), which does not remove floating-point NaNs in Arrow. This will cause NaNs to be treated as valid categories during fitting when drop_na_values=True (e.g., in Categorizer), which is inconsistent with the PR's objective of treating NaNs as nulls.

Consider updating line 1359 to use a filter that accounts for NaNs, such as values.filter(pc.is_valid(values, nan_is_null=True)).

…ncoder (#62623) ## Description This is a follow-up PR to #62618 to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of `encoder.py`. Previously, if `drop_na_values` was True, we were dropping nulls (and not NaNs). This PR ensures that we drop NaNs as well Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>

) ## Description Ensure that floating-point NaN values are considered null in `encoder.py`, by adding adds nan_is_null=True to all pc.is_null() calls in the Arrow path pc.is_null(arr) on a float column returns False for NaN entries unless nan_is_null=True is passed. This means: 1. `unique_post_fn`: A column with NaN values and drop_na_values=False would silently proceed instead of raising the expected ValueError. 2. `_validate_arrow`: Columns containing NaN would pass null validation and then produce incorrect or unexpected encodings during transform. Signed-off-by: Ayush Kumar <ayushk7102@gmail.com> Co-authored-by: Goutam <goutam@anyscale.com>

…ncoder (ray-project#62623) ## Description This is a follow-up PR to ray-project#62618 to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of `encoder.py`. Previously, if `drop_na_values` was True, we were dropping nulls (and not NaNs). This PR ensures that we drop NaNs as well Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>

) ## Description Ensure that floating-point NaN values are considered null in `encoder.py`, by adding adds nan_is_null=True to all pc.is_null() calls in the Arrow path pc.is_null(arr) on a float column returns False for NaN entries unless nan_is_null=True is passed. This means: 1. `unique_post_fn`: A column with NaN values and drop_na_values=False would silently proceed instead of raising the expected ValueError. 2. `_validate_arrow`: Columns containing NaN would pass null validation and then produce incorrect or unexpected encodings during transform. Signed-off-by: Ayush Kumar <ayushk7102@gmail.com> Co-authored-by: Goutam <goutam@anyscale.com>

…ncoder (ray-project#62623) ## Description This is a follow-up PR to ray-project#62618 to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of `encoder.py`. Previously, if `drop_na_values` was True, we were dropping nulls (and not NaNs). This PR ensures that we drop NaNs as well Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>

[Data] Consider nans as nulls in preprocessor encoder

65b7e12

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>

ayushk7102 requested a review from a team as a code owner April 14, 2026 21:50

goutamvenkat-anyscale approved these changes Apr 14, 2026

View reviewed changes

goutamvenkat-anyscale enabled auto-merge (squash) April 14, 2026 21:51

github-actions Bot added the go add ONLY when ready to merge, run all tests label Apr 14, 2026

cursor Bot reviewed Apr 14, 2026

View reviewed changes

gemini-code-assist Bot reviewed Apr 14, 2026

View reviewed changes

Merge branch 'master' into fix_encoder_div

199cb62

github-actions Bot disabled auto-merge April 14, 2026 22:38

goutamvenkat-anyscale enabled auto-merge (squash) April 14, 2026 22:38

goutamvenkat-anyscale merged commit b375e76 into ray-project:master Apr 14, 2026
7 checks passed

ayushk7102 mentioned this pull request Apr 15, 2026

[Data] Ensure consistent nan_is_null semantics in unique_post_fn of encoder #62623

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Data] Consider nans as nulls in preprocessor encoder#62618

[Data] Consider nans as nulls in preprocessor encoder#62618
goutamvenkat-anyscale merged 2 commits into
ray-project:masterfrom
ayushk7102:fix_encoder_div

ayushk7102 commented Apr 14, 2026 •

edited

Loading

cursor Bot left a comment

cursor Bot Apr 14, 2026

gemini-code-assist Bot left a comment

gemini-code-assist Bot Apr 14, 2026

Uh oh!

Labels

2 participants

Uh oh!

Conversation

ayushk7102 commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

cursor Bot left a comment

Choose a reason for hiding this comment

cursor Bot Apr 14, 2026

Choose a reason for hiding this comment

drop_null doesn't remove NaN when drop_na_values=True

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

gemini-code-assist Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Labels

2 participants

ayushk7102 commented Apr 14, 2026 •

edited

Loading

`drop_null` doesn't remove NaN when `drop_na_values=True`