[Data] Consider nans as nulls in preprocessor encoder#62618
Conversation
Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 65b7e12. Configure here.
| values = pc.drop_null(values) | ||
| else: | ||
| if pc.any(pc.is_null(values)).as_py(): | ||
| if pc.any(pc.is_null(values, nan_is_null=True)).as_py(): |
There was a problem hiding this comment.
drop_null doesn't remove NaN when drop_na_values=True
Medium Severity
When drop_na_values=True, pc.drop_null(values) on the float column only removes Arrow nulls—not NaN entries. Since this PR's goal is to treat NaN as null throughout the Arrow path, the else branch now correctly detects NaN via pc.is_null(values, nan_is_null=True), but the if drop_na_values branch silently leaves NaN values in the array. Those leftover NaNs then flow into sorting and encoding, producing incorrect results.
Reviewed by Cursor Bugbot for commit 65b7e12. Configure here.
There was a problem hiding this comment.
Code Review
This pull request updates the encoder.py preprocessor to treat floating-point NaNs as null values during validation and value index generation using PyArrow. Specifically, it modifies gen_value_index_arrow_from_arrow and _validate_arrow to use nan_is_null=True in null checks. Feedback indicates that the logic for dropping nulls in gen_value_index_arrow_from_arrow should also be updated to include NaNs for consistency, as pc.drop_null does not automatically handle them.
| values = pc.drop_null(values) | ||
| else: | ||
| if pc.any(pc.is_null(values)).as_py(): | ||
| if pc.any(pc.is_null(values, nan_is_null=True)).as_py(): |
There was a problem hiding this comment.
While this change correctly identifies NaNs as nulls when drop_na_values is False, the if drop_na_values block at line 1359 still uses pc.drop_null(values), which does not remove floating-point NaNs in Arrow. This will cause NaNs to be treated as valid categories during fitting when drop_na_values=True (e.g., in Categorizer), which is inconsistent with the PR's objective of treating NaNs as nulls.
Consider updating line 1359 to use a filter that accounts for NaNs, such as values.filter(pc.is_valid(values, nan_is_null=True)).
…ncoder (#62623) ## Description This is a follow-up PR to #62618 to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of `encoder.py`. Previously, if `drop_na_values` was True, we were dropping nulls (and not NaNs). This PR ensures that we drop NaNs as well Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
) ## Description Ensure that floating-point NaN values are considered null in `encoder.py`, by adding adds nan_is_null=True to all pc.is_null() calls in the Arrow path pc.is_null(arr) on a float column returns False for NaN entries unless nan_is_null=True is passed. This means: 1. `unique_post_fn`: A column with NaN values and drop_na_values=False would silently proceed instead of raising the expected ValueError. 2. `_validate_arrow`: Columns containing NaN would pass null validation and then produce incorrect or unexpected encodings during transform. Signed-off-by: Ayush Kumar <ayushk7102@gmail.com> Co-authored-by: Goutam <goutam@anyscale.com>
…ncoder (ray-project#62623) ## Description This is a follow-up PR to ray-project#62618 to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of `encoder.py`. Previously, if `drop_na_values` was True, we were dropping nulls (and not NaNs). This PR ensures that we drop NaNs as well Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
) ## Description Ensure that floating-point NaN values are considered null in `encoder.py`, by adding adds nan_is_null=True to all pc.is_null() calls in the Arrow path pc.is_null(arr) on a float column returns False for NaN entries unless nan_is_null=True is passed. This means: 1. `unique_post_fn`: A column with NaN values and drop_na_values=False would silently proceed instead of raising the expected ValueError. 2. `_validate_arrow`: Columns containing NaN would pass null validation and then produce incorrect or unexpected encodings during transform. Signed-off-by: Ayush Kumar <ayushk7102@gmail.com> Co-authored-by: Goutam <goutam@anyscale.com>
…ncoder (ray-project#62623) ## Description This is a follow-up PR to ray-project#62618 to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of `encoder.py`. Previously, if `drop_na_values` was True, we were dropping nulls (and not NaNs). This PR ensures that we drop NaNs as well Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>


Description
Ensure that floating-point NaN values are considered null in
encoder.py, by adding adds nan_is_null=True to all pc.is_null() calls in the Arrow pathpc.is_null(arr) on a float column returns False for NaN entries unless nan_is_null=True is passed. This means:
gen_value_index_arrow_from_arrow: A column with NaN values and drop_na_values=False would silently proceed instead of raising the expected ValueError._validate_arrow: Columns containing NaN would pass null validation and then produce incorrect orunexpected encodings during transform.