Skip to content

[Data] Consider nans as nulls in preprocessor encoder#62618

Merged
goutamvenkat-anyscale merged 2 commits into
ray-project:masterfrom
ayushk7102:fix_encoder_div
Apr 14, 2026
Merged

[Data] Consider nans as nulls in preprocessor encoder#62618
goutamvenkat-anyscale merged 2 commits into
ray-project:masterfrom
ayushk7102:fix_encoder_div

Conversation

@ayushk7102

@ayushk7102 ayushk7102 commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

Description

Ensure that floating-point NaN values are considered null in encoder.py, by adding adds nan_is_null=True to all pc.is_null() calls in the Arrow path

pc.is_null(arr) on a float column returns False for NaN entries unless nan_is_null=True is passed. This means:

  1. gen_value_index_arrow_from_arrow: A column with NaN values and drop_na_values=False would silently proceed instead of raising the expected ValueError.
  2. _validate_arrow: Columns containing NaN would pass null validation and then produce incorrect or
    unexpected encodings during transform.
Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
@ayushk7102 ayushk7102 requested a review from a team as a code owner April 14, 2026 21:50
@goutamvenkat-anyscale goutamvenkat-anyscale enabled auto-merge (squash) April 14, 2026 21:51
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Apr 14, 2026

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 65b7e12. Configure here.

values = pc.drop_null(values)
else:
if pc.any(pc.is_null(values)).as_py():
if pc.any(pc.is_null(values, nan_is_null=True)).as_py():

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop_null doesn't remove NaN when drop_na_values=True

Medium Severity

When drop_na_values=True, pc.drop_null(values) on the float column only removes Arrow nulls—not NaN entries. Since this PR's goal is to treat NaN as null throughout the Arrow path, the else branch now correctly detects NaN via pc.is_null(values, nan_is_null=True), but the if drop_na_values branch silently leaves NaN values in the array. Those leftover NaNs then flow into sorting and encoding, producing incorrect results.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 65b7e12. Configure here.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the encoder.py preprocessor to treat floating-point NaNs as null values during validation and value index generation using PyArrow. Specifically, it modifies gen_value_index_arrow_from_arrow and _validate_arrow to use nan_is_null=True in null checks. Feedback indicates that the logic for dropping nulls in gen_value_index_arrow_from_arrow should also be updated to include NaNs for consistency, as pc.drop_null does not automatically handle them.

values = pc.drop_null(values)
else:
if pc.any(pc.is_null(values)).as_py():
if pc.any(pc.is_null(values, nan_is_null=True)).as_py():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

While this change correctly identifies NaNs as nulls when drop_na_values is False, the if drop_na_values block at line 1359 still uses pc.drop_null(values), which does not remove floating-point NaNs in Arrow. This will cause NaNs to be treated as valid categories during fitting when drop_na_values=True (e.g., in Categorizer), which is inconsistent with the PR's objective of treating NaNs as nulls.

Consider updating line 1359 to use a filter that accounts for NaNs, such as values.filter(pc.is_valid(values, nan_is_null=True)).

@github-actions github-actions Bot disabled auto-merge April 14, 2026 22:38
@goutamvenkat-anyscale goutamvenkat-anyscale enabled auto-merge (squash) April 14, 2026 22:38
@goutamvenkat-anyscale goutamvenkat-anyscale merged commit b375e76 into ray-project:master Apr 14, 2026
7 checks passed
goutamvenkat-anyscale pushed a commit that referenced this pull request Apr 15, 2026
…ncoder (#62623)

## Description
This is a follow-up PR to #62618
to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of
`encoder.py`. Previously, if `drop_na_values` was True, we were dropping
nulls (and not NaNs). This PR ensures that we drop NaNs as well

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
)

## Description
Ensure that floating-point NaN values are considered null in
`encoder.py`, by adding adds nan_is_null=True to all pc.is_null() calls
in the Arrow path

pc.is_null(arr) on a float column returns False for NaN entries unless
nan_is_null=True is passed. This means:

1. `unique_post_fn`: A column with NaN values and drop_na_values=False
would silently proceed instead of
  raising the expected ValueError.
2. `_validate_arrow`: Columns containing NaN would pass null validation
and then produce incorrect or
  unexpected encodings during transform.

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
…ncoder (ray-project#62623)

## Description
This is a follow-up PR to ray-project#62618
to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of
`encoder.py`. Previously, if `drop_na_values` was True, we were dropping
nulls (and not NaNs). This PR ensures that we drop NaNs as well

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
)

## Description
Ensure that floating-point NaN values are considered null in
`encoder.py`, by adding adds nan_is_null=True to all pc.is_null() calls
in the Arrow path

pc.is_null(arr) on a float column returns False for NaN entries unless
nan_is_null=True is passed. This means:

1. `unique_post_fn`: A column with NaN values and drop_na_values=False
would silently proceed instead of
  raising the expected ValueError.
2. `_validate_arrow`: Columns containing NaN would pass null validation
and then produce incorrect or
  unexpected encodings during transform.

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ncoder (ray-project#62623)

## Description
This is a follow-up PR to ray-project#62618
to ensure consistent behavior in `gen_value_index_arrow_from_arrow` of
`encoder.py`. Previously, if `drop_na_values` was True, we were dropping
nulls (and not NaNs). This PR ensures that we drop NaNs as well

Signed-off-by: Ayush Kumar <ayushk7102@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

2 participants