
[Data] Add map namespace support for expression operations #59879

Merged
richardliaw merged 27 commits into ray-project:master from ryankert01:map-expression
Mar 4, 2026
Conversation

@ryankert01

@ryankert01 ryankert01 commented Jan 6, 2026

Member

Description

Implements MapNamespace, which adds keys() and values() on map-typed expression columns.

  • Implemented _extract_map_component as a vectorized fallback, since native pc.map_keys kernels are not yet standard in PyArrow.
  • Supports both logical maps (MapArray) and physical maps (List<Struct>).

Testing

  • test_map_keys / test_map_values: Standard extraction.
  • test_physical_map_extraction: Verifies support for List<Struct>.
  • test_map_sliced_offsets: Verifies correct extraction from sliced arrays with non-zero offsets.
  • test_map_nulls_and_empty: Verifies handling of None and empty maps {}.
  • test_map_chaining: Verifies composition with List namespace (e.g., .map.keys().list.len()).

Related issues

Related to #58674
Continues #58743

Additional information

Test with:

python -m pytest -v -s python/ray/data/tests/test_namespace_expressions.py::TestMapNamespace
Cursor Bugbot found 1 potential issue for commit 7a11478
@ryankert01 ryankert01 requested a review from a team as a code owner January 6, 2026 06:41
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Comment thread python/ray/data/namespace_expressions/map_namespace.py

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for map/dict operations on expression columns by adding a map namespace. The implementation is well-structured, adding a _MapNamespace with keys() and values() methods that work on both logical MapArray and physical List<Struct> representations. The handling of sliced arrays with non-zero offsets is a great detail that ensures correctness. The accompanying tests are thorough, covering various representations, edge cases like nulls and empty maps, and integration with other namespaces.

I've added a couple of suggestions to map_namespace.py to further improve the robustness of the implementation by handling LargeListArray and providing clearer errors for unsupported types. Overall, this is a solid contribution that enhances Ray Data's expression capabilities.

Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
@ray-gardener ray-gardener Bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Jan 6, 2026
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment thread python/ray/data/namespace_expressions/map_namespace.py
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
@ryankert01

Member Author
Comment thread python/ray/data/namespace_expressions/map_namespace.py

@owenowenisme owenowenisme left a comment

Member

Minor fixes, overall LGTM

Comment thread python/ray/data/namespace_expressions/map_namespace.py
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
ryankert01 and others added 3 commits January 22, 2026 19:57
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment on lines +120 to +122
assert list(rows[0]["keys"]) == ["a"] and list(rows[0]["values"]) == [1]
assert len(rows[1]["keys"]) == 0 and len(rows[1]["values"]) == 0
assert rows[2]["keys"] is None and rows[2]["values"] is None

Contributor

Let's use rows_same

@ryankert01 ryankert01 Feb 8, 2026

Member Author

rows_same operates on pandas DataFrames, which can't handle the mixed None/list column during conversion. The to_pandas() path triggers TensorArray casting, which fails on the mixed types. Let's keep the current assertions!

There is a workaround, but it is too complex for the context of this test:

    ctx = ray.data.context.DataContext.get_current()
    ctx.enable_tensor_extension_casting = False
    try:
        result = (
            ds.with_column("keys", col("m").map.keys())
            .with_column("values", col("m").map.values())
            .to_pandas()
        )
        expected = pd.DataFrame(
            {
                "keys": [["a"], [], None],
                "values": [[1], [], None],
            }
        )
        _assert_result(result, expected, drop_cols=["m"])
    finally:
        ctx.enable_tensor_extension_casting = True
Comment thread python/ray/data/tests/expressions/test_namespace_map.py
Comment on lines +78 to +81
if start_offset.as_py() != 0:
    end_offset = offsets[-1].as_py()
    child_array = child_array.slice(
        offset=start_offset.as_py(), length=end_offset - start_offset.as_py()

Contributor

Don't believe you need to call as_py here

@iamjustinhsu iamjustinhsu added the go (add ONLY when ready to merge, run all tests) label Feb 4, 2026
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

Contributor

Please look at open comments. Thanks

Signed-off-by: Ryan Huang <ryankert01@gmail.com>

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

Contributor

lgtm!

Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
)


def _rebuild_list_array(

Contributor

Help me understand why we need to do this?

@ryankert01 ryankert01 Mar 3, 2026

Member Author

PyArrow slices are zero-copy: when we slice a MapArray or ListArray, the child arrays (keys/values) remain unchanged and the offsets still reference positions in the original buffer.

So we have to rebuild the list array with 0-based offsets.

@goutamvenkat-anyscale

Contributor

@ryankert01 There seem to be some open comments. Please address them.

@goutamvenkat-anyscale goutamvenkat-anyscale added the @external-author-action-required (Alternate tag for PRs where the author doesn't have labeling permission.) label Feb 25, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/data/namespace_expressions/map_namespace.py
…rgeListArray

Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
@ryankert01

Member Author
@richardliaw richardliaw merged commit 6ddbbdd into ray-project:master Mar 4, 2026
6 checks passed
@ryankert01 ryankert01 deleted the map-expression branch March 4, 2026 06:45

Labels

  • community-contribution: Contributed by the community
  • data: Ray Data-related issues
  • @external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
  • go: add ONLY when ready to merge, run all tests

6 participants