[Data] Add map namespace support for expression operations#59879
Conversation
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
694b035 to
1bd4269
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces support for map/dict operations on expression columns by adding a map namespace. The implementation is well-structured, adding a _MapNamespace with keys() and values() methods that work on both logical MapArray and physical List<Struct> representations. The handling of sliced arrays with non-zero offsets is a great detail that ensures correctness. The accompanying tests are thorough, covering various representations, edge cases like nulls and empty maps, and integration with other namespaces.
I've added a couple of suggestions to map_namespace.py to further improve the robustness of the implementation by handling LargeListArray and providing clearer errors for unsupported types. Overall, this is a solid contribution that enhances Ray Data's expression capabilities.
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
| assert list(rows[0]["keys"]) == ["a"] and list(rows[0]["values"]) == [1] | ||
| assert len(rows[1]["keys"]) == 0 and len(rows[1]["values"]) == 0 | ||
| assert rows[2]["keys"] is None and rows[2]["values"] is None |
There was a problem hiding this comment.
Let's use rows_same
There was a problem hiding this comment.
row_same operates on pandas that can't handle the mixed None/list column when converting. The to_pandas() path triggers TensorArray casting which fails on the mixed types. Let's keep it!
Although there's workaround, but is too complex for the context of this test:
ctx = ray.data.context.DataContext.get_current()
ctx.enable_tensor_extension_casting = False
try:
result = (
ds.with_column("keys", col("m").map.keys())
.with_column("values", col("m").map.values())
.to_pandas()
)
expected = pd.DataFrame(
{
"keys": [["a"], [], None],
"values": [[1], [], None],
}
)
_assert_result(result, expected, drop_cols=["m"])
finally:
ctx.enable_tensor_extension_casting = True| if start_offset.as_py() != 0: | ||
| end_offset = offsets[-1].as_py() | ||
| child_array = child_array.slice( | ||
| offset=start_offset.as_py(), length=end_offset - start_offset.as_py() |
There was a problem hiding this comment.
Don't believe you need to call as_py here
goutamvenkat-anyscale
left a comment
There was a problem hiding this comment.
Please look at open comments. Thanks
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
| ) | ||
|
|
||
|
|
||
| def _rebuild_list_array( |
There was a problem hiding this comment.
Help me understand why we need to do this?
There was a problem hiding this comment.
It's because when we slice a MapArray or ListArray, the child arrays (keys/values) remain unchanged and offsets still reference positions in the original buffer. (zero-copy slicing by pyArrow)
-> we have to re-build it to 0-based.
|
@ryankert01 There seems to be some open comments. Please address those |
…pyarrow.Array] Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
2e376db to
5df9b8f
Compare
…rgeListArray Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
Description
MapNamespaceimpl._extract_map_componentas a robust, vectorized fallback since nativepc.map_keyskernels are not standard in PyArrow yet.MapArray) and Physical Maps (List<Struct>).Testing
test_map_keys/test_map_values: Standard extraction.test_physical_map_extraction: Verifies support forList<Struct>.test_map_sliced_offsets: Verifies the critical fix for sliced data.test_map_nulls_and_empty: Verifies handling ofNoneand empty maps{}.test_map_chaining: Verifies composition with List namespace (e.g.,.map.keys().list.len()).Related issues
Related to #58674
Continues #58743
Additional information
test w/