[Data] Compute Expressions map#58743
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for map/dict operations on expression columns by adding a .map property to Expr and implementing the _MapNamespace with keys() and values() methods. The changes in expressions.py are straightforward. My review focuses on the new map_namespace.py file, where I've identified significant performance improvement opportunities. I'm suggesting replacing the current loop-based UDFs with vectorized PyArrow compute functions (pc.map_keys and pc.map_values), which is crucial for efficient data processing. I've also included suggestions for more precise return type inference in these methods.
There was a problem hiding this comment.
Bug: Structural equality overlooks critical flag.
The AliasExpr.structurally_equals method contains a comparison bug on the last condition: self._is_rename == self._is_rename always evaluates to True. It should compare self._is_rename == other._is_rename to properly check if two AliasExpr instances have the same rename flag, making the structural equality check incomplete.
python/ray/data/expressions.py#L1042-L1047
| # offsets and null-mask are preserved from the parent | ||
| return pyarrow.ListArray.from_arrays( | ||
| offsets=arr.offsets, values=child_array, mask=arr.is_null() | ||
| ) |
There was a problem hiding this comment.
Bug: MapArray lacks offsets attribute in older PyArrow
The code unconditionally accesses arr.offsets on line 85, but MapArray doesn't have an offsets attribute in PyArrow versions before 7.0.0. When Case 1 executes (line 49) for a MapArray in older PyArrow, the code extracts arr.keys or arr.items but then tries to access arr.offsets which doesn't exist. In PyArrow < 7.0.0, MapArray is not a subclass of ListArray and offsets must be accessed through the buffers API instead.
|
cc @goutamvenkat-anyscale |
Signed-off-by: ryankert01 <ryan980053@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Ryan Huang <ryankert01@gmail.com>
…r PyArrow compatibility Signed-off-by: ryankert01 <ryan980053@gmail.com>
Signed-off-by: ryankert01 <ryan980053@gmail.com>
Signed-off-by: ryankert01 <ryan980053@gmail.com>
iamjustinhsu
left a comment
There was a problem hiding this comment.
Thank you for contributing! Left a few comments
|
|
||
| @property | ||
| def map(self) -> "_MapNamespace": | ||
| """Access map/dict operations for this expression.""" |
There was a problem hiding this comment.
Can we add a docstring example like the _StructNamespace?
There was a problem hiding this comment.
Will do! List on todos
| if self._expr.data_type.is_arrow_type(): | ||
| arrow_type = self._expr.data_type.to_arrow_dtype() |
There was a problem hiding this comment.
I'm a little bit confused by this statement. In english, this sounds like: "Convert to arrow type if already arrow type". Did you mean to use not self._expr.dataset_.is_arrow_type()?
There was a problem hiding this comment.
It's right to be confused! Maybe I should do some comments. the first is_arrow_type() means check if this is arrow mem. The second to_arrow_dtype() unwrap the ray wrapper and use arrow wrapper.
| This handles both formal ``MapArray`` types and ``ListArray<Struct>`` | ||
| types (which represent maps physically) by accessing underlying | ||
| child arrays directly. |
There was a problem hiding this comment.
Can you elaborate what this function does?
There was a problem hiding this comment.
I want to implement a no copy, robust covert from ray wrapper -> pyarrow wrapper.
Signed-off-by: Ryan Huang <ryan980053@gmail.com>
Signed-off-by: ryankert01 <ryan980053@gmail.com>
|
|
||
| return pyarrow.ListArray.from_arrays( | ||
| offsets=offsets, values=child_array, mask=arr.is_null() | ||
| ) |
There was a problem hiding this comment.
Bug: Sliced array child_array not sliced with normalized offsets
When handling sliced Arrow arrays with non-zero starting offsets, the code normalizes offsets by subtracting start_offset to make them zero-based, but it does not correspondingly slice child_array. Since arr.keys, arr.items, and arr.values return the full underlying flat arrays (due to Arrow's zero-copy slice semantics), using normalized offsets with the full child_array causes each row to reference the wrong elements. For example, a slice of rows 7-9 would incorrectly return data from rows 0-2 of the original array. The child_array needs to be sliced from start_offset to the original offsets[-1] value to match the normalized offsets.
There was a problem hiding this comment.
Can we verify if this is True with a simple test?
There was a problem hiding this comment.
have test_map_sliced_offsets in the new pr.
iamjustinhsu
left a comment
There was a problem hiding this comment.
lgtm! cc: @goutamvenkat-anyscale
| offsets = arr.offsets | ||
| if len(offsets) > 0: # Handle offsets changes | ||
| start_offset = offsets[0] | ||
| if start_offset.as_py() != 0: |
There was a problem hiding this comment.
to clarify, the start_offset can be > 0 when we recursively go through a chunked array?
There was a problem hiding this comment.
Yeah, because as a chunked array, each chunk is a slice of the array which has a non-zero offset.
Signed-off-by: ryankert01 <ryan980053@gmail.com>
| if child_array is None: | ||
| try: | ||
| inner_type = arr.type.value_type | ||
| map_type = pyarrow.map_(inner_type[0].type, inner_type[1].type) | ||
| return _extract_map_component(arr.cast(map_type), component) | ||
| except (AttributeError, IndexError, pyarrow.ArrowInvalid): | ||
| raise ValueError( | ||
| f"Cannot extract {component} from array of type {arr.type}" | ||
| ) |
There was a problem hiding this comment.
Can you help me understand why we need to recursively cast to a map in non listarray and maparray cases?
Perhaps a concrete example would help.
There was a problem hiding this comment.
@goutamvenkat-anyscale It's an attempt to support unknown mapping, an example is
fixed_list_type = pa.list_(
pa.struct([('key', pa.string()), ('value', pa.int64())]),
list_size=2
)
arr = pa.array([
[{'key': 'a', 'value': 1}, {'key': 'b', 'value': 2}],
[{'key': 'x', 'value': 10}, {'key': 'y', 'value': 20}]
], type=fixed_list_type)feel free to let me know if this handling is necessary.
There was a problem hiding this comment.
I think we can simplify and leave this out for now. If folks have specific workloads that run into this case, we can re-evaluate.
|
Sorry, I re-fork my repo and cause an accidentally close, will open a new one soon. |
## Description
### `MapNamespace` impl.
- Implemented `_extract_map_component` as a robust, vectorized fallback
since native `pc.map_keys` kernels are not standard in PyArrow yet.
- **Support:** Handles both **Logical Maps** (`MapArray`) and **Physical
Maps** (`List<Struct>`).
### Testing
- `test_map_keys` / `test_map_values`: Standard extraction.
- `test_physical_map_extraction`: Verifies support for `List<Struct>`.
- `test_map_sliced_offsets`: Verifies the critical fix for sliced data.
- `test_map_nulls_and_empty`: Verifies handling of `None` and empty maps
`{}`.
- `test_map_chaining`: Verifies composition with List namespace (e.g.,
`.map.keys().list.len()`).
## Related issues
Related to #58674
Continues #58743
## Additional information
test w/
```
python -m pytest -v -s python/ray/data/tests/test_namespace_expressions.py::TestMapNamespace
```
<!-- BUGBOT_STATUS --><sup><a
href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> found 1
potential issue for commit <u>7a11478</u></sup><!-- /BUGBOT_STATUS -->
---------
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Description
MapNamespaceimpl._extract_map_componentas a robust, vectorized fallback since nativepc.map_keyskernels are not standard in PyArrow yet.MapArray) and Physical Maps (List<Struct>).Testing
test_map_keys/test_map_values: Standard extraction.test_physical_map_extraction: Verifies support forList<Struct>.test_map_sliced_offsets: Verifies the critical fix for sliced data.test_map_nulls_and_empty: Verifies handling ofNoneand empty maps{}.test_map_chaining: Verifies composition with List namespace (e.g.,.map.keys().list.len()).Related issues
Related to #58674
TODO
Additional information
test w/