Skip to content

[Data] Compute Expressions map#58743

Closed
ryankert01 wants to merge 18 commits into
ray-project:masterfrom
ryankert01:feature/expression-map
Closed

[Data] Compute Expressions map#58743
ryankert01 wants to merge 18 commits into
ray-project:masterfrom
ryankert01:feature/expression-map

Conversation

@ryankert01

@ryankert01 ryankert01 commented Nov 18, 2025

Copy link
Copy Markdown
Member

Description

MapNamespace impl.

  • Implemented _extract_map_component as a robust, vectorized fallback since native pc.map_keys kernels are not standard in PyArrow yet.
  • Support: Handles both Logical Maps (MapArray) and Physical Maps (List<Struct>).

Testing

  • test_map_keys / test_map_values: Standard extraction.
  • test_physical_map_extraction: Verifies support for List<Struct>.
  • test_map_sliced_offsets: Verifies the critical fix for sliced data.
  • test_map_nulls_and_empty: Verifies handling of None and empty maps {}.
  • test_map_chaining: Verifies composition with List namespace (e.g., .map.keys().list.len()).

Related issues

Related to #58674

TODO

  • add doc string examples

Additional information

test w/

python -m pytest -v -s python/ray/data/tests/test_namespace_expressions.py::TestMapNamespace
@ryankert01 ryankert01 requested a review from a team as a code owner November 18, 2025 16:08

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for map/dict operations on expression columns by adding a .map property to Expr and implementing the _MapNamespace with keys() and values() methods. The changes in expressions.py are straightforward. My review focuses on the new map_namespace.py file, where I've identified significant performance improvement opportunities. I'm suggesting replacing the current loop-based UDFs with vectorized PyArrow compute functions (pc.map_keys and pc.map_values), which is crucial for efficient data processing. I've also included suggestions for more precise return type inference in these methods.

Comment thread python/ray/data/namespace_expressions/map_namespace.py
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Structural equality overlooks critical flag.

The AliasExpr.structurally_equals method contains a comparison bug on the last condition: self._is_rename == self._is_rename always evaluates to True. It should compare self._is_rename == other._is_rename to properly check if two AliasExpr instances have the same rename flag, making the structural equality check incomplete.

python/ray/data/expressions.py#L1042-L1047

https://github.com/ray-project/ray/blob/bcb228a2b9308fc7972b38caeb44a16254f6febe/python/ray/data/expressions.py#L1042-L1047

Fix in Cursor Fix in Web


Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
@ryankert01 ryankert01 changed the title [Data]Compute Expressions map Nov 18, 2025
@ray-gardener ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 18, 2025
@ryankert01 ryankert01 marked this pull request as draft November 19, 2025 04:46
@ryankert01 ryankert01 marked this pull request as ready for review November 22, 2025 04:02
@ryankert01 ryankert01 marked this pull request as draft November 22, 2025 04:02
@ryankert01 ryankert01 marked this pull request as ready for review November 22, 2025 16:48
@ryankert01 ryankert01 changed the title [WIP] [Data] Compute Expressions map Nov 22, 2025
# offsets and null-mask are preserved from the parent
return pyarrow.ListArray.from_arrays(
offsets=arr.offsets, values=child_array, mask=arr.is_null()
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: MapArray lacks offsets attribute in older PyArrow

The code unconditionally accesses arr.offsets on line 85, but MapArray doesn't have an offsets attribute in PyArrow versions before 7.0.0. When Case 1 executes (line 49) for a MapArray in older PyArrow, the code extracts arr.keys or arr.items but then tries to access arr.offsets which doesn't exist. In PyArrow < 7.0.0, MapArray is not a subclass of ListArray and offsets must be accessed through the buffers API instead.

Fix in Cursor Fix in Web

@ryankert01 ryankert01 marked this pull request as draft November 22, 2025 17:13
@ryankert01 ryankert01 marked this pull request as ready for review November 23, 2025 05:33
@ryankert01

Copy link
Copy Markdown
Member Author

cc @goutamvenkat-anyscale
Want to receive some feedbacks!

ryankert01 and others added 8 commits November 24, 2025 09:19
Signed-off-by: ryankert01 <ryan980053@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
…r PyArrow compatibility

Signed-off-by: ryankert01 <ryan980053@gmail.com>
Signed-off-by: ryankert01 <ryan980053@gmail.com>
Signed-off-by: ryankert01 <ryan980053@gmail.com>
Signed-off-by: ryankert01 <ryan980053@gmail.com>
Comment thread python/ray/data/namespace_expressions/map_namespace.py

@iamjustinhsu iamjustinhsu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for contributing! Left a few comments


@property
def map(self) -> "_MapNamespace":
"""Access map/dict operations for this expression."""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a docstring example like the _StructNamespace?

@ryankert01 ryankert01 Nov 30, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do! List on todos

Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment on lines +137 to +138
if self._expr.data_type.is_arrow_type():
arrow_type = self._expr.data_type.to_arrow_dtype()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit confused by this statement. In english, this sounds like: "Convert to arrow type if already arrow type". Did you mean to use not self._expr.dataset_.is_arrow_type()?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's right to be confused! Maybe I should do some comments. the first is_arrow_type() means check if this is arrow mem. The second to_arrow_dtype() unwrap the ray wrapper and use arrow wrapper.

Comment thread python/ray/data/namespace_expressions/map_namespace.py Outdated
Comment on lines +33 to +35
This handles both formal ``MapArray`` types and ``ListArray<Struct>``
types (which represent maps physically) by accessing underlying
child arrays directly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate what this function does?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to implement a no copy, robust covert from ray wrapper -> pyarrow wrapper.

Signed-off-by: Ryan Huang <ryan980053@gmail.com>
Signed-off-by: ryankert01 <ryan980053@gmail.com>

return pyarrow.ListArray.from_arrays(
offsets=offsets, values=child_array, mask=arr.is_null()
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Sliced array child_array not sliced with normalized offsets

When handling sliced Arrow arrays with non-zero starting offsets, the code normalizes offsets by subtracting start_offset to make them zero-based, but it does not correspondingly slice child_array. Since arr.keys, arr.items, and arr.values return the full underlying flat arrays (due to Arrow's zero-copy slice semantics), using normalized offsets with the full child_array causes each row to reference the wrong elements. For example, a slice of rows 7-9 would incorrectly return data from rows 0-2 of the original array. The child_array needs to be sliced from start_offset to the original offsets[-1] value to match the normalized offsets.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we verify if this is True with a simple test?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have test_map_sliced_offsets in the new pr.

Signed-off-by: ryankert01 <ryan980053@gmail.com>

@iamjustinhsu iamjustinhsu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offsets = arr.offsets
if len(offsets) > 0: # Handle offsets changes
start_offset = offsets[0]
if start_offset.as_py() != 0:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to clarify, the start_offset can be > 0 when we recursively go through a chunked array?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, because as a chunked array, each chunk is a slice of the array which has a non-zero offset.

@owenowenisme owenowenisme added the go add ONLY when ready to merge, run all tests label Dec 11, 2025
Comment on lines +57 to +65
if child_array is None:
try:
inner_type = arr.type.value_type
map_type = pyarrow.map_(inner_type[0].type, inner_type[1].type)
return _extract_map_component(arr.cast(map_type), component)
except (AttributeError, IndexError, pyarrow.ArrowInvalid):
raise ValueError(
f"Cannot extract {component} from array of type {arr.type}"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand why we need to recursively cast to a map in non listarray and maparray cases?

Perhaps a concrete example would help.

@ryankert01 ryankert01 Dec 19, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@goutamvenkat-anyscale It's an attempt to support unknown mapping, an example is

fixed_list_type = pa.list_(
    pa.struct([('key', pa.string()), ('value', pa.int64())]),
    list_size=2 
)

arr = pa.array([
    [{'key': 'a', 'value': 1}, {'key':  'b', 'value':  2}],
    [{'key': 'x', 'value': 10}, {'key': 'y', 'value': 20}]
], type=fixed_list_type)

feel free to let me know if this handling is necessary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify and leave this out for now. If folks have specific workloads that run into this case, we can re-evaluate.

@ryankert01 ryankert01 closed this by deleting the head repository Dec 30, 2025
@ryankert01

Copy link
Copy Markdown
Member Author

Sorry, I re-fork my repo and cause an accidentally close, will open a new one soon.

richardliaw pushed a commit that referenced this pull request Mar 4, 2026
## Description
### `MapNamespace` impl.

- Implemented `_extract_map_component` as a robust, vectorized fallback
since native `pc.map_keys` kernels are not standard in PyArrow yet.
- **Support:** Handles both **Logical Maps** (`MapArray`) and **Physical
Maps** (`List<Struct>`).

### Testing

- `test_map_keys` / `test_map_values`: Standard extraction.
- `test_physical_map_extraction`: Verifies support for `List<Struct>`.
- `test_map_sliced_offsets`: Verifies the critical fix for sliced data.
- `test_map_nulls_and_empty`: Verifies handling of `None` and empty maps
`{}`.
- `test_map_chaining`: Verifies composition with List namespace (e.g.,
`.map.keys().list.len()`).

## Related issues
Related to #58674
Continues #58743


## Additional information

test w/
```
python -m pytest -v -s python/ray/data/tests/test_namespace_expressions.py::TestMapNamespace
```

<!-- BUGBOT_STATUS --><sup><a
href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> found 1
potential issue for commit <u>7a11478</u></sup><!-- /BUGBOT_STATUS -->

---------

Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Signed-off-by: Ryan Huang <ryankert01@gmail.com>
Signed-off-by: Hsien-Cheng Huang <hcr@apache.org>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Goutam <goutam@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

5 participants