Skip to content

[data] fix bug that _align_struct_fields fails if there are unaligned scalar…#58364

Merged
alexeykudinkin merged 13 commits into
ray-project:masterfrom
947132885:data_concat_bugfix
Feb 18, 2026
Merged

[data] fix bug that _align_struct_fields fails if there are unaligned scalar…#58364
alexeykudinkin merged 13 commits into
ray-project:masterfrom
947132885:data_concat_bugfix

Conversation

@947132885

@947132885 947132885 commented Nov 3, 2025

Copy link
Copy Markdown
Contributor

Description

ray fails to merge blocks when there are both unaligned scalar and struct fields.

Minimal code to reproduce:

import pyarrow as pa
from ray.data._internal.arrow_ops.transform_pyarrow import concat

table1 = pa.table({
    "a": [1, 2, 3],
    "s": [
        {"x": 7,},
        {"x": 8,},
        {"x": 9,},
    ],
})

table2 = pa.table({
    "b": [4, 5, 6],
})
concat([table1, table2])

Fix:

create null array in _align_struct_fields instead of assertion

@947132885 947132885 requested a review from a team as a code owner November 3, 2025 02:51

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a bug in _align_struct_fields that caused a failure when processing blocks with missing scalar fields alongside struct fields. The fix correctly replaces an overly strict assertion with logic to pad missing columns with nulls. This change is correct, well-targeted, and resolves the issue described. The implementation is clean and follows standard practices for handling missing data in pyarrow tables.

@ray-gardener ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 3, 2025
@goutamvenkat-anyscale

Copy link
Copy Markdown
Contributor

Thanks @947132885 for your contribution. Can you please add some tests for this change?

@947132885 947132885 force-pushed the data_concat_bugfix branch 2 times, most recently from 8375796 to c332403 Compare November 3, 2025 11:58
@947132885

Copy link
Copy Markdown
Contributor Author

ok, I added a simple test

Comment on lines +568 to +569
null_array = pa.nulls(len(block), type=schema.field(column_name).type)
new_columns.append(null_array)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please use BlockAccessor abstraction for this operation?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually can we reuse _backfill_missing_fields for this?

@@ -552,5 +552,36 @@ def test_arrow_block_max_chunk_size(table_data, max_chunk_size_bytes, expected):
assert _get_max_chunk_size(table, max_chunk_size_bytes) == expected


def test_arrow_block_concat():
def gen_block(table):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of an E2E test, can we create a unit test for _align_struct_fields for this case?

@gvspraveen gvspraveen added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Nov 21, 2025
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated
@947132885

Copy link
Copy Markdown
Contributor Author

The actual issue is that the non-struct columns were not aligned before entering _align_struct_fields, and _align_struct_fields only aligns struct columns. So a better fix is to perform the alignment before calling _align_struct_fields, rather than doing it inside _align_struct_fields itself. I’ve updated this logic in the latest commit, and also replaced the e2e test with a unit test for concat.

Comment thread python/ray/data/tests/test_arrow_block.py Outdated
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated
@947132885 947132885 force-pushed the data_concat_bugfix branch 3 times, most recently from b4c3bc0 to 7370b11 Compare November 22, 2025 17:09
@bveeramani

Copy link
Copy Markdown
Member

@goutamvenkat-anyscale could you take another pass on this?

Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated
@bveeramani bveeramani removed their assignment Nov 27, 2025
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the remaining comment. Thanks

…end nulls for missing fields, then reorder before struct alignment

Signed-off-by: Wang Jingxin <947132885@qq.com>
…ch is much faster

Signed-off-by: Wang Jingxin <947132885@qq.com>
…elds

Signed-off-by: Wang Jingxin <947132885@qq.com>
…column alignment

Signed-off-by: Wang Jingxin <947132885@qq.com>
Signed-off-by: Wang Jingxin <947132885@qq.com>
@github-actions github-actions Bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Dec 19, 2025
@gvspraveen gvspraveen removed @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. labels Jan 7, 2026
Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated

# Append missing columns with null values
# This avoids rebuilding the whole table and only adds what's needed
if missing_fields:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this check, just do the loop

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can you help me understand what you're trying to achieve moving adding missing columns in here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I'll update it.

For the missing columns, the _align_struct_fields function is used to add missing fields to struct columns. Since it also takes tables as parameters, it makes sense for it to add missing columns as well. The function name might not be very clear, though.

Also, the logic for filling back missing columns in the concat function is not correct, because it requires a chunked array instead of a plain array.

else:
col_chunked_arrays.append(pa.nulls(block.num_rows, type=col_type))
# _align_struct_fields guarantees the existence of the column
col_chunked_arrays = [block.column(col_name) for block in blocks]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@947132885 that's a good catch!

@github-actions

Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 22, 2026
947132885 and others added 3 commits February 3, 2026 12:13
Signed-off-by: Wang Jingxin <947132885@qq.com>
Signed-off-by: 947132885 <947132885@qq.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: 947132885 <947132885@qq.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated
Signed-off-by: Wang Jingxin <947132885@qq.com>
@github-actions github-actions Bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Feb 3, 2026
@alexeykudinkin alexeykudinkin merged commit 1bd59fe into ray-project:master Feb 18, 2026
6 checks passed
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
… scalar… (ray-project#58364)

## Description
ray fails to merge blocks when there are both unaligned scalar and
struct fields.

## Minimal code to reproduce:

```
import pyarrow as pa
from ray.data._internal.arrow_ops.transform_pyarrow import concat

table1 = pa.table({
    "a": [1, 2, 3],
    "s": [
        {"x": 7,},
        {"x": 8,},
        {"x": 9,},
    ],
})

table2 = pa.table({
    "b": [4, 5, 6],
})
concat([table1, table2])
```

## Fix:
create null array in _align_struct_fields instead of assertion

---------

Signed-off-by: Wang Jingxin <947132885@qq.com>
Signed-off-by: 947132885 <947132885@qq.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
… scalar… (ray-project#58364)

## Description
ray fails to merge blocks when there are both unaligned scalar and
struct fields.

## Minimal code to reproduce:

```
import pyarrow as pa
from ray.data._internal.arrow_ops.transform_pyarrow import concat

table1 = pa.table({
    "a": [1, 2, 3],
    "s": [
        {"x": 7,},
        {"x": 8,},
        {"x": 9,},
    ],
})

table2 = pa.table({
    "b": [4, 5, 6],
})
concat([table1, table2])
```

## Fix:
create null array in _align_struct_fields instead of assertion

---------

Signed-off-by: Wang Jingxin <947132885@qq.com>
Signed-off-by: 947132885 <947132885@qq.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.

6 participants