[data] fix bug that _align_struct_fields fails if there are unaligned scalar…#58364
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses a bug in _align_struct_fields that caused a failure when processing blocks with missing scalar fields alongside struct fields. The fix correctly replaces an overly strict assertion with logic to pad missing columns with nulls. This change is correct, well-targeted, and resolves the issue described. The implementation is clean and follows standard practices for handling missing data in pyarrow tables.
4857c23 to
cf1d6b3
Compare
|
Thanks @947132885 for your contribution. Can you please add some tests for this change? |
8375796 to
c332403
Compare
|
ok, I added a simple test |
| null_array = pa.nulls(len(block), type=schema.field(column_name).type) | ||
| new_columns.append(null_array) |
There was a problem hiding this comment.
Can you please use BlockAccessor abstraction for this operation?
There was a problem hiding this comment.
Actually can we reuse _backfill_missing_fields for this?
| @@ -552,5 +552,36 @@ def test_arrow_block_max_chunk_size(table_data, max_chunk_size_bytes, expected): | |||
| assert _get_max_chunk_size(table, max_chunk_size_bytes) == expected | |||
|
|
|||
|
|
|||
| def test_arrow_block_concat(): | |||
| def gen_block(table): | |||
There was a problem hiding this comment.
Instead of an E2E test, can we create a unit test for _align_struct_fields for this case?
d8f2a26 to
eb0f84f
Compare
|
The actual issue is that the non-struct columns were not aligned before entering _align_struct_fields, and _align_struct_fields only aligns struct columns. So a better fix is to perform the alignment before calling _align_struct_fields, rather than doing it inside _align_struct_fields itself. I’ve updated this logic in the latest commit, and also replaced the e2e test with a unit test for concat. |
8c2f23c to
072fb22
Compare
b4c3bc0 to
7370b11
Compare
|
@goutamvenkat-anyscale could you take another pass on this? |
7370b11 to
a26392b
Compare
goutamvenkat-anyscale
left a comment
There was a problem hiding this comment.
Please address the remaining comment. Thanks
4d9f7e8 to
e9a8e80
Compare
…end nulls for missing fields, then reorder before struct alignment Signed-off-by: Wang Jingxin <947132885@qq.com>
…ch is much faster Signed-off-by: Wang Jingxin <947132885@qq.com>
…elds Signed-off-by: Wang Jingxin <947132885@qq.com>
…column alignment Signed-off-by: Wang Jingxin <947132885@qq.com>
Signed-off-by: Wang Jingxin <947132885@qq.com>
04df156 to
7b03095
Compare
|
|
||
| # Append missing columns with null values | ||
| # This avoids rebuilding the whole table and only adds what's needed | ||
| if missing_fields: |
There was a problem hiding this comment.
No need for this check, just do the loop
There was a problem hiding this comment.
Also can you help me understand what you're trying to achieve moving adding missing columns in here?
There was a problem hiding this comment.
You're right, I'll update it.
For the missing columns, the _align_struct_fields function is used to add missing fields to struct columns. Since it also takes tables as parameters, it makes sense for it to add missing columns as well. The function name might not be very clear, though.
Also, the logic for filling back missing columns in the concat function is not correct, because it requires a chunked array instead of a plain array.
| else: | ||
| col_chunked_arrays.append(pa.nulls(block.num_rows, type=col_type)) | ||
| # _align_struct_fields guarantees the existence of the column | ||
| col_chunked_arrays = [block.column(col_name) for block in blocks] |
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Signed-off-by: Wang Jingxin <947132885@qq.com>
Signed-off-by: 947132885 <947132885@qq.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: 947132885 <947132885@qq.com>
Signed-off-by: Wang Jingxin <947132885@qq.com>
… scalar… (ray-project#58364) ## Description ray fails to merge blocks when there are both unaligned scalar and struct fields. ## Minimal code to reproduce: ``` import pyarrow as pa from ray.data._internal.arrow_ops.transform_pyarrow import concat table1 = pa.table({ "a": [1, 2, 3], "s": [ {"x": 7,}, {"x": 8,}, {"x": 9,}, ], }) table2 = pa.table({ "b": [4, 5, 6], }) concat([table1, table2]) ``` ## Fix: create null array in _align_struct_fields instead of assertion --------- Signed-off-by: Wang Jingxin <947132885@qq.com> Signed-off-by: 947132885 <947132885@qq.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
… scalar… (ray-project#58364) ## Description ray fails to merge blocks when there are both unaligned scalar and struct fields. ## Minimal code to reproduce: ``` import pyarrow as pa from ray.data._internal.arrow_ops.transform_pyarrow import concat table1 = pa.table({ "a": [1, 2, 3], "s": [ {"x": 7,}, {"x": 8,}, {"x": 9,}, ], }) table2 = pa.table({ "b": [4, 5, 6], }) concat([table1, table2]) ``` ## Fix: create null array in _align_struct_fields instead of assertion --------- Signed-off-by: Wang Jingxin <947132885@qq.com> Signed-off-by: 947132885 <947132885@qq.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Description
ray fails to merge blocks when there are both unaligned scalar and struct fields.
Minimal code to reproduce:
Fix:
create null array in _align_struct_fields instead of assertion