[core] Avoid extra memcpy when spilling fused objects by ShockYoungCHN · Pull Request #63653 · ray-project/ray

ShockYoungCHN · 2026-05-27T00:16:45Z

Description

ExternalStorage._write_multiple_objects builds the spill payload as header + memoryview(buf) and then calls f.write(payload). The + concatenation allocates a new bytes object and copies the whole object buffer into it before writing — an avoidable Python-side memcpy.

This PR writes the header and the buffer as two separate writes instead:

f.write(header)
if buf_len:
    f.write(memoryview(buf))

f.write() is a buffer protocol consumer, so f.write(memoryview(buf)) reads the buffer's memory directly with no copy; + is not a consumer — it must materialize a new concatenated object. The on-disk layout is byte-identical, so readers and offset accounting are unchanged.

Related issues

N/A

Additional information

We instrument one spill and split it into three segments, derived from three raw measurements:

T_python — total duration of the Python spill_objects call in the IO worker (the whole external_storage.spill_objects). Measured by wrapping the delegated call with perf_counter_ns: external_storage.py L673-L676.
T_io — time inside f.write / f.flush within that call (actual disk writes). Accumulated around the two writes (external_storage.py L207-L211) and the flush (L224-L226).
T_total — measured in the raylet C++ with steady_clock around the SpillObjects RPC, recorded in the reply callback, so it covers the full spill including the RPC and T_python: start at local_object_manager.cc L377, recorded at L388-L392.

From these:

I/O = T_io — the disk writes.
Python logic = T_python − T_io — the rest of the Python work: building each object's header, the payload concatenation this PR removes, and URL/offset bookkeeping. (Objects are already serialized in plasma, so no serialization happens here.)
gRPC + GIL = T_total − T_python — the raylet→IO-worker RPC round trip plus GIL acquisition, i.e. everything outside Python.

Durations are measured within each process (steady_clock in C++, perf_counter_ns in Python) and differenced, so no cross-process clock alignment is needed.

Profiling the filesystem spill path (object size swept 64 KB–128 MB), the concatenation copy scales with object size and rivals disk I/O for large objects. After the fix the Python-side time is flat and large-object spill is ~2× faster:

size	T_total before → after
32MB	38,911 → 21,300 µs (−45%)
64MB	73,299 → 38,568 µs (−47%)
128MB	525,784 → 256,012 µs (−51%)

Small objects (gRPC+GIL dominated) are unchanged, as expected.

Before / after (% of spill time: I/O vs gRPC+GIL vs Python logic):

Benchmark harness and raw data are on the reform-spill branch under spill-bench/.

PR history

The single concatenated write was not a performance choice and is safe to undo:

[Object Spilling] Fusion small objects #12087 (fusion, 2020) originally wrote the header and buffer as separate f.write calls — the same pattern this PR restores.
[core] Take care of object spilling failure #14703 This PR merged them into one f.write(payload) so it could wrap the write in try/except IOError and use the single write's return value for the recorded size.
Fix flaky test_plasma_unlimited::test_fallback_allocation_failure #17016 removed that try/except handling. Since then the code is just written_bytes = f.write(payload); assert written_bytes == payload_len, so the only thing the concatenation still does is force the avoidable copy.

_write_multiple_objects concatenated the header and the object buffer into one bytes object before f.write(), which memcpy'd the entire payload. For large objects this Python-side copy rivaled the actual disk I/O (~273ms vs ~251ms of pylogic vs io at 128MB). Write the small fixed header and the buffer as two separate writes instead; the on-disk layout is byte-identical so readers and offset accounting are unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com>

gemini-code-assist

Code Review

This pull request optimizes the _write_multiple_objects function in python/ray/_private/external_storage.py by splitting the write operation into two separate writes: one for the header and one for the object buffer. This avoids an expensive memory copy of the entire object payload that would occur when concatenating them. There are no review comments to address, and I have no additional feedback to provide.

Kunchd

Looks good. Thanks for the insightful PR description! I left a small nit.

Kunchd · 2026-05-27T20:14:05Z

-            payload = (
+            # Write the small fixed header (lengths + owner address + metadata)
+            # and the object buffer as two separate writes. Concatenating them
+            # into one bytes object would memcpy the entire object payload.


nit: The comment here is unnecessary. Comments should only clarify or justify the code for future readers. Future readers wouldn't see the concatenation option, and likely wouldn't question why we don't concatenate here.

However, the comment is helpful for code review. I typically just leave a comment in the PR for these clarifications.

Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>

) ## Description `ExternalStorage._write_multiple_objects` builds the spill payload as `header + memoryview(buf)` and then calls `f.write(payload)`. The `+` concatenation allocates a new `bytes` object and copies the whole object buffer into it before writing — an avoidable Python-side memcpy. This PR writes the header and the buffer as two separate writes instead: ```python f.write(header) if buf_len: f.write(memoryview(buf)) ``` `f.write()` is a [buffer protocol][bp] consumer, so `f.write(memoryview(buf))` reads the buffer's memory directly with no copy; `+` is not a consumer — it must materialize a new concatenated object. The on-disk layout is byte-identical, so readers and offset accounting are unchanged. [bp]: https://docs.python.org/3/c-api/buffer.html ## Related issues N/A ## Additional information We instrument one spill and split it into three segments, derived from three raw measurements: - `T_python` — total duration of the Python `spill_objects` call in the IO worker (the whole `external_storage.spill_objects`). Measured by wrapping the delegated call with `perf_counter_ns`: [`external_storage.py` L673-L676][tp]. - `T_io` — time inside `f.write` / `f.flush` within that call (actual disk writes). Accumulated around the two writes ([`external_storage.py` L207-L211][tio-w]) and the flush ([L224-L226][tio-f]). - `T_total` — measured in the raylet C++ with `steady_clock` around the `SpillObjects` RPC, recorded in the reply callback, so it covers the full spill including the RPC and `T_python`: start at [`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1]. [tp]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676 [tio-w]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211 [tio-f]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226 [tt0]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377 [tt1]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392 From these: - **I/O** = `T_io` — the disk writes. - **Python logic** = `T_python − T_io` — the rest of the Python work: building each object's header, the payload concatenation this PR removes, and URL/offset bookkeeping. (Objects are already serialized in plasma, so no serialization happens here.) - **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round trip plus GIL acquisition, i.e. everything outside Python. Durations are measured within each process (`steady_clock` in C++, `perf_counter_ns` in Python) and differenced, so no cross-process clock alignment is needed. Profiling the filesystem spill path (object size swept 64 KB–128 MB), the concatenation copy scales with object size and rivals disk I/O for large objects. After the fix the Python-side time is flat and large-object spill is ~2× faster: | size | T_total before → after | |-------|---------------------------------| | 32MB | 38,911 → 21,300 µs (−45%) | | 64MB | 73,299 → 38,568 µs (−47%) | | 128MB | 525,784 → 256,012 µs (−51%) | Small objects (gRPC+GIL dominated) are unchanged, as expected. **Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic): ![before](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_before.png) ![after](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_after.png) Benchmark harness and raw data are on the `reform-spill` branch under `spill-bench/`. ### PR history The single concatenated write was not a performance choice and is safe to undo: - **ray-project#12087** (fusion, 2020) originally wrote the header and buffer as *separate* `f.write` calls — the same pattern this PR restores. - **ray-project#14703** This PR merged them into one `f.write(payload)` so it could wrap the write in `try/except IOError` and use the single write's return value for the recorded size. - **ray-project#17016** removed that `try/except` handling. Since then the code is just `written_bytes = f.write(payload); assert written_bytes == payload_len`, so the only thing the concatenation still does is force the avoidable copy. --------- Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com> Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>

ShockYoungCHN requested a review from a team as a code owner May 27, 2026 00:16

ShockYoungCHN assigned ShockYoungCHN and Kunchd and unassigned ShockYoungCHN May 27, 2026

ShockYoungCHN added core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests labels May 27, 2026

ShockYoungCHN force-pushed the spill-memcpy-fix branch from 9c7c5ea to f281e42 Compare May 27, 2026 00:19

gemini-code-assist Bot reviewed May 27, 2026

View reviewed changes

Kunchd reviewed May 27, 2026

View reviewed changes

[Core] Remove unnecessary comment in external_storage spill write

90f0e39

Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>

edoakes approved these changes May 28, 2026

View reviewed changes

edoakes merged commit 889186b into ray-project:master May 28, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[core] Avoid extra memcpy when spilling fused objects#63653

[core] Avoid extra memcpy when spilling fused objects#63653
edoakes merged 2 commits into
ray-project:masterfrom
ShockYoungCHN:spill-memcpy-fix

ShockYoungCHN commented May 27, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

Kunchd left a comment

Kunchd May 27, 2026

Uh oh!

Labels

4 participants

Uh oh!

Conversation

ShockYoungCHN commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Additional information

PR history

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Kunchd left a comment

Choose a reason for hiding this comment

Kunchd May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Labels

4 participants

ShockYoungCHN commented May 27, 2026 •

edited

Loading