Fix flaky test_plasma_unlimited::test_fallback_allocation_failure#17016
Merged
Conversation
ericl
commented
Jul 12, 2021
| offset = f.tell() | ||
| except IOError: | ||
| return keys | ||
| written_bytes = f.write(payload) |
Contributor
Author
scv119
approved these changes
Jul 13, 2021
scv119
reviewed
Jul 13, 2021
| num_exceptions = 0 | ||
| refs = [] | ||
| for i in range(8): | ||
| print("Start put", i) |
Contributor
Author
There was a problem hiding this comment.
I kept it since it was mildly helpful for debugging and not too verbose
edoakes
pushed a commit
that referenced
this pull request
May 28, 2026
## Description
`ExternalStorage._write_multiple_objects` builds the spill payload as
`header + memoryview(buf)` and then calls `f.write(payload)`. The `+`
concatenation allocates a new `bytes` object and copies the whole object
buffer into it before writing — an avoidable Python-side memcpy.
This PR writes the header and the buffer as two separate writes instead:
```python
f.write(header)
if buf_len:
f.write(memoryview(buf))
```
`f.write()` is a [buffer protocol][bp] consumer, so
`f.write(memoryview(buf))` reads the buffer's memory directly with no
copy; `+` is not a consumer — it must materialize a new concatenated
object. The on-disk layout is byte-identical, so readers and offset
accounting are unchanged.
[bp]: https://docs.python.org/3/c-api/buffer.html
## Related issues
N/A
## Additional information
We instrument one spill and split it into three segments, derived from
three raw measurements:
- `T_python` — total duration of the Python `spill_objects` call in the
IO worker (the whole `external_storage.spill_objects`). Measured by
wrapping the delegated call with `perf_counter_ns`:
[`external_storage.py` L673-L676][tp].
- `T_io` — time inside `f.write` / `f.flush` within that call (actual
disk writes). Accumulated around the two writes ([`external_storage.py`
L207-L211][tio-w]) and the flush ([L224-L226][tio-f]).
- `T_total` — measured in the raylet C++ with `steady_clock` around the
`SpillObjects` RPC, recorded in the reply callback, so it covers the
full spill including the RPC and `T_python`: start at
[`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1].
[tp]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676
[tio-w]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211
[tio-f]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226
[tt0]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377
[tt1]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392
From these:
- **I/O** = `T_io` — the disk writes.
- **Python logic** = `T_python − T_io` — the rest of the Python work:
building each object's header, the payload concatenation this PR
removes, and URL/offset bookkeeping. (Objects are already serialized in
plasma, so no serialization happens here.)
- **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round
trip plus GIL acquisition, i.e. everything outside Python.
Durations are measured within each process (`steady_clock` in C++,
`perf_counter_ns` in Python) and differenced, so no cross-process clock
alignment is needed.
Profiling the filesystem spill path (object size swept 64 KB–128 MB),
the concatenation copy scales with object size and rivals disk I/O for
large objects. After the fix the Python-side time is flat and
large-object spill is ~2× faster:
| size | T_total before → after |
|-------|---------------------------------|
| 32MB | 38,911 → 21,300 µs (−45%) |
| 64MB | 73,299 → 38,568 µs (−47%) |
| 128MB | 525,784 → 256,012 µs (−51%) |
Small objects (gRPC+GIL dominated) are unchanged, as expected.
**Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic):


Benchmark harness and raw data are on the `reform-spill` branch under
`spill-bench/`.
### PR history
The single concatenated write was not a performance choice and is safe
to undo:
- **#12087** (fusion, 2020) originally wrote the header and buffer as
*separate* `f.write` calls — the same pattern this PR restores.
- **#14703** This PR merged them into one `f.write(payload)` so it could
wrap the write in `try/except IOError` and use the single write's return
value for the recorded size.
- **#17016** removed that `try/except` handling. Since then the code is
just `written_bytes = f.write(payload); assert written_bytes ==
payload_len`, so the only thing the concatenation still does is force
the avoidable copy.
---------
Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com>
Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
rueian
pushed a commit
to rueian/ray
that referenced
this pull request
Jun 4, 2026
) ## Description `ExternalStorage._write_multiple_objects` builds the spill payload as `header + memoryview(buf)` and then calls `f.write(payload)`. The `+` concatenation allocates a new `bytes` object and copies the whole object buffer into it before writing — an avoidable Python-side memcpy. This PR writes the header and the buffer as two separate writes instead: ```python f.write(header) if buf_len: f.write(memoryview(buf)) ``` `f.write()` is a [buffer protocol][bp] consumer, so `f.write(memoryview(buf))` reads the buffer's memory directly with no copy; `+` is not a consumer — it must materialize a new concatenated object. The on-disk layout is byte-identical, so readers and offset accounting are unchanged. [bp]: https://docs.python.org/3/c-api/buffer.html ## Related issues N/A ## Additional information We instrument one spill and split it into three segments, derived from three raw measurements: - `T_python` — total duration of the Python `spill_objects` call in the IO worker (the whole `external_storage.spill_objects`). Measured by wrapping the delegated call with `perf_counter_ns`: [`external_storage.py` L673-L676][tp]. - `T_io` — time inside `f.write` / `f.flush` within that call (actual disk writes). Accumulated around the two writes ([`external_storage.py` L207-L211][tio-w]) and the flush ([L224-L226][tio-f]). - `T_total` — measured in the raylet C++ with `steady_clock` around the `SpillObjects` RPC, recorded in the reply callback, so it covers the full spill including the RPC and `T_python`: start at [`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1]. [tp]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676 [tio-w]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211 [tio-f]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226 [tt0]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377 [tt1]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392 From these: - **I/O** = `T_io` — the disk writes. - **Python logic** = `T_python − T_io` — the rest of the Python work: building each object's header, the payload concatenation this PR removes, and URL/offset bookkeeping. (Objects are already serialized in plasma, so no serialization happens here.) - **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round trip plus GIL acquisition, i.e. everything outside Python. Durations are measured within each process (`steady_clock` in C++, `perf_counter_ns` in Python) and differenced, so no cross-process clock alignment is needed. Profiling the filesystem spill path (object size swept 64 KB–128 MB), the concatenation copy scales with object size and rivals disk I/O for large objects. After the fix the Python-side time is flat and large-object spill is ~2× faster: | size | T_total before → after | |-------|---------------------------------| | 32MB | 38,911 → 21,300 µs (−45%) | | 64MB | 73,299 → 38,568 µs (−47%) | | 128MB | 525,784 → 256,012 µs (−51%) | Small objects (gRPC+GIL dominated) are unchanged, as expected. **Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic):   Benchmark harness and raw data are on the `reform-spill` branch under `spill-bench/`. ### PR history The single concatenated write was not a performance choice and is safe to undo: - **ray-project#12087** (fusion, 2020) originally wrote the header and buffer as *separate* `f.write` calls — the same pattern this PR restores. - **ray-project#14703** This PR merged them into one `f.write(payload)` so it could wrap the write in `try/except IOError` and use the single write's return value for the recorded size. - **ray-project#17016** removed that `try/except` handling. Since then the code is just `written_bytes = f.write(payload); assert written_bytes == payload_len`, so the only thing the concatenation still does is force the avoidable copy. --------- Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com> Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
limarkdcunha
pushed a commit
to limarkdcunha/ray
that referenced
this pull request
Jun 30, 2026
) ## Description `ExternalStorage._write_multiple_objects` builds the spill payload as `header + memoryview(buf)` and then calls `f.write(payload)`. The `+` concatenation allocates a new `bytes` object and copies the whole object buffer into it before writing — an avoidable Python-side memcpy. This PR writes the header and the buffer as two separate writes instead: ```python f.write(header) if buf_len: f.write(memoryview(buf)) ``` `f.write()` is a [buffer protocol][bp] consumer, so `f.write(memoryview(buf))` reads the buffer's memory directly with no copy; `+` is not a consumer — it must materialize a new concatenated object. The on-disk layout is byte-identical, so readers and offset accounting are unchanged. [bp]: https://docs.python.org/3/c-api/buffer.html ## Related issues N/A ## Additional information We instrument one spill and split it into three segments, derived from three raw measurements: - `T_python` — total duration of the Python `spill_objects` call in the IO worker (the whole `external_storage.spill_objects`). Measured by wrapping the delegated call with `perf_counter_ns`: [`external_storage.py` L673-L676][tp]. - `T_io` — time inside `f.write` / `f.flush` within that call (actual disk writes). Accumulated around the two writes ([`external_storage.py` L207-L211][tio-w]) and the flush ([L224-L226][tio-f]). - `T_total` — measured in the raylet C++ with `steady_clock` around the `SpillObjects` RPC, recorded in the reply callback, so it covers the full spill including the RPC and `T_python`: start at [`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1]. [tp]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676 [tio-w]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211 [tio-f]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226 [tt0]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377 [tt1]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392 From these: - **I/O** = `T_io` — the disk writes. - **Python logic** = `T_python − T_io` — the rest of the Python work: building each object's header, the payload concatenation this PR removes, and URL/offset bookkeeping. (Objects are already serialized in plasma, so no serialization happens here.) - **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round trip plus GIL acquisition, i.e. everything outside Python. Durations are measured within each process (`steady_clock` in C++, `perf_counter_ns` in Python) and differenced, so no cross-process clock alignment is needed. Profiling the filesystem spill path (object size swept 64 KB–128 MB), the concatenation copy scales with object size and rivals disk I/O for large objects. After the fix the Python-side time is flat and large-object spill is ~2× faster: | size | T_total before → after | |-------|---------------------------------| | 32MB | 38,911 → 21,300 µs (−45%) | | 64MB | 73,299 → 38,568 µs (−47%) | | 128MB | 525,784 → 256,012 µs (−51%) | Small objects (gRPC+GIL dominated) are unchanged, as expected. **Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic):   Benchmark harness and raw data are on the `reform-spill` branch under `spill-bench/`. ### PR history The single concatenated write was not a performance choice and is safe to undo: - **ray-project#12087** (fusion, 2020) originally wrote the header and buffer as *separate* `f.write` calls — the same pattern this PR restores. - **ray-project#14703** This PR merged them into one `f.write(payload)` so it could wrap the write in `try/except IOError` and use the single write's return value for the recorded size. - **ray-project#17016** removed that `try/except` handling. Since then the code is just `written_bytes = f.write(payload); assert written_bytes == payload_len`, so the only thing the concatenation still does is force the avoidable copy. --------- Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com> Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are these changes needed?
This test was flaky since (1) we didn't pin the objects, which caused them to get spilled, (2) the spill directory was also /dev/shm, which filled up.