Fix flaky test_plasma_unlimited::test_fallback_allocation_failure by ericl · Pull Request #17016 · ray-project/ray

ericl · 2021-07-12T22:01:08Z

Why are these changes needed?

This test was flaky since (1) we didn't pin the objects, which caused them to get spilled, (2) the spill directory was also /dev/shm, which filled up.

ericl · 2021-07-12T22:02:49Z

-                offset = f.tell()
-            except IOError:
-                return keys
+            written_bytes = f.write(payload)


@iycheng @rkooo567 removed this block since it was hiding the exception. Note that it seems we currently retry spilling errors forever, I filed another issue to track this: #17017

scv119 · 2021-07-13T03:10:44Z

    num_exceptions = 0
    refs = []
    for i in range(8):
+        print("Start put", i)


maybe remove this?

I kept it since it was mildly helpful for debugging and not too verbose

## Description `ExternalStorage._write_multiple_objects` builds the spill payload as `header + memoryview(buf)` and then calls `f.write(payload)`. The `+` concatenation allocates a new `bytes` object and copies the whole object buffer into it before writing — an avoidable Python-side memcpy. This PR writes the header and the buffer as two separate writes instead: ```python f.write(header) if buf_len: f.write(memoryview(buf)) ``` `f.write()` is a [buffer protocol][bp] consumer, so `f.write(memoryview(buf))` reads the buffer's memory directly with no copy; `+` is not a consumer — it must materialize a new concatenated object. The on-disk layout is byte-identical, so readers and offset accounting are unchanged. [bp]: https://docs.python.org/3/c-api/buffer.html ## Related issues N/A ## Additional information We instrument one spill and split it into three segments, derived from three raw measurements: - `T_python` — total duration of the Python `spill_objects` call in the IO worker (the whole `external_storage.spill_objects`). Measured by wrapping the delegated call with `perf_counter_ns`: [`external_storage.py` L673-L676][tp]. - `T_io` — time inside `f.write` / `f.flush` within that call (actual disk writes). Accumulated around the two writes ([`external_storage.py` L207-L211][tio-w]) and the flush ([L224-L226][tio-f]). - `T_total` — measured in the raylet C++ with `steady_clock` around the `SpillObjects` RPC, recorded in the reply callback, so it covers the full spill including the RPC and `T_python`: start at [`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1]. [tp]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676 [tio-w]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211 [tio-f]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226 [tt0]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377 [tt1]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392 From these: - **I/O** = `T_io` — the disk writes. - **Python logic** = `T_python − T_io` — the rest of the Python work: building each object's header, the payload concatenation this PR removes, and URL/offset bookkeeping. (Objects are already serialized in plasma, so no serialization happens here.) - **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round trip plus GIL acquisition, i.e. everything outside Python. Durations are measured within each process (`steady_clock` in C++, `perf_counter_ns` in Python) and differenced, so no cross-process clock alignment is needed. Profiling the filesystem spill path (object size swept 64 KB–128 MB), the concatenation copy scales with object size and rivals disk I/O for large objects. After the fix the Python-side time is flat and large-object spill is ~2× faster: | size | T_total before → after | |-------|---------------------------------| | 32MB | 38,911 → 21,300 µs (−45%) | | 64MB | 73,299 → 38,568 µs (−47%) | | 128MB | 525,784 → 256,012 µs (−51%) | Small objects (gRPC+GIL dominated) are unchanged, as expected. **Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic): ![before](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_before.png) ![after](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_after.png) Benchmark harness and raw data are on the `reform-spill` branch under `spill-bench/`. ### PR history The single concatenated write was not a performance choice and is safe to undo: - **#12087** (fusion, 2020) originally wrote the header and buffer as *separate* `f.write` calls — the same pattern this PR restores. - **#14703** This PR merged them into one `f.write(payload)` so it could wrap the write in `try/except IOError` and use the single write's return value for the recorded size. - **#17016** removed that `try/except` handling. Since then the code is just `written_bytes = f.write(payload); assert written_bytes == payload_len`, so the only thing the concatenation still does is force the avoidable copy. --------- Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com> Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>

) ## Description `ExternalStorage._write_multiple_objects` builds the spill payload as `header + memoryview(buf)` and then calls `f.write(payload)`. The `+` concatenation allocates a new `bytes` object and copies the whole object buffer into it before writing — an avoidable Python-side memcpy. This PR writes the header and the buffer as two separate writes instead: ```python f.write(header) if buf_len: f.write(memoryview(buf)) ``` `f.write()` is a [buffer protocol][bp] consumer, so `f.write(memoryview(buf))` reads the buffer's memory directly with no copy; `+` is not a consumer — it must materialize a new concatenated object. The on-disk layout is byte-identical, so readers and offset accounting are unchanged. [bp]: https://docs.python.org/3/c-api/buffer.html ## Related issues N/A ## Additional information We instrument one spill and split it into three segments, derived from three raw measurements: - `T_python` — total duration of the Python `spill_objects` call in the IO worker (the whole `external_storage.spill_objects`). Measured by wrapping the delegated call with `perf_counter_ns`: [`external_storage.py` L673-L676][tp]. - `T_io` — time inside `f.write` / `f.flush` within that call (actual disk writes). Accumulated around the two writes ([`external_storage.py` L207-L211][tio-w]) and the flush ([L224-L226][tio-f]). - `T_total` — measured in the raylet C++ with `steady_clock` around the `SpillObjects` RPC, recorded in the reply callback, so it covers the full spill including the RPC and `T_python`: start at [`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1]. [tp]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676 [tio-w]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211 [tio-f]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226 [tt0]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377 [tt1]: https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392 From these: - **I/O** = `T_io` — the disk writes. - **Python logic** = `T_python − T_io` — the rest of the Python work: building each object's header, the payload concatenation this PR removes, and URL/offset bookkeeping. (Objects are already serialized in plasma, so no serialization happens here.) - **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round trip plus GIL acquisition, i.e. everything outside Python. Durations are measured within each process (`steady_clock` in C++, `perf_counter_ns` in Python) and differenced, so no cross-process clock alignment is needed. Profiling the filesystem spill path (object size swept 64 KB–128 MB), the concatenation copy scales with object size and rivals disk I/O for large objects. After the fix the Python-side time is flat and large-object spill is ~2× faster: | size | T_total before → after | |-------|---------------------------------| | 32MB | 38,911 → 21,300 µs (−45%) | | 64MB | 73,299 → 38,568 µs (−47%) | | 128MB | 525,784 → 256,012 µs (−51%) | Small objects (gRPC+GIL dominated) are unchanged, as expected. **Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic): ![before](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_before.png) ![after](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_after.png) Benchmark harness and raw data are on the `reform-spill` branch under `spill-bench/`. ### PR history The single concatenated write was not a performance choice and is safe to undo: - **ray-project#12087** (fusion, 2020) originally wrote the header and buffer as *separate* `f.write` calls — the same pattern this PR restores. - **ray-project#14703** This PR merged them into one `f.write(payload)` so it could wrap the write in `try/except IOError` and use the single write's return value for the recorded size. - **ray-project#17016** removed that `try/except` handling. Since then the code is just `written_bytes = f.write(payload); assert written_bytes == payload_len`, so the only thing the concatenation still does is force the avoidable copy. --------- Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com> Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>

ericl added 2 commits July 12, 2021 14:54

fix

179b5db

fix catch

4a7675b

ericl commented Jul 12, 2021

View reviewed changes

ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 12, 2021

ericl assigned scv119 and rkooo567 Jul 13, 2021

scv119 approved these changes Jul 13, 2021

View reviewed changes

scv119 reviewed Jul 13, 2021

View reviewed changes

ericl merged commit e7350ff into master Jul 13, 2021

ericl deleted the deflake-fall branch July 13, 2021 03:17

ShockYoungCHN mentioned this pull request May 27, 2026

[core] Avoid extra memcpy when spilling fused objects #63653

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix flaky test_plasma_unlimited::test_fallback_allocation_failure#17016

Fix flaky test_plasma_unlimited::test_fallback_allocation_failure#17016
ericl merged 2 commits into
masterfrom
deflake-fall

ericl commented Jul 12, 2021

ericl Jul 12, 2021

scv119 Jul 13, 2021

ericl Jul 13, 2021

Labels

3 participants

Uh oh!

Conversation

ericl commented Jul 12, 2021

Why are these changes needed?

ericl Jul 12, 2021

Choose a reason for hiding this comment

scv119 Jul 13, 2021

Choose a reason for hiding this comment

ericl Jul 13, 2021

Choose a reason for hiding this comment

Labels

3 participants