Skip to content

Fix flaky test_plasma_unlimited::test_fallback_allocation_failure#17016

Merged
ericl merged 2 commits into
masterfrom
deflake-fall
Jul 13, 2021
Merged

Fix flaky test_plasma_unlimited::test_fallback_allocation_failure#17016
ericl merged 2 commits into
masterfrom
deflake-fall

Conversation

@ericl

@ericl ericl commented Jul 12, 2021

Copy link
Copy Markdown
Contributor

Why are these changes needed?

This test was flaky since (1) we didn't pin the objects, which caused them to get spilled, (2) the spill directory was also /dev/shm, which filled up.

offset = f.tell()
except IOError:
return keys
written_bytes = f.write(payload)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iycheng @rkooo567 removed this block since it was hiding the exception. Note that it seems we currently retry spilling errors forever, I filed another issue to track this: #17017

@ericl ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 12, 2021
num_exceptions = 0
refs = []
for i in range(8):
print("Start put", i)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it since it was mildly helpful for debugging and not too verbose

@ericl ericl merged commit e7350ff into master Jul 13, 2021
@ericl ericl deleted the deflake-fall branch July 13, 2021 03:17
edoakes pushed a commit that referenced this pull request May 28, 2026
## Description

`ExternalStorage._write_multiple_objects` builds the spill payload as
`header + memoryview(buf)` and then calls `f.write(payload)`. The `+`
concatenation allocates a new `bytes` object and copies the whole object
buffer into it before writing — an avoidable Python-side memcpy.

This PR writes the header and the buffer as two separate writes instead:

```python
f.write(header)
if buf_len:
    f.write(memoryview(buf))
```

`f.write()` is a [buffer protocol][bp] consumer, so
`f.write(memoryview(buf))` reads the buffer's memory directly with no
copy; `+` is not a consumer — it must materialize a new concatenated
object. The on-disk layout is byte-identical, so readers and offset
accounting are unchanged.

[bp]: https://docs.python.org/3/c-api/buffer.html

## Related issues

N/A

## Additional information

We instrument one spill and split it into three segments, derived from
three raw measurements:

- `T_python` — total duration of the Python `spill_objects` call in the
IO worker (the whole `external_storage.spill_objects`). Measured by
wrapping the delegated call with `perf_counter_ns`:
[`external_storage.py` L673-L676][tp].
- `T_io` — time inside `f.write` / `f.flush` within that call (actual
disk writes). Accumulated around the two writes ([`external_storage.py`
L207-L211][tio-w]) and the flush ([L224-L226][tio-f]).
- `T_total` — measured in the raylet C++ with `steady_clock` around the
`SpillObjects` RPC, recorded in the reply callback, so it covers the
full spill including the RPC and `T_python`: start at
[`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1].

[tp]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676
[tio-w]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211
[tio-f]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226
[tt0]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377
[tt1]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392

From these:

- **I/O** = `T_io` — the disk writes.
- **Python logic** = `T_python − T_io` — the rest of the Python work:
building each object's header, the payload concatenation this PR
removes, and URL/offset bookkeeping. (Objects are already serialized in
plasma, so no serialization happens here.)
- **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round
trip plus GIL acquisition, i.e. everything outside Python.

Durations are measured within each process (`steady_clock` in C++,
`perf_counter_ns` in Python) and differenced, so no cross-process clock
alignment is needed.

Profiling the filesystem spill path (object size swept 64 KB–128 MB),
the concatenation copy scales with object size and rivals disk I/O for
large objects. After the fix the Python-side time is flat and
large-object spill is ~2× faster:

| size  | T_total before → after          |
|-------|---------------------------------|
| 32MB  | 38,911 → 21,300 µs (−45%)       |
| 64MB  | 73,299 → 38,568 µs (−47%)       |
| 128MB | 525,784 → 256,012 µs (−51%)     |

Small objects (gRPC+GIL dominated) are unchanged, as expected.

**Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic):


![before](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_before.png)

![after](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_after.png)

Benchmark harness and raw data are on the `reform-spill` branch under
`spill-bench/`.

### PR history

The single concatenated write was not a performance choice and is safe
to undo:

- **#12087** (fusion, 2020) originally wrote the header and buffer as
*separate* `f.write` calls — the same pattern this PR restores.
- **#14703** This PR merged them into one `f.write(payload)` so it could
wrap the write in `try/except IOError` and use the single write's return
value for the recorded size.
- **#17016** removed that `try/except` handling. Since then the code is
just `written_bytes = f.write(payload); assert written_bytes ==
payload_len`, so the only thing the concatenation still does is force
the avoidable copy.

---------

Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com>
Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
)

## Description

`ExternalStorage._write_multiple_objects` builds the spill payload as
`header + memoryview(buf)` and then calls `f.write(payload)`. The `+`
concatenation allocates a new `bytes` object and copies the whole object
buffer into it before writing — an avoidable Python-side memcpy.

This PR writes the header and the buffer as two separate writes instead:

```python
f.write(header)
if buf_len:
    f.write(memoryview(buf))
```

`f.write()` is a [buffer protocol][bp] consumer, so
`f.write(memoryview(buf))` reads the buffer's memory directly with no
copy; `+` is not a consumer — it must materialize a new concatenated
object. The on-disk layout is byte-identical, so readers and offset
accounting are unchanged.

[bp]: https://docs.python.org/3/c-api/buffer.html

## Related issues

N/A

## Additional information

We instrument one spill and split it into three segments, derived from
three raw measurements:

- `T_python` — total duration of the Python `spill_objects` call in the
IO worker (the whole `external_storage.spill_objects`). Measured by
wrapping the delegated call with `perf_counter_ns`:
[`external_storage.py` L673-L676][tp].
- `T_io` — time inside `f.write` / `f.flush` within that call (actual
disk writes). Accumulated around the two writes ([`external_storage.py`
L207-L211][tio-w]) and the flush ([L224-L226][tio-f]).
- `T_total` — measured in the raylet C++ with `steady_clock` around the
`SpillObjects` RPC, recorded in the reply callback, so it covers the
full spill including the RPC and `T_python`: start at
[`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1].

[tp]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676
[tio-w]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211
[tio-f]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226
[tt0]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377
[tt1]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392

From these:

- **I/O** = `T_io` — the disk writes.
- **Python logic** = `T_python − T_io` — the rest of the Python work:
building each object's header, the payload concatenation this PR
removes, and URL/offset bookkeeping. (Objects are already serialized in
plasma, so no serialization happens here.)
- **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round
trip plus GIL acquisition, i.e. everything outside Python.

Durations are measured within each process (`steady_clock` in C++,
`perf_counter_ns` in Python) and differenced, so no cross-process clock
alignment is needed.

Profiling the filesystem spill path (object size swept 64 KB–128 MB),
the concatenation copy scales with object size and rivals disk I/O for
large objects. After the fix the Python-side time is flat and
large-object spill is ~2× faster:

| size  | T_total before → after          |
|-------|---------------------------------|
| 32MB  | 38,911 → 21,300 µs (−45%)       |
| 64MB  | 73,299 → 38,568 µs (−47%)       |
| 128MB | 525,784 → 256,012 µs (−51%)     |

Small objects (gRPC+GIL dominated) are unchanged, as expected.

**Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic):


![before](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_before.png)

![after](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_after.png)

Benchmark harness and raw data are on the `reform-spill` branch under
`spill-bench/`.

### PR history

The single concatenated write was not a performance choice and is safe
to undo:

- **ray-project#12087** (fusion, 2020) originally wrote the header and buffer as
*separate* `f.write` calls — the same pattern this PR restores.
- **ray-project#14703** This PR merged them into one `f.write(payload)` so it could
wrap the write in `try/except IOError` and use the single write's return
value for the recorded size.
- **ray-project#17016** removed that `try/except` handling. Since then the code is
just `written_bytes = f.write(payload); assert written_bytes ==
payload_len`, so the only thing the concatenation still does is force
the avoidable copy.

---------

Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com>
Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
)

## Description

`ExternalStorage._write_multiple_objects` builds the spill payload as
`header + memoryview(buf)` and then calls `f.write(payload)`. The `+`
concatenation allocates a new `bytes` object and copies the whole object
buffer into it before writing — an avoidable Python-side memcpy.

This PR writes the header and the buffer as two separate writes instead:

```python
f.write(header)
if buf_len:
    f.write(memoryview(buf))
```

`f.write()` is a [buffer protocol][bp] consumer, so
`f.write(memoryview(buf))` reads the buffer's memory directly with no
copy; `+` is not a consumer — it must materialize a new concatenated
object. The on-disk layout is byte-identical, so readers and offset
accounting are unchanged.

[bp]: https://docs.python.org/3/c-api/buffer.html

## Related issues

N/A

## Additional information

We instrument one spill and split it into three segments, derived from
three raw measurements:

- `T_python` — total duration of the Python `spill_objects` call in the
IO worker (the whole `external_storage.spill_objects`). Measured by
wrapping the delegated call with `perf_counter_ns`:
[`external_storage.py` L673-L676][tp].
- `T_io` — time inside `f.write` / `f.flush` within that call (actual
disk writes). Accumulated around the two writes ([`external_storage.py`
L207-L211][tio-w]) and the flush ([L224-L226][tio-f]).
- `T_total` — measured in the raylet C++ with `steady_clock` around the
`SpillObjects` RPC, recorded in the reply callback, so it covers the
full spill including the RPC and `T_python`: start at
[`local_object_manager.cc` L377][tt0], recorded at [L388-L392][tt1].

[tp]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L673-L676
[tio-w]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L207-L211
[tio-f]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/python/ray/_private/external_storage.py#L224-L226
[tt0]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L377
[tt1]:
https://github.com/ShockYoungCHN/ray/blob/14bc999dec/src/ray/raylet/local_object_manager.cc#L388-L392

From these:

- **I/O** = `T_io` — the disk writes.
- **Python logic** = `T_python − T_io` — the rest of the Python work:
building each object's header, the payload concatenation this PR
removes, and URL/offset bookkeeping. (Objects are already serialized in
plasma, so no serialization happens here.)
- **gRPC + GIL** = `T_total − T_python` — the raylet→IO-worker RPC round
trip plus GIL acquisition, i.e. everything outside Python.

Durations are measured within each process (`steady_clock` in C++,
`perf_counter_ns` in Python) and differenced, so no cross-process clock
alignment is needed.

Profiling the filesystem spill path (object size swept 64 KB–128 MB),
the concatenation copy scales with object size and rivals disk I/O for
large objects. After the fix the Python-side time is flat and
large-object spill is ~2× faster:

| size  | T_total before → after          |
|-------|---------------------------------|
| 32MB  | 38,911 → 21,300 µs (−45%)       |
| 64MB  | 73,299 → 38,568 µs (−47%)       |
| 128MB | 525,784 → 256,012 µs (−51%)     |

Small objects (gRPC+GIL dominated) are unchanged, as expected.

**Before / after** (% of spill time: I/O vs gRPC+GIL vs Python logic):


![before](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_before.png)

![after](https://github.com/ShockYoungCHN/ray/raw/reform-spill/spill-bench/out/spill_breakdown_after.png)

Benchmark harness and raw data are on the `reform-spill` branch under
`spill-bench/`.

### PR history

The single concatenated write was not a performance choice and is safe
to undo:

- **ray-project#12087** (fusion, 2020) originally wrote the header and buffer as
*separate* `f.write` calls — the same pattern this PR restores.
- **ray-project#14703** This PR merged them into one `f.write(payload)` so it could
wrap the write in `try/except IOError` and use the single write's return
value for the recorded size.
- **ray-project#17016** removed that `try/except` handling. Since then the code is
just `written_bytes = f.write(payload); assert written_bytes ==
payload_len`, so the only thing the concatenation still does is force
the avoidable copy.

---------

Signed-off-by: Yuanzhuo Yang <yuanzhuoyang@gmail.com>
Signed-off-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Yuanzhuo Yang <yuanzhuo.yang1@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

3 participants