[Core] Compute per component memory usage in MiB #63932

Merged

edoakes merged 10 commits into ray-project:master from Kunchd:per_component_metric

Jun 11, 2026

Kunchd commented Jun 8, 2026

Contributor

Description

We have observed that idle worker memory usage, as reported by the per component memory usage metric, grows over time. Previous investigation concluded that idle workers can "leak" memory because imported libraries cache memory regions between tasks, leaving a large idle worker footprint between runs. We have also previously found that memory leaks in popular libraries can cause idle worker memory growth: apache/arrow#39808.

However, a recent investigation into the trivial workload below, with no library usage, still showed idle worker memory growth over time.

import ray
from tqdm import tqdm
import time

ray.init()

@ray.remote
def produce():
    a = b"0" * int(0.5 * 1024**3)
    time.sleep(5)
    return a

@ray.remote
def consume(bytes):
    time.sleep(5)
    return "done"

refs = []
for _ in tqdm(range(16)):
    ref = produce.remote()
    refs.append(ref)
    consume_refs = []
    for i in range(4):
        consume_refs.append(consume.remote(ref))
    print(ray.get(consume_refs))

Per component metric showing idle worker memory growth:

This growth is attributed to how "unique set size" is calculated in the query for the per component memory usage metric. Previously, we computed the unique set size via the following query:

(sum(ray_component_rss_mb{instance=~".+",SessionName=~".+",ray_io_cluster=~".*",ClusterId=~"ID"} * 1024 * 1024) by (Component)) - (sum(ray_component_mem_shared_bytes{instance=~".+",SessionName=~".+",ray_io_cluster=~".*",ClusterId=~"ID"}) by (Component))

The problematic part is the * 1024 * 1024, which attempts to convert the RSS back into bytes. However, the RSS was actually recorded in megabytes instead of mebibytes, so the conversion overestimates the value. As the worker's RSS grows, the error grows with it, as shown by the graph below, which compares the pre-existing query (where we incorrectly convert megabytes into bytes) with the actual unique set size.
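To make the size of the error concrete, here is a minimal sketch of the arithmetic. It assumes the reporter recorded RSS as bytes divided by 1.0e6 (megabytes) while the dashboard query multiplied by 1024 * 1024; the function names are hypothetical and are not part of Ray.

```python
MB = 1000 ** 2   # megabyte: the unit assumed for the recorded RSS value
MIB = 1024 ** 2  # mebibyte: the unit implied by the query's "* 1024 * 1024"


def rss_bytes_from_old_query(rss_bytes: int) -> float:
    """Mimic the old path: record RSS in MB, then scale back up by 1024 * 1024."""
    rss_mb = rss_bytes / MB
    return rss_mb * MIB


def overestimate_bytes(rss_bytes: int) -> float:
    """Bytes added by the unit mismatch; grows linearly with the true RSS."""
    return rss_bytes_from_old_query(rss_bytes) - rss_bytes


if __name__ == "__main__":
    for gib in (1, 4, 16):
        actual = gib * 1024 ** 3
        extra_mib = overestimate_bytes(actual) / MIB
        print(f"RSS {gib} GiB -> overestimated by about {extra_mib:.1f} MiB")
```

Under these assumptions the inflated RSS is about 4.9% too large (a factor of 1.048576). Because the shared memory term is subtracted in true bytes, the whole surplus lands in the computed unique set size, which is why the reported growth tracks the worker's RSS.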

This PR addresses this mis-conversion by establishing bytes as the unit for all per component memory metrics. The updated graph of idle worker memory usage in the per component memory usage panel is shown below:

Related issues

Additional information


          Compute per component memory usage in bytes

47428a0

Signed-off-by: davik <davik@anyscale.com>

Kunchd requested review from a team as code owners

June 8, 2026 21:59

Kunchd added the go label

gemini-code-assist Bot reviewed

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the system metrics ray_component_rss_mb and ray_component_uss_mb to output in bytes instead of megabytes, renaming them to ray_component_rss_bytes and ray_component_uss_bytes respectively. The changes span documentation, dashboard panels, reporter agents, tests, and release scripts. Feedback on the changes highlights two issues: first, the Gauge definitions in reporter_agent.py mistakenly specify "MiB" as the unit metadata instead of "bytes"; second, in mem_check.py, the variable uss_mb_for_agent_component and its downstream assertions still assume megabytes, which will cause assertion failures now that the metric returns bytes.

python/ray/dashboard/modules/reporter/reporter_agent.py Outdated

release/dashboard/mem_check.py

cursor Bot reviewed

release/dashboard/mem_check.py

python/ray/dashboard/modules/reporter/tests/test_reporter.py

python/ray/dashboard/modules/reporter/reporter_agent.py Outdated

release/benchmarks/distributed/many_nodes_tests/dashboard_test.py

davik and others added 2 commits

June 8, 2026 23:39


          Update tests to reflect bytes level per component metric

45a9db5

Signed-off-by: davik <davik@anyscale.com>


          Merge branch 'master' into per_component_metric

b7f059d

Yicheng-Lu-llll self-assigned this

Yicheng-Lu-llll approved these changes

Yicheng-Lu-llll commented Jun 9, 2026

Member

https://github.com/search?q=repo%3Aray-project%2Fray+1.0e6&type=code

Seems some of the paths still use / 1.0e6

davik added 2 commits

June 9, 2026 01:06


          Update component rss, uss on dashboard head to also report bytes

c191a7a

Signed-off-by: davik <davik@anyscale.com>


          Update test reporter to assert on bytes

b5ae429

Signed-off-by: davik <davik@anyscale.com>

ray-gardener Bot added core observability labels

edoakes reviewed

doc/source/ray-observability/reference/system-metrics.rst Outdated

Comment on lines +99 to +102

+                 * - `ray_component_mem_shared_bytes`
+                   - `Component`, `instance`
+                   - The measured shared memory in bytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
+                 * - `ray_component_uss_bytes`

edoakes Jun 9, 2026

Collaborator

naming is a bit inconsistent here -- should we call it mem_uss_bytes or drop the mem_ from the shared one?

edoakes reviewed

edoakes left a comment

Collaborator

As I mentioned offline, I am concerned about blanket breaking compatibility on the metric naming. Can we leave the _mb ones in place but update the dashboards to use _bytes? We can also note that these ones are deprecated in our monitoring docs.

edoakes requested changes


          Add mb metrics for backwards compatibility

6dd0d87

Signed-off-by: davik <davik@anyscale.com>

cursor Bot reviewed

python/ray/dashboard/modules/reporter/reporter_agent.py Outdated

davik and others added 2 commits

June 10, 2026 20:51


          Fix typo

d941692

Signed-off-by: davik <davik@anyscale.com>


          Merge branch 'master' into per_component_metric

1c410b8

cursor Bot reviewed

cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 1c410b8.

release/dashboard/mem_check.py

davik and others added 2 commits

June 11, 2026 00:29


          Update test_reporter.py to reflect newly added guage counts

391ca62

Signed-off-by: davik <davik@anyscale.com>


          Merge branch 'master' into per_component_metric

22abe28

Kunchd commented Jun 11, 2026

Contributor Author

@edoakes Could you take another look when you get the chance? Thanks!

edoakes approved these changes

edoakes left a comment

Collaborator

🫡

edoakes merged commit 51eb67f into ray-project:master

6 checks passed

edoakes mentioned this pull request

[Core] Compute per component memory usage in MiB (#63932) #64042

Merged

edoakes added a commit that referenced this pull request


          [Core] Compute per component memory usage in MiB (#63932) (#64042)

f211d8a

Cherry-pick #63932

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Co-authored-by: davik <davik@anyscale.com>

elliot-barn pushed a commit that referenced this pull request


          [ci] Fix PR number parsing in get_contributors for cherry-picks and e…

e740785

…dge cases (#64354)

## Description

The `get_contributors` release-notes script extracts PR numbers from
commit subjects to look up contributor logins. The old `_find_pr_number`
helper grabbed all text between `(#` and the first `)`, which produced
wrong results for several real commit titles:

- A truncated revert like `Revert "... hot path (#6... (#64309)` yielded
`6... (#64309` instead of `64309`.
- A title carrying a fixed-issue reference followed by the merging PR,
e.g. `... cpu_percent (#63729) (#63733)`, yielded the issue number
`63729` instead of the PR `63733`.
- **Cherry-picks** such as `... in MiB (#63932) (#64042)` (original PR +
backport PR) credited only one number, silently dropping the original
author.

This PR replaces it with `_find_pr_numbers`, which:

- Matches only well-formed `(#<digits>)` tokens using a module-level
compiled regex (the helper runs up to thousands of times per
invocation).
- Returns **every** candidate in title order. The commit text alone
cannot tell an issue from a PR or an original from a backport, so `run`
queries all candidates via the GitHub API and credits each one that
resolves as a real PR. A cherry-pick now credits both the original
author and the backporter.
- Treats a `404` as expected (the number is an issue, not a PR) and
collects those numbers, printing them at the end. This way, if GitHub's
behavior ever changes and a real PR starts returning `404`, the dropped
numbers are surfaced rather than silently lost.
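The parsing behaviour described in the list above can be sketched as follows. This is a minimal illustration and not the script's actual code: the names `PR_TOKEN` and `find_pr_numbers` are hypothetical, and the GitHub API lookup and `404` handling are left out.

```python
import re

# Compiled once at module level, since the helper may run thousands of times.
# Matches only well-formed "(#<digits>)" tokens.
PR_TOKEN = re.compile(r"\(#(\d+)\)")


def find_pr_numbers(subject: str) -> list[int]:
    """Return every "(#<digits>)" candidate in the order it appears in the subject."""
    return [int(number) for number in PR_TOKEN.findall(subject)]


if __name__ == "__main__":
    # The truncated "(#6..." is skipped because it is not a well-formed token.
    print(find_pr_numbers('Revert "... hot path (#6... (#64309)'))
    # Issue reference plus merging PR: both are returned as candidates.
    print(find_pr_numbers("... cpu_percent (#63729) (#63733)"))
    # Cherry-pick: original PR and backport PR are both returned.
    print(find_pr_numbers("... in MiB (#63932) (#64042)"))
```

Returning all candidates keeps the parser simple and leaves the issue-versus-PR decision to the API lookup, which is the only place that information is available.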

## Related issues

N/A

## Additional information

Unit tests cover the parsing edge cases (truncated revert, issue+PR,
cherry-pick, non-digit parentheticals) and the CLI behavior (both
cherry-pick authors credited; an issue `404` does not drop the real PR
author and is reported in the output).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request


          [Core] Compute per component memory usage in MiB (ray-project#63932)

d66788c


limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request


          [ci] Fix PR number parsing in get_contributors for cherry-picks and e…

d0f3b72

…dge cases (ray-project#64354)


Labels

core go observability