[Core] Compute per component memory usage in MiB#63932

Merged: edoakes merged 10 commits into ray-project:master from Kunchd:per_component_metric on Jun 11, 2026

Kunchd commented Jun 8, 2026 (Contributor)

Description

We have observed that idle worker memory usage, as reported by the per-component memory usage metric, grows over time. A previous investigation concluded that idle workers can "leak" memory when imported libraries cache memory regions between tasks, resulting in a large idle worker footprint between runs. We have also previously found that memory leaks in popular libraries can cause idle worker memory growth: apache/arrow#39808.

However, a recent investigation into the trivial workload below, with no library usage, still showed idle worker memory growth over time.

import ray
from tqdm import tqdm
import time

ray.init()

@ray.remote
def produce():
    a = b"0" * int(0.5 * 1024**3)
    time.sleep(5)
    return a

@ray.remote
def consume(bytes):
    time.sleep(5)
    return "done"

refs = []
for _ in tqdm(range(16)):
    ref = produce.remote()
    refs.append(ref)
    consume_refs = []
    for i in range(4):
        consume_refs.append(consume.remote(ref))
    print(ray.get(consume_refs))

Per component metric showing idle worker memory growth:
image

This growth is attributed to how "unique set size" is calculated in the query for per component memory usage metric. Previously, we computed the unique set size via the following query:

(sum(ray_component_rss_mb{instance=~".+",SessionName=~".+",ray_io_cluster=~".*",ClusterId=~"ID"} * 1024 * 1024) by (Component)) - (sum(ray_component_mem_shared_bytes{instance=~".+",SessionName=~".+",ray_io_cluster=~".*",ClusterId=~"ID"}) by (Component))

The problematic part is the * 1024 * 1024, which attempts to convert the RSS back into bytes. However, the RSS was actually recorded in megabytes rather than mebibytes, so the conversion overestimates it. As the worker's RSS grows, the result becomes increasingly incorrect, as shown by the graph below, which compares the pre-existing query (where we incorrectly convert megabytes into bytes) with the actual unique set size.
image
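
To make the size of the error concrete, here is a minimal arithmetic sketch in Python. It assumes that "recorded in megabytes" means the byte count was divided by 10^6, and it follows the query's own RSS-minus-shared definition of unique set size. The function names and the sample worker sizes are hypothetical and are not taken from the Ray code base.

```python
MB = 1000 ** 2   # decimal megabyte (assumed unit of the old *_mb value)
MIB = 1024 ** 2  # binary mebibyte (what "* 1024 * 1024" implies)


def old_query_uss_bytes(rss_bytes: int, shared_bytes: int) -> float:
    """Mimic the old panel: RSS reported in MB, then scaled by 1024 * 1024."""
    rss_mb = rss_bytes / MB  # assumption: how the *_mb value was produced
    return rss_mb * MIB - shared_bytes


def true_uss_bytes(rss_bytes: int, shared_bytes: int) -> int:
    """RSS minus shared memory, with both values already in bytes."""
    return rss_bytes - shared_bytes


# Hypothetical idle worker: 600 MiB resident, 512 MiB of it shared.
rss = 600 * MIB
shared = 512 * MIB

error = old_query_uss_bytes(rss, shared) - true_uss_bytes(rss, shared)
print(f"true USS:     {true_uss_bytes(rss, shared) / MIB:.1f} MiB")
print(f"old query:    {old_query_uss_bytes(rss, shared) / MIB:.1f} MiB")
print(f"overestimate: {error / MIB:.1f} MiB")
```

Under these assumptions the overestimate is about 4.86% of RSS (1024^2 / 1000^2 - 1), so it grows as RSS grows even if the true unique set size stays flat.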

This PR addresses this mis-conversion by establishing bytes as the unit for all per component memory metrics. The updated graph of idle worker memory usage in the per component memory usage panel is shown below:
image

Related issues

Additional information

Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested review from a team as code owners June 8, 2026 21:59
@Kunchd Kunchd added the "go" label (add ONLY when ready to merge, run all tests) Jun 8, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the system metrics ray_component_rss_mb and ray_component_uss_mb to output in bytes instead of megabytes, renaming them to ray_component_rss_bytes and ray_component_uss_bytes respectively. The changes span documentation, dashboard panels, reporter agents, tests, and release scripts. Feedback on the changes highlights two issues: first, the Gauge definitions in reporter_agent.py mistakenly specify "MiB" as the unit metadata instead of "bytes"; second, in mem_check.py, the variable uss_mb_for_agent_component and its downstream assertions still assume megabytes, which will cause assertion failures now that the metric returns bytes.

Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread release/dashboard/mem_check.py
Comment thread release/dashboard/mem_check.py
Comment thread python/ray/dashboard/modules/reporter/tests/test_reporter.py
Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread release/benchmarks/distributed/many_nodes_tests/dashboard_test.py
@Yicheng-Lu-llll Yicheng-Lu-llll self-assigned this Jun 8, 2026
davik added 2 commits June 9, 2026 01:06
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
@ray-gardener ray-gardener Bot added the "core" (Issues that should be addressed in Ray Core) and "observability" (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Jun 9, 2026
Comment on lines +99 to +102
* - `ray_component_mem_shared_bytes`
- `Component`, `instance`
- The measured shared memory in bytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
* - `ray_component_uss_bytes`

Collaborator

naming is a bit inconsistent here -- should we call it mem_uss_bytes or drop the mem_ from the shared one?

@edoakes edoakes left a comment

Collaborator

As I mentioned offline, I am concerned about blanket breaking compatibility on the metric naming. Can we leave the _mb ones in place but update the dashboards to use _bytes? We can also note that these ones are deprecated in our monitoring docs.
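
One way to picture this suggestion is a reporter that derives both the byte-based metrics and the legacy `_mb` ones from a single measurement. The sketch below is only an illustration under stated assumptions: `component_memory_samples` is a hypothetical helper, not code from `reporter_agent.py`, and the division by 10^6 for the legacy values is assumed from the PR description. The metric names are the ones mentioned in this thread.

```python
from typing import Dict

MB = 1000 ** 2  # decimal megabyte; assumed divisor for the legacy *_mb gauges


def component_memory_samples(rss_bytes: int, uss_bytes: int) -> Dict[str, float]:
    """Build one sample per metric name from a single byte measurement."""
    return {
        # Byte-based metrics; dashboards would query these.
        "ray_component_rss_bytes": float(rss_bytes),
        "ray_component_uss_bytes": float(uss_bytes),
        # Deprecated names kept so existing queries and alerts keep working.
        "ray_component_rss_mb": rss_bytes / MB,
        "ray_component_uss_mb": uss_bytes / MB,
    }


samples = component_memory_samples(rss_bytes=250_000_000, uss_bytes=100_000_000)
for name, value in sorted(samples.items()):
    print(name, value)
```

Deriving every name from the same byte count keeps the old and new series consistent with each other, so a dashboard can move to `_bytes` without the two drifting apart.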

Signed-off-by: davik <davik@anyscale.com>
Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
davik and others added 2 commits June 10, 2026 20:51
Signed-off-by: davik <davik@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 1c410b8.

Comment thread release/dashboard/mem_check.py
Kunchd commented Jun 11, 2026 (Contributor, Author)

@edoakes Could you take another look when you get the chance? Thanks!

@edoakes edoakes left a comment

Collaborator

🫡

@edoakes edoakes merged commit 51eb67f into ray-project:master Jun 11, 2026
6 checks passed
edoakes added a commit that referenced this pull request Jun 11, 2026
Cherry-pick #63932

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Co-authored-by: davik <davik@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jun 25, 2026
…dge cases (#64354)

## Description

The `get_contributors` release-notes script extracts PR numbers from
commit subjects to look up contributor logins. The old `_find_pr_number`
helper grabbed all text between `(#` and the first `)`, which produced
wrong results for several real commit titles:

- A truncated revert like `Revert "... hot path (#6... (#64309)` yielded
`6... (#64309` instead of `64309`.
- A title carrying a fixed-issue reference followed by the merging PR,
e.g. `... cpu_percent (#63729) (#63733)`, yielded the issue number
`63729` instead of the PR `63733`.
- **Cherry-picks** such as `... in MiB (#63932) (#64042)` (original PR +
backport PR) credited only one number, silently dropping the original
author.

This PR replaces it with `_find_pr_numbers`, which:

- Matches only well-formed `(#<digits>)` tokens using a module-level
compiled regex (the helper runs up to thousands of times per
invocation).
- Returns **every** candidate in title order. The commit text alone
cannot tell an issue from a PR or an original from a backport, so `run`
queries all candidates via the GitHub API and credits each one that
resolves as a real PR. A cherry-pick now credits both the original
author and the backporter.
- Treats a `404` as expected (the number is an issue, not a PR) and
collects those numbers, printing them at the end. This way, if GitHub's
behavior ever changes and a real PR starts returning `404`, the dropped
numbers are surfaced rather than silently lost.
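
The parsing rule described in the list above can be sketched as follows. This is a minimal illustration, not the script's actual code: `find_pr_numbers_sketch` and `_PR_TOKEN` are hypothetical names, and the GitHub API lookup and `404` handling are left out.

```python
import re
from typing import List

# Compiled once at module level because the helper runs once per commit subject.
_PR_TOKEN = re.compile(r"\(#(\d+)\)")


def find_pr_numbers_sketch(subject: str) -> List[int]:
    """Return every well-formed (#<digits>) number, in title order."""
    return [int(number) for number in _PR_TOKEN.findall(subject)]


# Commit subjects taken from the examples in the description above.
truncated_revert = 'Revert "... hot path (#6... (#64309)'
issue_then_pr = "... cpu_percent (#63729) (#63733)"
cherry_pick = "... in MiB (#63932) (#64042)"

for subject in (truncated_revert, issue_then_pr, cherry_pick):
    print(find_pr_numbers_sketch(subject))
```

The truncated `(#6...` token is skipped because it has no closing parenthesis directly after the digits, while both numbers in the other two subjects are returned as candidates for the caller to resolve.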

## Related issues

N/A

## Additional information

Unit tests cover the parsing edge cases (truncated revert, issue+PR,
cherry-pick, non-digit parentheticals) and the CLI behavior (both
cherry-pick authors credited; an issue `404` does not drop the real PR
author and is reported in the output).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…dge cases (ray-project#64354)

Labels

- core: Issues that should be addressed in Ray Core
- go: add ONLY when ready to merge, run all tests
- observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

3 participants