[Core] Compute per component memory usage in MiB #63932
Signed-off-by: davik <davik@anyscale.com>
Code Review
This pull request updates the system metrics ray_component_rss_mb and ray_component_uss_mb to output in bytes instead of megabytes, renaming them to ray_component_rss_bytes and ray_component_uss_bytes respectively. The changes span documentation, dashboard panels, reporter agents, tests, and release scripts. Feedback on the changes highlights two issues: first, the Gauge definitions in reporter_agent.py mistakenly specify "MiB" as the unit metadata instead of "bytes"; second, in mem_check.py, the variable uss_mb_for_agent_component and its downstream assertions still assume megabytes, which will cause assertion failures now that the metric returns bytes.
Signed-off-by: davik <davik@anyscale.com>
https://github.com/search?q=repo%3Aray-project%2Fray+1.0e6&type=code Seems some of the paths still use
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
   * - `ray_component_mem_shared_bytes`
     - `Component`, `instance`
     - The measured shared memory in bytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors.
   * - `ray_component_uss_bytes`
naming is a bit inconsistent here -- should we call it mem_uss_bytes or drop the mem_ from the shared one?
edoakes left a comment
As I mentioned offline, I am concerned about blanket breaking compatibility on the metric naming. Can we leave the _mb ones in place but update the dashboards to use _bytes? We can also note that these ones are deprecated in our monitoring docs.
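One way to picture the compatibility request above is a reporter that publishes both generations of metric names from the same byte measurements. The sketch below is illustrative only: `component_memory_samples` and `BYTES_PER_MB` are hypothetical names, not code from this PR, and the decimal-megabyte divisor for the legacy `_mb` gauges is an assumption based on the description's note that RSS was recorded in megabytes.

```python
# Hypothetical helper, not taken from the PR: it only illustrates keeping the
# deprecated *_mb metric names alive next to the new *_bytes names.
BYTES_PER_MB = 1_000_000  # assumption: legacy gauges use decimal megabytes


def component_memory_samples(rss_bytes: int, uss_bytes: int) -> dict[str, float]:
    """Return metric name -> value for one component, old and new names together."""
    return {
        # New metrics: raw bytes, so dashboards need no unit conversion.
        "ray_component_rss_bytes": float(rss_bytes),
        "ray_component_uss_bytes": float(uss_bytes),
        # Deprecated metrics: same measurements, kept so existing users'
        # dashboards and alerts keep working during a deprecation window.
        "ray_component_rss_mb": rss_bytes / BYTES_PER_MB,
        "ray_component_uss_mb": uss_bytes / BYTES_PER_MB,
    }


samples = component_memory_samples(rss_bytes=2_000_000, uss_bytes=500_000)
print(samples["ray_component_rss_mb"])     # 2.0
print(samples["ray_component_uss_bytes"])  # 500000.0
```

Dashboards would then read only the `_bytes` series, while the `_mb` series stay available and are documented as deprecated.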
Signed-off-by: davik <davik@anyscale.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 1c410b8.
Signed-off-by: davik <davik@anyscale.com>
@edoakes Could you take another look when you get the chance? Thanks!
…dge cases (#64354)

## Description

The `get_contributors` release-notes script extracts PR numbers from commit subjects to look up contributor logins. The old `_find_pr_number` helper grabbed all text between `(#` and the first `)`, which produced wrong results for several real commit titles:

- A truncated revert like `Revert "... hot path (#6... (#64309)` yielded `6... (#64309` instead of `64309`.
- A title carrying a fixed-issue reference followed by the merging PR, e.g. `... cpu_percent (#63729) (#63733)`, yielded the issue number `63729` instead of the PR `63733`.
- **Cherry-picks** such as `... in MiB (#63932) (#64042)` (original PR + backport PR) credited only one number, silently dropping the original author.

This PR replaces it with `_find_pr_numbers`, which:

- Matches only well-formed `(#<digits>)` tokens using a module-level compiled regex (the helper runs up to thousands of times per invocation).
- Returns **every** candidate in title order. The commit text alone cannot tell an issue from a PR or an original from a backport, so `run` queries all candidates via the GitHub API and credits each one that resolves as a real PR. A cherry-pick now credits both the original author and the backporter.
- Treats a `404` as expected (the number is an issue, not a PR) and collects those numbers, printing them at the end. This way, if GitHub's behavior ever changes and a real PR starts returning `404`, the dropped numbers are surfaced rather than silently lost.

## Related issues

N/A

## Additional information

Unit tests cover the parsing edge cases (truncated revert, issue+PR, cherry-pick, non-digit parentheticals) and the CLI behavior (both cherry-pick authors credited; an issue `404` does not drop the real PR author and is reported in the output).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
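The parsing rule described in that commit message (match only well-formed `(#<digits>)` tokens and return every candidate in title order) can be sketched in a few lines. This is a minimal illustration, not the actual `_find_pr_numbers` from the script: the name `find_pr_numbers_sketch`, the exact regex, and the choice to return integers are all assumptions.

```python
import re

# Compiled once at module level, since the helper may run thousands of times.
# Assumed pattern: a literal "(#", one or more digits, then a literal ")".
_PR_TOKEN = re.compile(r"\(#(\d+)\)")


def find_pr_numbers_sketch(subject: str) -> list[int]:
    """Return every well-formed (#<digits>) number in the subject, in title order."""
    return [int(number) for number in _PR_TOKEN.findall(subject)]


# Commit subjects quoted in the description above.
print(find_pr_numbers_sketch('Revert "... hot path (#6... (#64309)'))  # [64309]
print(find_pr_numbers_sketch("... cpu_percent (#63729) (#63733)"))     # [63729, 63733]
print(find_pr_numbers_sketch("... in MiB (#63932) (#64042)"))          # [63932, 64042]
```

The truncated `(#6...` token is skipped because no closing parenthesis follows its digits. Deciding which candidates are real PRs is left to the caller, which per the description queries the GitHub API and treats a `404` as "this number is an issue".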
We have observed that the idle worker memory usage as reported in the per component memory usage metric grows over time. Previous investigation has concluded that it's possible for idle workers to "leak" memory due to imported libraries caching memory regions between tasks, resulting in a large idle worker footprint between runs. We have also previously uncovered that memory leaks in popular libraries can result in idle worker memory growth: apache/arrow#39808.

However, a recent investigation into the trivial workload below with no library usage still showed idle worker memory growth over time.

```py
import ray
from tqdm import tqdm
import time

ray.init()

@ray.remote
def produce():
    a = b"0" * int(0.5 * 1024**3)
    time.sleep(5)
    return a

@ray.remote
def consume(bytes):
    time.sleep(5)
    return "done"

refs = []
for _ in tqdm(range(16)):
    ref = produce.remote()
    refs.append(ref)

consume_refs = []
for i in range(4):
    consume_refs.append(consume.remote(ref))

print(ray.get(consume_refs))
```

Per component metric showing idle worker memory growth:

<img width="1687" height="398" alt="image" src="https://github.com/user-attachments/assets/a8201713-d64a-44ff-bfd3-d9e070dd88b3" />

This growth is attributed to how "unique set size" is calculated in the query for the per component memory usage metric. Previously, we computed the unique set size via the following query:

```
(sum(ray_component_rss_mb{instance=~".+",SessionName=~".+",ray_io_cluster=~".*",ClusterId=~"ID"} * 1024 * 1024) by (Component))
-
(sum(ray_component_mem_shared_bytes{instance=~".+",SessionName=~".+",ray_io_cluster=~".*",ClusterId=~"ID"}) by (Component))
```

The problematic part is the `* 1024 * 1024`, which attempts to convert the RSS back into bytes. However, the RSS was actually recorded in megabytes instead of mebibytes, so the query overestimates it.

As the RSS for the worker grows, the error grows with it, as shown by the graph below, which compares the pre-existing query (where we incorrectly convert megabytes into bytes) with the actual unique set size.

<img width="1189" height="590" alt="image" src="https://github.com/user-attachments/assets/5537ac8a-efb2-4160-926c-9259540e135b" />

This PR addresses this mis-conversion by establishing bytes as the unit for all per component memory metrics. The updated graph of idle worker memory usage in the per component memory usage panel is shown below:

<img width="1689" height="398" alt="image" src="https://github.com/user-attachments/assets/09437ccd-b7a3-43f4-a7cc-7a430710164e" />

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>

Description
We have observed that the idle worker memory usage as reported in the per component memory usage metric grows over time. Previous investigation has concluded that it's possible for idle workers to "leak" memory due to imported libraries caching memory regions between tasks, resulting in a large idle worker footprint between runs. We have also previously uncovered that memory leaks in popular libraries can also result in idle worker memory growth: apache/arrow#39808.
However, a recent investigation into the trivial workload below with no library usage still showed idle worker memory growth over time.
Per component metric showing idle worker memory growth:

This growth is attributed to how "unique set size" is calculated in the query for the per component memory usage metric. Previously, the query computed the unique set size as `ray_component_rss_mb * 1024 * 1024` minus `ray_component_mem_shared_bytes`, each summed by `Component`.

The problematic part is the `* 1024 * 1024`, which attempts to convert the RSS back into bytes. However, the RSS was actually recorded in megabytes instead of mebibytes, so the query overestimates it. As the RSS for the worker grows, the error grows with it: the gap between the pre-existing query (where we incorrectly convert megabytes into bytes) and the actual unique set size keeps widening.

This PR addresses this mis-conversion by establishing bytes as the unit for all per component memory metrics. The per component memory usage panel now shows the corrected idle worker memory usage.
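To make the size of the mis-conversion concrete, the sketch below reproduces the arithmetic of the old query next to the intended calculation. It is only a model of the description: the function names are hypothetical, and it assumes the legacy gauge held bytes divided by 10^6 (decimal megabytes), as stated above.

```python
MB = 1_000_000     # decimal megabyte: the unit the legacy gauge was recorded in
MIB = 1024 * 1024  # mebibyte: the unit the old dashboard query assumed


def legacy_query_uss_bytes(rss_bytes: int, shared_bytes: int) -> float:
    """Model of the old query: rss_mb * 1024 * 1024 - shared_bytes."""
    rss_mb = rss_bytes / MB  # value as exported by the legacy gauge
    return rss_mb * MIB - shared_bytes


def actual_uss_bytes(rss_bytes: int, shared_bytes: int) -> int:
    """Intended calculation: RSS minus shared memory, all in bytes."""
    return rss_bytes - shared_bytes


rss, shared = 2_000_000_000, 500_000_000
overshoot = legacy_query_uss_bytes(rss, shared) - actual_uss_bytes(rss, shared)
print(overshoot)        # 97152000.0
print(overshoot / rss)  # 0.048576
```

The overshoot is always about 4.86% of the RSS (`1024 * 1024 / 1_000_000 - 1`), independent of the shared memory term, which is why the absolute error grows as the worker's RSS grows.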

Related issues
Additional information