
[Core] Add host vs container memory usage distinction to memory panels #63111

Merged
edoakes merged 11 commits into ray-project:master from Kunchd:host-memory on May 11, 2026


@Kunchd Kunchd commented May 4, 2026

Contributor

Description

Our existing dashboard tracks memory utilization only at the container level when we detect that the node is running inside a container. This can be misleading: container memory utilization can be considerably lower than actual host memory utilization when processes running outside the container on the host consume a non-negligible amount of memory. In such scenarios, we have observed kernel OOMs even while the memory utilization metric looked healthy, because the metric did not reflect the entire system's memory utilization.

This PR emits both host-level and container-level memory utilization to provide a more comprehensive view of system memory utilization. This way, we can more accurately determine when the system is under memory pressure at both the container and host layers.

The image below shows an example of the change: the memory utilization panels now let us view either the container-level or the host-level memory usage stats.
<img width="1903" height="645" alt="image" src="https://github.com/user-attachments/assets/b0f45272-c079-4512-8e1a-832dfa15706a" />
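
To make the host-level side of this distinction concrete, here is a minimal sketch (not the code in this PR) of deriving host memory usage from `/proc/meminfo`-style text. It assumes that "used" means `MemTotal - MemAvailable`, and that `/proc/meminfo` inside the container reflects the host, which is typically the case unless something like lxcfs overlays it. The function names `parse_meminfo` and `host_memory_used_and_total` are hypothetical.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field_name: bytes} mapping."""
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # Skip lines that are not "Name: value" pairs.
        parts = rest.split()
        if not parts or not parts[0].isdigit():
            continue  # Skip malformed or non-numeric entries.
        value = int(parts[0])
        # Sizes are reported with a "kB" suffix; convert them to bytes.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        stats[name.strip()] = value
    return stats


def host_memory_used_and_total(meminfo_text: str) -> tuple:
    """Return (used_bytes, total_bytes) for the host.

    Assumption: used = MemTotal - MemAvailable.
    """
    stats = parse_meminfo(meminfo_text)
    total = stats["MemTotal"]
    available = stats["MemAvailable"]
    return total - available, total
```

On a Linux node this could be fed with the contents of `/proc/meminfo`, for example `host_memory_used_and_total(open("/proc/meminfo").read())`. Comparing the result with the container's cgroup usage shows how far the two views can diverge.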

Related issues

Additional information

Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested a review from a team as a code owner May 4, 2026 20:09

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces separate tracking for host-level and container-level (cgroup) memory metrics within Ray. It adds a new utility function get_cgroup_mem_stats to handle both cgroup v1 and v2, updates the reporter agent to export these as distinct Prometheus metrics, and refactors the dashboard panels to visualize both host and container memory usage. The review feedback suggests improving the robustness of the cgroup file parsing by adding error handling and input validation, and recommends optimizing performance by reducing redundant calls to memory-fetching utilities.
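
As an illustration of what reading container memory under both cgroup versions can look like, here is a rough sketch. It is not the PR's `get_cgroup_mem_stats`, whose signature and return value are not shown in this conversation. The helper names `_read_int` and `read_cgroup_memory` are hypothetical, and it assumes the conventional mount point `/sys/fs/cgroup`, with `memory.current` / `memory.max` for v2 and `memory/memory.usage_in_bytes` / `memory/memory.limit_in_bytes` for v1. In line with the review feedback, unreadable or malformed files yield `None` instead of raising.

```python
import os
from typing import Optional, Tuple


def _read_int(path: str) -> Optional[int]:
    """Read a single integer from a cgroup file, or None if unavailable."""
    try:
        with open(path) as f:
            raw = f.read().strip()
    except OSError:
        return None  # File is missing or unreadable.
    if raw == "max":
        return None  # cgroup v2 writes "max" when no limit is set.
    try:
        return int(raw)
    except ValueError:
        return None  # Unexpected contents; treat as unavailable.


def read_cgroup_memory(
    root: str = "/sys/fs/cgroup",
) -> Tuple[Optional[int], Optional[int]]:
    """Return (used_bytes, limit_bytes); each is None where unavailable."""
    # cgroup v2 (unified hierarchy): files live directly under the root.
    v2_current = os.path.join(root, "memory.current")
    if os.path.exists(v2_current):
        return _read_int(v2_current), _read_int(os.path.join(root, "memory.max"))

    # cgroup v1: the memory controller has its own subdirectory.
    # Note: an unlimited v1 cgroup reports a very large limit, not "max".
    v1_dir = os.path.join(root, "memory")
    used = _read_int(os.path.join(v1_dir, "memory.usage_in_bytes"))
    limit = _read_int(os.path.join(v1_dir, "memory.limit_in_bytes"))
    return used, limit
```

Taking the root directory as a parameter keeps the sketch testable against a temporary directory, without needing a real cgroup filesystem.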

Comment thread python/ray/_private/utils.py Outdated
Comment thread python/ray/_private/utils.py
Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread python/ray/dashboard/modules/metrics/dashboards/common.py
Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested review from a team as code owners May 4, 2026 23:51
@Kunchd Kunchd added the go label (add ONLY when ready to merge, run all tests) May 4, 2026
Comment thread python/ray/_private/utils.py
Comment thread python/ray/dashboard/modules/metrics/dashboards/default_dashboard_panels.py Outdated
@ray-gardener ray-gardener Bot added the dashboard (Issues specific to the Ray Dashboard), core (Issues that should be addressed in Ray Core), and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels May 5, 2026
davik and others added 2 commits May 5, 2026 17:57
Comment on lines +85 to +88
      query=f"ray_node_mem_used_host{sf}", **kwargs
  ),
  "total_memory": client.query_prometheus(
-     query=f"ray_node_mem_total{sf}", **kwargs
+     query=f"ray_node_mem_total_host{sf}", **kwargs

Contributor

so the original metrics will not be used/captured any more?

this seems to be changing what memory_usage means?

maybe separate this change into another PR?

Contributor Author

The goal is to gradually deprecate the concept of memory_usage in favor of providing both host and container memory usage. However, I agree that replacing the underlying metric for "memory_usage" could be a breaking change. In this PR, I will simply add the new metrics to be saved here without replacing the existing metric.
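
The additive approach described above might look roughly like the following sketch. The helper `build_memory_queries`, the `memory_usage` / `*_host` dictionary keys, and the pre-existing metric name `ray_node_mem_used` are assumptions made for illustration. `ray_node_mem_total`, `ray_node_mem_used_host`, and `ray_node_mem_total_host` appear in the diff under discussion. `sf` is assumed to be a Prometheus label-selector suffix.

```python
def build_memory_queries(sf: str = "") -> dict:
    """Build Prometheus query strings, keeping old metrics and adding new ones.

    `sf` is assumed to be a label-selector suffix such as '{instance="node-1"}'.
    """
    return {
        # Existing queries stay untouched, so "memory_usage" keeps its meaning.
        "memory_usage": f"ray_node_mem_used{sf}",
        "total_memory": f"ray_node_mem_total{sf}",
        # Host-level queries are saved under new (hypothetical) keys.
        "memory_usage_host": f"ray_node_mem_used_host{sf}",
        "total_memory_host": f"ray_node_mem_total_host{sf}",
    }
```

Because the existing keys are left in place, anything that already consumes `memory_usage` or `total_memory` would keep seeing the same values, and the host-level series can be adopted gradually.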

Comment thread python/ray/_private/utils.py
Comment thread release/ray_release/command_runner/_prometheus_metrics.py

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 6a45e03.

Comment thread python/ray/_private/utils.py
Comment thread python/ray/dashboard/modules/metrics/dashboards/common.py

@rueian rueian left a comment

Contributor

LGTM. Thanks!

@Kunchd Kunchd requested a review from aslonnie May 6, 2026 21:44

@aslonnie aslonnie left a comment

Contributor

@sai-miduthuri, could you review the release-test-related changes?


Kunchd commented May 11, 2026

Contributor Author

@edoakes Could you help me merge this? Thanks!

@edoakes edoakes merged commit d855e5d into ray-project:master May 11, 2026
6 checks passed
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
ray-project#63111)

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
