[Core] Add host vs container memory usage distinction to memory panels #63111
Signed-off-by: davik <davik@anyscale.com>
Code Review
This pull request introduces separate tracking for host-level and container-level (cgroup) memory metrics within Ray. It adds a new utility function get_cgroup_mem_stats to handle both cgroup v1 and v2, updates the reporter agent to export these as distinct Prometheus metrics, and refactors the dashboard panels to visualize both host and container memory usage. The review feedback suggests improving the robustness of the cgroup file parsing by adding error handling and input validation, and recommends optimizing performance by reducing redundant calls to memory-fetching utilities.
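As a rough illustration of the kind of parsing the summary describes, here is a minimal sketch of reading container memory from cgroup v2 and v1 files, with the error handling and input validation the review asks for. It is not the PR's get_cgroup_mem_stats: the function name read_cgroup_mem_stats, the root parameter, the return shape, and the choice to subtract inactive file cache are all assumptions made for this example. The file names are the standard kernel cgroup memory interface files.

```python
from pathlib import Path
from typing import Optional, Tuple


def _read_int(path: Path) -> Optional[int]:
    """Return the integer stored in a cgroup file, or None if unreadable or unlimited."""
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    if text == "max":  # cgroup v2 writes "max" when no limit is set
        return None
    try:
        return int(text)
    except ValueError:
        return None


def _read_stat(path: Path, key: str) -> int:
    """Return one counter from a memory.stat file, or 0 if missing or malformed."""
    try:
        lines = path.read_text().splitlines()
    except OSError:
        return 0
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] == key and parts[1].isdigit():
            return int(parts[1])
    return 0


def read_cgroup_mem_stats(root: str = "/sys/fs/cgroup") -> Optional[Tuple[int, Optional[int]]]:
    """Hypothetical helper: return (used_bytes, limit_bytes), or None outside a cgroup.

    Assumption: "used" excludes inactive file cache, which is one common
    convention and not necessarily what the PR does.
    """
    base = Path(root)
    v1 = base / "memory"
    candidates = [
        # cgroup v2 layout
        (base / "memory.current", base / "memory.max", base / "memory.stat", "inactive_file"),
        # cgroup v1 layout
        (v1 / "memory.usage_in_bytes", v1 / "memory.limit_in_bytes", v1 / "memory.stat", "total_inactive_file"),
    ]
    for usage_file, limit_file, stat_file, inactive_key in candidates:
        usage = _read_int(usage_file)
        if usage is None:
            continue  # layout not present or unreadable; try the next one
        inactive = _read_stat(stat_file, inactive_key)
        return max(usage - inactive, 0), _read_int(limit_file)
    return None
```

Each file is read once per call, which also addresses the review's point about avoiding redundant reads. Under cgroup v1 an unlimited container reports a very large limit rather than "max", so a caller would likely need to cap the limit at the host total.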
        query=f"ray_node_mem_used_host{sf}", **kwargs
    ),
    "total_memory": client.query_prometheus(
-       query=f"ray_node_mem_total{sf}", **kwargs
+       query=f"ray_node_mem_total_host{sf}", **kwargs
so the original metrics will not be used/captured any more?
this seems to be changing what memory_usage means?
maybe separate this change into another PR?
The goal is to gradually deprecate the concept of memory_usage in favor of providing both host and container memory usage. However, I agree that replacing the underlying metric for "memory_usage" can potentially be a breaking change. I will simply add the new metrics to be saved here without replacing the existing metric in this PR.
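A minimal sketch of what "adding the new metrics without replacing the existing metric" could look like. The query callable stands in for client.query_prometheus from the diff. The metric name ray_node_mem_used and the result keys memory_usage_host and total_memory_host are assumptions for illustration, not names confirmed by the PR.

```python
from typing import Any, Callable, Dict


def collect_memory_metrics(query: Callable[..., Any], sf: str = "", **kwargs: Any) -> Dict[str, Any]:
    """Query the original metrics and the new host-level ones side by side.

    `query` is any callable accepting query=<PromQL string> plus extra keyword
    arguments, mirroring how client.query_prometheus is called in the diff.
    `sf` is the label selector suffix appended to each metric name.
    """
    return {
        # Existing keys keep their original meaning, so consumers are unaffected.
        "memory_usage": query(query=f"ray_node_mem_used{sf}", **kwargs),
        "total_memory": query(query=f"ray_node_mem_total{sf}", **kwargs),
        # Hypothetical new keys for the host-level series.
        "memory_usage_host": query(query=f"ray_node_mem_used_host{sf}", **kwargs),
        "total_memory_host": query(query=f"ray_node_mem_total_host{sf}", **kwargs),
    }
```

Because the existing keys are untouched, anything that reads memory_usage today sees the same series as before, and the host-level data arrives as a purely additive change.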
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Reviewed by Cursor Bugbot for commit 6a45e03.
@sai-miduthuri, could you review the release-test-related changes?
@edoakes Could you help me merge this? Thanks!

Description
Our existing dashboard tracks memory utilization only at the container level when we detect that the node is running within a container. This can be misleading: container memory utilization can be considerably lower than the actual host memory utilization when non-negligible processes run on the host outside the container. In such scenarios, we have observed kernel OOMs even while the memory utilization metric looked healthy, because the metric did not reflect the entire system's memory utilization.
This PR emits both host-level and container-level memory utilization to provide a more comprehensive view of system memory. This way, we can more accurately determine when the system is under memory pressure at both the container and host layers.
The image below shows an example of the change. The memory utilization panels now let us view either the container-level or host-level memory usage stats.
<img width="1903" height="645" alt="image" src="https://github.com/user-attachments/assets/b0f45272-c079-4512-8e1a-832dfa15706a" />
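To make the failure mode in the description concrete, here is a small worked example with made-up numbers. The MemSample type, the pressure_report function, and the 0.9 threshold are all hypothetical and are not part of the PR. It shows how a container can look healthy while the host it runs on is close to exhausting memory.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class MemSample:
    used_bytes: int
    total_bytes: int

    @property
    def utilization(self) -> float:
        # Guard against a zero or missing total rather than dividing by zero.
        return self.used_bytes / self.total_bytes if self.total_bytes > 0 else 0.0


def pressure_report(
    host: MemSample, container: MemSample, threshold: float = 0.9
) -> Dict[str, Union[float, bool]]:
    """Report utilization at both layers and flag each one against a threshold."""
    return {
        "host_utilization": host.utilization,
        "container_utilization": container.utilization,
        "host_under_pressure": host.utilization >= threshold,
        "container_under_pressure": container.utilization >= threshold,
    }


GIB = 1024 ** 3
# Illustrative values only: a 32 GiB container on a 64 GiB host where
# processes outside the container hold most of the host's memory.
host = MemSample(used_bytes=60 * GIB, total_bytes=64 * GIB)
container = MemSample(used_bytes=12 * GIB, total_bytes=32 * GIB)
report = pressure_report(host, container)
```

With these numbers the container sits at 37.5% utilization while the host is at 93.75%, so a dashboard showing only the container figure would miss the host-level pressure.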

Related issues
Additional information