[dashboard] Show TPU stats on Cluster tab #63774
Code Review
This pull request introduces TPU monitoring support to the Ray dashboard, updating components like GPUColumn, GRAMColumn, and index.tsx to dynamically display TPU metrics alongside GPU metrics, and adding a new TPUColumn component. It also updates the reporter agent and models to handle TPU stats. However, several critical issues were identified in the review: a potential runtime crash in the reporter agent due to treating a Pydantic model as a dictionary, and multiple potential TypeErrors in the frontend code if GPU or TPU data is null rather than undefined.
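For the frontend half of that finding, here is a minimal sketch of a guard that treats `null` and `undefined` the same way. The `AcceleratorStat` type and the `averageUtilization` helper are hypothetical names for illustration, not the dashboard's actual code.

```typescript
// Hypothetical shape; the real dashboard types may differ.
type AcceleratorStat = { utilization?: number | null };

// A check like `devices !== undefined` lets null through and crashes on `.map`.
// `?? []` covers both null and undefined.
function averageUtilization(
  devices: AcceleratorStat[] | null | undefined,
): number | undefined {
  const values = (devices ?? [])
    .map((d) => d.utilization)
    .filter((u): u is number => typeof u === "number");
  if (values.length === 0) {
    // Nothing to average: let the caller render a placeholder.
    return undefined;
  }
  return values.reduce((sum, u) => sum + u, 0) / values.length;
}
```

Returning `undefined` for the empty case keeps the "no data" decision in the rendering code rather than encoding it as a fake 0%.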
I don't like plumbing through alternate strings and nodes into the GPU column too much. It doesn't seem like it will scale well for other accelerators. I think it might be preferable to have a new TPU column that is conditionally hidden, and hide the GPU columns iff there are TPUs and no GPUs. It would be ugly for a cluster with a mix of TPUs and GPUs, but I suspect that is rare.
Agreed that mixing accelerator types would be rare, but should still be supported. Could we abstract the column into some kind of base "accelerator" and then pass an enum that would key into the column names?
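A rough sketch of what that suggestion could look like, assuming a string-union "enum" and a lookup table. `AcceleratorKind`, `COLUMN_LABELS`, and `columnLabels` are hypothetical names; only the four label strings come from this PR.

```typescript
// Hypothetical accelerator key; a string union stands in for an enum.
type AcceleratorKind = "GPU" | "TPU";

type ColumnLabels = { utilization: string; memory: string };

// One table keyed by accelerator kind, so adding another accelerator
// means adding one entry rather than plumbing new strings through props.
const COLUMN_LABELS: Record<AcceleratorKind, ColumnLabels> = {
  GPU: { utilization: "GPU", memory: "GRAM" },
  TPU: { utilization: "TPU", memory: "HBM" },
};

function columnLabels(kind: AcceleratorKind): ColumnLabels {
  return COLUMN_LABELS[kind];
}
```

Because `Record<AcceleratorKind, ColumnLabels>` is exhaustive, extending the union without adding labels becomes a compile error.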
Agreed. Just pushed all new changes in this direction.
Very nice! Exactly what I was thinking :)
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 658e440.
- memory usage is shown in GiB if >1024MiB
- omit placeholder chips

Signed-off-by: Spencer Peterson <spencerjp@google.com>
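The memory-unit rule from that commit message could be sketched roughly as follows. `formatMemory` is a hypothetical helper, and the one-decimal precision is an assumption; the commit only states the 1024 MiB threshold.

```typescript
// Hypothetical formatter: switch from MiB to GiB above 1024 MiB.
function formatMemory(mib: number): string {
  if (mib > 1024) {
    // Assumed precision: one decimal place.
    return `${(mib / 1024).toFixed(1)} GiB`;
  }
  return `${Math.round(mib)} MiB`;
}
```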
## Description

This change shows TPU tensor core utilization and High Bandwidth Memory utilization in the ray cluster dashboard.

* TPU worker rows show tensor core util and HBM usage.
* If the cluster has TPUs and no GPUs, the column names change from "GPU" and "GRAM" to "TPU" and "HBM" respectively.
* If the cluster is mixed, both titles are shown, like "GPU / TPU".

Example screenshot:

<img width="1627" height="573" alt="image" src="https://github.com/user-attachments/assets/e093ac67-ce96-456c-8733-e306945549e4" />

## Related issues

ray-project#57829

---------

Signed-off-by: Spencer Peterson <spencerjp@google.com>
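The column-title rules described above can be sketched as a small pure function. `columnTitles` is a hypothetical helper, and the mixed-cluster memory title "GRAM / HBM" is an assumption extrapolated from the "GPU / TPU" example; it is not taken from the actual change.

```typescript
type ColumnTitles = { utilization: string; memory: string };

// Hypothetical helper deriving header titles from what the cluster contains.
function columnTitles(hasGpus: boolean, hasTpus: boolean): ColumnTitles {
  if (hasGpus && hasTpus) {
    // Mixed cluster: show both. The memory title here is an assumed format.
    return { utilization: "GPU / TPU", memory: "GRAM / HBM" };
  }
  if (hasTpus) {
    return { utilization: "TPU", memory: "HBM" };
  }
  // GPU-only clusters and clusters with no accelerators keep the existing titles.
  return { utilization: "GPU", memory: "GRAM" };
}
```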
Metrics from the TPU device plugin have some quirks that cause issues for the dashboard:

1. Tensor core and memory bandwidth utilization are indexed by host (i.e. zero to N-1, where N is chips per host), while the other metrics are "runtime metrics" and indexed across the complete running slice. The two sets of indices only rarely line up exactly.
2. V7X clusters I've tested against only have the host metrics, which means the dashboard must gracefully degrade to show memory utilization without absolute byte info.

To work around this, the reporter agent can re-index the non-host metrics to share the same indices as the host metrics by assigning them in-order. Additionally, I've cleaned up the TPU path for the accelerator rows to gracefully degrade on the missing metrics paths.

## Related issues

- Addresses bugs in #63774
- To unblock chip-to-pid mapping in #63976

## Additional information

Example dashboard with correct chip ids and absolute memory:

<img width="1473" height="1057" alt="image" src="https://github.com/user-attachments/assets/06b8d492-4db8-4247-99ea-72daef66716b" />

Example dashboard missing absolute GiB memory (some 0.0 rows show different visual bars because it was animating while screenshotting):

<img width="1378" height="1120" alt="image" src="https://github.com/user-attachments/assets/f807bcbf-0c7a-4fa7-a8e3-94c25ad16e0b" />

---------

Signed-off-by: Spencer Peterson <spencerjp@google.com>
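The in-order re-indexing can be sketched roughly as below. The reporter agent is a Python component, so this TypeScript version is purely illustrative. The `Metric` shape and the `reindexToHost` name are hypothetical, and dropping surplus runtime metrics is an assumption.

```typescript
// Hypothetical metric sample: a chip index plus a value.
type Metric = { index: number; value: number };

// Assign slice-wide runtime metrics to host-local indices in sorted order:
// the k-th smallest runtime index takes the k-th smallest host index.
function reindexToHost(hostMetrics: Metric[], runtimeMetrics: Metric[]): Metric[] {
  const hostIndices = hostMetrics.map((m) => m.index).sort((a, b) => a - b);
  const ordered = [...runtimeMetrics].sort((a, b) => a.index - b.index);
  // Assumption: runtime metrics beyond the host chip count are dropped.
  return ordered
    .slice(0, hostIndices.length)
    .map((m, k) => ({ index: hostIndices[k], value: m.value }));
}
```

When no runtime metrics are reported, as on the V7X clusters mentioned above, the function returns an empty list and the caller can fall back to showing utilization without absolute byte values.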


