[dashboard] Show TPU stats on Cluster tab by spencer-p · Pull Request #63774 · ray-project/ray

spencer-p · 2026-06-01T17:08:00Z

Description

This change shows TPU tensor core utilization and High Bandwidth Memory (HBM) utilization in the Ray cluster dashboard.

* TPU worker rows show tensor core utilization and HBM usage.
* If the cluster has TPUs and no GPUs, the column names change from "GPU" and "GRAM" to "TPU" and "HBM" respectively.
* If the cluster is mixed, a generic title is shown.

Example screenshot:

Related issues

#57829

gemini-code-assist

Code Review

This pull request introduces TPU monitoring support to the Ray dashboard, updating components like GPUColumn, GRAMColumn, and index.tsx to dynamically display TPU metrics alongside GPU metrics, and adding a new TPUColumn component. It also updates the reporter agent and models to handle TPU stats. However, several critical issues were identified in the review: a potential runtime crash in the reporter agent due to treating a Pydantic model as a dictionary, and multiple potential TypeErrors in the frontend code if GPU or TPU data is null rather than undefined.
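The null-versus-undefined concern in the review can be illustrated with a small guard. This is only a sketch: the `AcceleratorStat` and `NodeStats` shapes and the helper names are hypothetical and are not taken from the dashboard source.

```typescript
// Hypothetical shapes for illustration; the real dashboard types may differ.
type AcceleratorStat = { index: number; utilization?: number };
type NodeStats = {
  gpus?: AcceleratorStat[] | null;
  tpus?: AcceleratorStat[] | null;
};

// `== null` is true for both null and undefined, so a JSON `null` sent by the
// backend takes the same path as a field that is missing entirely.
function acceleratorCount(stats: AcceleratorStat[] | null | undefined): number {
  return stats == null ? 0 : stats.length;
}

function hasTpus(node: NodeStats): boolean {
  return acceleratorCount(node.tpus) > 0;
}

function hasGpus(node: NodeStats): boolean {
  return acceleratorCount(node.gpus) > 0;
}
```

A check written as `node.tpus !== undefined && node.tpus.length > 0` would throw a TypeError when the field is `null`, which is the failure mode the review describes.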

spencer-p · 2026-06-01T18:32:17Z

I don't like plumbing through alternate strings and nodes into the GPU column too much. It doesn't seem like it will scale well for other accelerators.

I think it might be preferable to have a new TPU column that is conditionally hidden, and hide the GPU columns iff there are TPUs and no GPUs. It would be ugly for a cluster with a mix of TPUs and GPUs, but I suspect that is rare.

edoakes · 2026-06-01T19:06:00Z

> I don't like plumbing through alternate strings and nodes into the GPU column too much. It doesn't seem like it will scale well for other accelerators.
>
> I think it might be preferable to have a new TPU column that is conditionally hidden, and hide the GPU columns iff there are TPUs and no GPUs. It would be ugly for a cluster with a mix of TPUs and GPUs, but I suspect that is rare.

Agreed that mixing accelerator types would be rare, but should still be supported. Could we abstract the column into some kind of base "accelerator" and then pass an enum that would key into the column names?
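One way to read this suggestion is an enum that keys into a table of column names. The sketch below is an assumption about the shape of such an abstraction, not the code that landed: `AcceleratorKind`, `COLUMN_LABELS`, and `columnTitles` are hypothetical names, the "GRAM / HBM" memory title for mixed clusters is assumed by analogy with "GPU / TPU", and falling back to GPU labels when no accelerators are present is also an assumption.

```typescript
// Hypothetical enum of accelerator kinds; more kinds could be added later.
enum AcceleratorKind {
  GPU = "GPU",
  TPU = "TPU",
}

type ColumnTitles = { util: string; memory: string };

// Column names keyed by accelerator kind.
const COLUMN_LABELS: Record<AcceleratorKind, ColumnTitles> = {
  [AcceleratorKind.GPU]: { util: "GPU", memory: "GRAM" },
  [AcceleratorKind.TPU]: { util: "TPU", memory: "HBM" },
};

// Fixed display order so mixed clusters always read "GPU / TPU".
const DISPLAY_ORDER: AcceleratorKind[] = [AcceleratorKind.GPU, AcceleratorKind.TPU];

function columnTitles(kindsInCluster: AcceleratorKind[]): ColumnTitles {
  // Keep each kind at most once, in display order.
  const present = DISPLAY_ORDER.filter((k) => kindsInCluster.indexOf(k) !== -1);
  // Assumption: with no accelerators at all, keep the existing GPU labels.
  const shown = present.length > 0 ? present : [AcceleratorKind.GPU];
  return {
    util: shown.map((k) => COLUMN_LABELS[k].util).join(" / "),
    memory: shown.map((k) => COLUMN_LABELS[k].memory).join(" / "),
  };
}
```

With this shape, supporting another accelerator would mean adding one enum member and one table entry rather than plumbing new strings through the column component.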

spencer-p · 2026-06-06T00:31:00Z

> Could we abstract the column into some kind of base "accelerator"

Agreed. Just pushed all new changes in this direction.

spencer-p · 2026-06-08T19:06:22Z

The Actors tab now has stubs for TPUs; I'd prefer to complete that in another PR so we can land incremental changes and avoid scope creep in this one.

The Cluster tab change looks great. Here's what a 4x4 cluster looks like:

And here's a dashboard showing one TPU v5 chip and an NVIDIA L4 on the same page:

Note the unified column with dynamic title :)

edoakes · 2026-06-08T20:09:12Z

> Note the unified column with dynamic title :)

Very nice! Exactly what I was thinking :)

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 658e440.

## Description

This change shows TPU tensor core utilization and High Bandwidth Memory utilization in the ray cluster dashboard.

* TPU worker rows show tensor core util and HBM usage.
* If the cluster has TPUs and no GPUs, the column names change from "GPU" and "GRAM" to "TPU" and "HBM" respectively.
* If the cluster is mixed, both titles are shown, like "GPU / TPU".

Example screenshot:

<img width="1627" height="573" alt="image" src="https://github.com/user-attachments/assets/e093ac67-ce96-456c-8733-e306945549e4" />

## Related issues

ray-project#57829

---------

Signed-off-by: Spencer Peterson <spencerjp@google.com>

Metrics from the TPU device plugin have some quirks that cause issues for the dashboard:

1. Tensor core and memory bandwidth utilization are indexed by host (i.e. zero to N-1, where N is chips per host), while the other metrics are "runtime metrics" and are indexed across the complete running slice. The two sets of indices almost never line up precisely.
2. V7X clusters I've tested against only have the host metrics, which means the dashboard must gracefully degrade to show memory utilization without absolute byte info.

To work around this, the reporter agent can re-index the non-host metrics to share the same indices as the host metrics by assigning them in-order. Additionally, I've cleaned up the TPU path for the accelerator rows to gracefully degrade on the missing metrics paths.

## Related issues

- Addresses bugs in #63774
- To unblock chip-to-pid mapping in #63976

## Additional information

Example dashboard with correct chip ids and absolute memory:

<img width="1473" height="1057" alt="image" src="https://github.com/user-attachments/assets/06b8d492-4db8-4247-99ea-72daef66716b" />

Example dashboard missing absolute GiB memory (some 0.0 rows show different visual bars because it was animating while screenshotting):

<img width="1378" height="1120" alt="image" src="https://github.com/user-attachments/assets/f807bcbf-0c7a-4fa7-a8e3-94c25ad16e0b" />

---------

Signed-off-by: Spencer Peterson <spencerjp@google.com>
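The in-order re-indexing described above could look roughly like the following. The actual change lives in the Python reporter agent; this sketch is in TypeScript only to match the dashboard client, and every type and field name in it (`HostMetric`, `RuntimeMetric`, `ChipStats`, `mergeTpuMetrics`) is hypothetical. It also assumes that sorting each list by its own index gives matching chip order.

```typescript
// Hypothetical metric shapes; field names are illustrative only.
type HostMetric = { index: number; tensorCoreUtil: number };
type RuntimeMetric = { index: number; memoryUsedBytes: number; memoryTotalBytes: number };
type ChipStats = {
  index: number;
  tensorCoreUtil: number;
  memoryUsedBytes?: number;
  memoryTotalBytes?: number;
};

function mergeTpuMetrics(host: HostMetric[], runtime: RuntimeMetric[]): ChipStats[] {
  // Host metrics are indexed 0..N-1 per host; runtime metrics are indexed
  // across the whole slice. Sort both and pair them positionally.
  const hostSorted = host.slice().sort((a, b) => a.index - b.index);
  const runtimeSorted = runtime.slice().sort((a, b) => a.index - b.index);

  return hostSorted.map((h, position) => {
    const r: RuntimeMetric | undefined = runtimeSorted[position];
    if (r === undefined) {
      // No runtime metrics for this chip: degrade to host metrics only.
      return { index: h.index, tensorCoreUtil: h.tensorCoreUtil };
    }
    return {
      index: h.index, // the host index wins, so both metric sets share it
      tensorCoreUtil: h.tensorCoreUtil,
      memoryUsedBytes: r.memoryUsedBytes,
      memoryTotalBytes: r.memoryTotalBytes,
    };
  });
}
```

When the runtime list is empty, as on the clusters that only report host metrics, every chip comes back without byte fields and the caller can fall back to showing utilization only.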


spencer-p requested a review from a team as a code owner June 1, 2026 17:08

gemini-code-assist Bot reviewed Jun 1, 2026

cursor Bot reviewed Jun 1, 2026

Comment thread python/ray/dashboard/client/src/pages/node/index.tsx Outdated

spencer-p force-pushed the tpu-dashboard-util branch from 32d1a71 to a1b71bd Compare June 1, 2026 19:13

ray-gardener Bot added the dashboard, core, observability, and community-contribution labels Jun 1, 2026

spencer-p force-pushed the tpu-dashboard-util branch from a1b71bd to 76c7c7a Compare June 6, 2026 00:29

cursor Bot reviewed Jun 6, 2026

Comment thread python/ray/dashboard/client/src/pages/node/AcceleratorColumn.tsx

Comment thread python/ray/dashboard/client/src/components/ActorTable.tsx

cursor Bot reviewed Jun 6, 2026

Comment thread python/ray/dashboard/client/src/pages/node/index.tsx Outdated

spencer-p force-pushed the tpu-dashboard-util branch from c83b4dc to 5696790 Compare June 8, 2026 17:16

cursor Bot reviewed Jun 8, 2026

Comment thread python/ray/dashboard/modules/node/datacenter.py Outdated

Comment thread python/ray/dashboard/modules/node/datacenter.py Outdated

spencer-p force-pushed the tpu-dashboard-util branch from afab5a5 to 6f3a676 Compare June 8, 2026 19:03

edoakes added the go label ("add ONLY when ready to merge, run all tests") Jun 8, 2026

edoakes approved these changes Jun 8, 2026

edoakes enabled auto-merge (squash) June 8, 2026 20:09

github-actions Bot disabled auto-merge June 8, 2026 20:09

cursor Bot reviewed Jun 8, 2026

Comment thread python/ray/dashboard/client/src/pages/node/index.tsx

spencer-p force-pushed the tpu-dashboard-util branch from 658e440 to dfa4c6b Compare June 8, 2026 21:00

spencer-p added 5 commits June 8, 2026 21:02

[dashboard] TPU utilization on Cluster, Actor tabs

f2fc11a

Signed-off-by: Spencer Peterson <spencerjp@google.com>

[dashboard] TPU nits

b576839

- memory usage is shown in GiB if >1024MiB
- omit placeholder chips

Signed-off-by: Spencer Peterson <spencerjp@google.com>
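The GiB threshold in this commit could be expressed as a small formatter. This is a sketch under assumptions: `formatMemoryMiB` is a hypothetical name, and the one-decimal GiB precision and the rounding of MiB values are guesses rather than what the commit does.

```typescript
// Hypothetical helper: format a memory amount given in MiB.
function formatMemoryMiB(mib: number): string {
  // Switch units only when the value is strictly greater than 1024 MiB,
  // matching the ">1024MiB" wording in the commit message.
  if (mib > 1024) {
    return `${(mib / 1024).toFixed(1)} GiB`;
  }
  return `${Math.round(mib)} MiB`;
}
```

For example, `formatMemoryMiB(512)` yields "512 MiB" and `formatMemoryMiB(2048)` yields "2.0 GiB".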

[dashboard] Refactor into generic Accelerator

7035e6c

Signed-off-by: Spencer Peterson <spencerjp@google.com>

[dashboard] Dynamically label accelerator columns

9bb0f62

Signed-off-by: Spencer Peterson <spencerjp@google.com>

formatting fixes

2994215

Signed-off-by: Spencer Peterson <spencerjp@google.com>

spencer-p added 4 commits June 8, 2026 21:02

fix TPU metrics types in dashboard reporter

12ea50a

Signed-off-by: Spencer Peterson <spencerjp@google.com>

[dashboard] infer pid for TPUs from resources

4038cf4

Signed-off-by: Spencer Peterson <spencerjp@google.com>

[dashboard] Refactor Accelerator column labeling

e6bad6e

Signed-off-by: Spencer Peterson <spencerjp@google.com>

[dashboard] stub TPU stats for Actors tab

cbaebb8

Signed-off-by: Spencer Peterson <spencerjp@google.com>

spencer-p force-pushed the tpu-dashboard-util branch from dfa4c6b to cbaebb8 Compare June 8, 2026 21:02

edoakes enabled auto-merge (squash) June 8, 2026 21:06

edoakes merged commit 29abd06 into ray-project:master Jun 8, 2026
7 checks passed

spencer-p mentioned this pull request Jun 9, 2026

[dashboard] TPU metrics per-actor in Actor table #63976

Draft

spencer-p mentioned this pull request Jun 10, 2026

[dashboard] Fix TPU metrics #63998

Merged

edoakes merged 9 commits into ray-project:master from spencer-p:tpu-dashboard-util