[dashboard] Show TPU stats on Cluster tab #63774

Merged: edoakes merged 9 commits into ray-project:master from spencer-p:tpu-dashboard-util on Jun 8, 2026
@spencer-p spencer-p commented Jun 1, 2026

Contributor

Description

This change shows TPU tensor core utilization and High Bandwidth Memory (HBM) utilization in the Ray cluster dashboard.

  • TPU worker rows show tensor core util and HBM usage.
  • If the cluster has TPUs and no GPUs, the column names change from "GPU" and "GRAM" to "TPU" and "HBM" respectively.
  • If the cluster is mixed, a generic title is shown.

Example screenshot:
image

Related issues

#57829

@spencer-p spencer-p requested a review from a team as a code owner June 1, 2026 17:08

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces TPU monitoring support to the Ray dashboard, updating components like GPUColumn, GRAMColumn, and index.tsx to dynamically display TPU metrics alongside GPU metrics, and adding a new TPUColumn component. It also updates the reporter agent and models to handle TPU stats. However, several critical issues were identified in the review: a potential runtime crash in the reporter agent due to treating a Pydantic model as a dictionary, and multiple potential TypeErrors in the frontend code if GPU or TPU data is null rather than undefined.
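The null-versus-undefined concern in the review can be illustrated with a small sketch. It assumes a hypothetical `NodeStats` shape with an optional `gpus` array and a `utilizationGpu` field; these names are illustrative and are not taken from the dashboard source. The relevant language rule is that destructuring and parameter defaults apply only to `undefined`, so a `null` payload has to be handled explicitly, for example with `??`.

```typescript
// Hypothetical shapes for illustration only; the real dashboard types may differ.
type GpuStat = { utilizationGpu: number };
type NodeStats = { gpus?: GpuStat[] | null };

// Returns the mean utilization, or undefined when the node reports no GPUs.
function averageGpuUtilization(node: NodeStats): number | undefined {
  // `??` falls back for both null and undefined, unlike a destructuring default.
  const gpus = node.gpus ?? [];
  if (gpus.length === 0) {
    return undefined;
  }
  const total = gpus.reduce((sum, gpu) => sum + gpu.utilizationGpu, 0);
  return total / gpus.length;
}
```

The same guard would apply to a TPU array under the same assumption.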

Comment thread python/ray/dashboard/modules/reporter/reporter_agent.py Outdated
Comment thread python/ray/dashboard/client/src/pages/node/GPUColumn.tsx Outdated
Comment thread python/ray/dashboard/client/src/pages/node/GRAMColumn.tsx Outdated
Comment thread python/ray/dashboard/client/src/pages/node/index.tsx Outdated
Comment thread python/ray/dashboard/client/src/pages/node/index.tsx Outdated
@spencer-p

Contributor Author

I don't like plumbing through alternate strings and nodes into the GPU column too much. It doesn't seem like it will scale well for other accelerators.

I think it might be preferable to have a new TPU column that is conditionally hidden, and hide the GPU columns iff there are TPUs and no GPUs. It would be ugly for a cluster with a mix of TPUs and GPUs, but I suspect that is rare.

edoakes commented Jun 1, 2026

Collaborator

> I don't like plumbing through alternate strings and nodes into the GPU column too much. It doesn't seem like it will scale well for other accelerators.
>
> I think it might be preferable to have a new TPU column that is conditionally hidden, and hide the GPU columns iff there are TPUs and no GPUs. It would be ugly for a cluster with a mix of TPUs and GPUs, but I suspect that is rare.

Agreed that mixing accelerator types would be rare, but should still be supported. Could we abstract the column into some kind of base "accelerator" and then pass an enum that would key into the column names?
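One way to read this suggestion is a label table keyed by accelerator type. The sketch below is a guess at the shape of such an abstraction, not the code that was merged. `AcceleratorType`, `ACCELERATOR_LABELS`, `DISPLAY_ORDER`, and `columnTitle` are hypothetical names, a string-literal union stands in for the enum, and the joined "GRAM / HBM" memory title is assumed by analogy with the "GPU / TPU" title mentioned in the merged commit message.

```typescript
// Hypothetical names throughout; a string-literal union stands in for the enum.
type AcceleratorType = "GPU" | "TPU";
type AcceleratorLabels = { utilization: string; memory: string };

// Column names per accelerator type, as given in the PR description.
const ACCELERATOR_LABELS: Record<AcceleratorType, AcceleratorLabels> = {
  GPU: { utilization: "GPU", memory: "GRAM" },
  TPU: { utilization: "TPU", memory: "HBM" },
};

// Fixed order so a mixed cluster always renders "GPU / TPU".
const DISPLAY_ORDER: AcceleratorType[] = ["GPU", "TPU"];

function columnTitle(
  present: AcceleratorType[],
  field: keyof AcceleratorLabels,
): string {
  const shown = DISPLAY_ORDER.filter((type) => present.indexOf(type) !== -1);
  // Assumption: with no accelerators at all, keep the original GPU titles.
  const types = shown.length > 0 ? shown : DISPLAY_ORDER.slice(0, 1);
  return types.map((type) => ACCELERATOR_LABELS[type][field]).join(" / ");
}
```

Under this layout, supporting another accelerator would mean adding one entry to the table and to the display order rather than threading new strings through each column component.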

@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from 32d1a71 to a1b71bd Compare June 1, 2026 19:13
@ray-gardener ray-gardener Bot added the dashboard, core, observability, and community-contribution labels Jun 1, 2026
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from a1b71bd to 76c7c7a Compare June 6, 2026 00:29
Comment thread python/ray/dashboard/client/src/pages/node/AcceleratorColumn.tsx
Comment thread python/ray/dashboard/client/src/components/ActorTable.tsx
@spencer-p

Contributor Author

> Could we abstract the column into some kind of base "accelerator"

Agreed. Just pushed all new changes in this direction.

Comment thread python/ray/dashboard/client/src/pages/node/index.tsx Outdated
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from c83b4dc to 5696790 Compare June 8, 2026 17:16
Comment thread python/ray/dashboard/modules/node/datacenter.py Outdated
Comment thread python/ray/dashboard/modules/node/datacenter.py Outdated
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from afab5a5 to 6f3a676 Compare June 8, 2026 19:03
@spencer-p

Contributor Author

The Actors tab now has stubs for TPUs. I'd prefer to complete that in another PR so we can land incremental changes and avoid too much scope creep in this one.

The Cluster tab change looks great. Here's what a 4x4 cluster looks like:

image

And here's a dashboard showing one TPU v5 chip and an NVIDIA L4 on the same page:

image

Note the unified column with dynamic title :)

@edoakes edoakes added the go label Jun 8, 2026
edoakes commented Jun 8, 2026

Collaborator

> Note the unified column with dynamic title :)

Very nice! Exactly what I was thinking :)

@edoakes edoakes enabled auto-merge (squash) June 8, 2026 20:09
@github-actions github-actions Bot disabled auto-merge June 8, 2026 20:09

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 658e440.

Comment thread python/ray/dashboard/client/src/pages/node/index.tsx
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from 658e440 to dfa4c6b Compare June 8, 2026 21:00
spencer-p added 5 commits June 8, 2026 21:02
Signed-off-by: Spencer Peterson <spencerjp@google.com>
- memory usage is shown in GiB if >1024MiB
- omit placeholder chips
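
The memory display rule in this commit note could look roughly like the sketch below. Only the threshold (GiB above 1024 MiB) comes from the note; the function name `formatMemoryMiB`, the one-decimal GiB precision, and the rounding of MiB values are assumptions.

```typescript
// Hypothetical helper: formats a memory amount given in MiB.
function formatMemoryMiB(mib: number): string {
  if (mib > 1024) {
    // Above 1024 MiB, switch to GiB (one decimal place is an assumption).
    return `${(mib / 1024).toFixed(1)} GiB`;
  }
  return `${Math.round(mib)} MiB`;
}
```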

Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
spencer-p added 4 commits June 8, 2026 21:02
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
@spencer-p spencer-p force-pushed the tpu-dashboard-util branch from dfa4c6b to cbaebb8 Compare June 8, 2026 21:02
@edoakes edoakes enabled auto-merge (squash) June 8, 2026 21:06
@edoakes edoakes merged commit 29abd06 into ray-project:master Jun 8, 2026
7 checks passed
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Jun 10, 2026
## Description

This change shows TPU tensor core utilization and High Bandwidth Memory
utilization in the ray cluster dashboard.

* TPU worker rows show tensor core util and HBM usage.
* If the cluster has TPUs and no GPUs, the column names change from
"GPU" and "GRAM" to "TPU" and "HBM" respectively.
* If the cluster is mixed, both titles are shown, like "GPU / TPU".

Example screenshot:
<img width="1627" height="573" alt="image"
src="https://github.com/user-attachments/assets/e093ac67-ce96-456c-8733-e306945549e4"
/>


## Related issues

ray-project#57829

---------

Signed-off-by: Spencer Peterson <spencerjp@google.com>
edoakes pushed a commit that referenced this pull request Jun 11, 2026
Metrics from the TPU device plugin have some quirks that cause issues for the dashboard:

1. Tensor core and memory bandwidth utilization are indexed by host (i.e. zero to N-1, where N is chips per host), while the other metrics are "runtime metrics" and are indexed across the complete running slice. The two sets of indices only occasionally line up precisely.
2. V7X clusters I've tested against only have the host metrics, which means the dashboard must gracefully degrade to show memory utilization without absolute byte info.

To work around this, the reporter agent can re-index the non-host metrics to share the same indices as the host metrics by assigning them in order.

Additionally, I've cleaned up the TPU path for the accelerator rows to gracefully degrade when metrics are missing.
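
The in-order re-indexing described above can be sketched as follows. This is an illustration in TypeScript, although the reporter agent itself is Python, and it is not the merged implementation. `ChipSample` and `reindexToHost` are hypothetical names, and dropping runtime samples that have no matching host index is an assumption.

```typescript
// Hypothetical shape: one metric sample for one chip, keyed by the reported index.
type ChipSample = { index: number; value: number };

// Maps slice-wide runtime samples onto host-local indices by position:
// the k-th smallest runtime index is assigned the k-th smallest host index.
function reindexToHost(
  hostIndices: number[],
  runtimeSamples: ChipSample[],
): ChipSample[] {
  const hosts = hostIndices.slice().sort((a, b) => a - b);
  const runtime = runtimeSamples.slice().sort((a, b) => a.index - b.index);
  const result: ChipSample[] = [];
  runtime.forEach((sample, position) => {
    const hostIndex = hosts[position];
    // Assumption: samples beyond the number of host chips are dropped.
    if (hostIndex !== undefined) {
      result.push({ index: hostIndex, value: sample.value });
    }
  });
  return result;
}
```

For a host with four chips whose runtime metrics arrive with slice-wide indices 8 through 11, this would relabel them 0 through 3 so they join with the host-indexed utilization metrics.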

## Related issues

- Addresses bugs in #63774
- To unblock chip-to-pid mapping in #63976

## Additional information

Example dashboard with correct chip ids and absolute memory:
<img width="1473" height="1057" alt="image"
src="https://github.com/user-attachments/assets/06b8d492-4db8-4247-99ea-72daef66716b"
/>


Example dashboard missing absolute GiB memory (some 0.0 rows show
different visual bars because it was animating while screenshotting):
<img width="1378" height="1120" alt="image"
src="https://github.com/user-attachments/assets/f807bcbf-0c7a-4fa7-a8e3-94c25ad16e0b"
/>

---------

Signed-off-by: Spencer Peterson <spencerjp@google.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
Labels

- community-contribution: Contributed by the community
- core: Issues that should be addressed in Ray Core
- dashboard: Issues specific to the Ray Dashboard
- go: add ONLY when ready to merge, run all tests
- observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

2 participants