
[core][observability] Normalize OTel metric labels before Prometheus export #63744

Merged
edoakes merged 2 commits into ray-project:master from OneSizeFitsQuorum:txy/fix-otel-prometheus-label-normalization
Jun 6, 2026

Conversation

@OneSizeFitsQuorum

Contributor

Description

This PR normalizes OpenTelemetry metric attribute sets before handing observations to the Prometheus exporter.

Some Ray components can emit the same metric with heterogeneous attribute sets, for example when one data point includes SessionName and another data point for the same metric does not. With older opentelemetry-exporter-prometheus versions used by Ray's default compiled dependencies, metrics can reach Prometheus export with mixed label key sets. This can produce misaligned Prometheus label values, such as dataset="core_worker" or operator="<node ip>", making Ray Data dashboards misleading.
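
To make the failure mode concrete, here is a minimal sketch of how positional label pairing can go wrong. It assumes, for illustration only, an exporter that takes label names from the first data point and matches every later data point's values by position. The function `pair_positionally` and the sample label values are hypothetical; this is not the exporter's actual code.

```python
from typing import Dict, List


def pair_positionally(points: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Hypothetical exporter behavior: label names come from the first point,
    and each point's values are matched to those names by position."""
    label_names = list(points[0].keys())
    exported = []
    for attrs in points:
        values = list(attrs.values())
        # zip() silently truncates, so a shorter point shifts its values
        # onto the wrong label names instead of raising an error.
        exported.append(dict(zip(label_names, values)))
    return exported


points = [
    {"SessionName": "session_1", "dataset": "ds_0", "operator": "Map"},
    {"dataset": "ds_0", "operator": "Map"},  # same metric, SessionName missing
]
for labels in pair_positionally(points):
    print(labels)
```

In this sketch the second data point is exported with `SessionName="ds_0"` and `dataset="Map"`: every value has shifted one label to the left, which is the same kind of misalignment described above.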

This change makes Ray enforce a stable label schema at the export boundary:

  • For observable gauge, counter, and sum callbacks, collect the union of attribute keys for each metric and fill missing values with "".
  • For reconstructed histogram batches in the dashboard reporter, normalize all batch data points to the union of tag keys before recording them.
  • Add regression coverage for mixed attribute sets in observable metric callbacks and histogram export.
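
The union-and-fill step in the bullets above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the name `normalize_label_schema`, the `(attributes, value)` tuple shape, the sorted key order, and the sample `Component` key are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical shape for one observation: (attributes, value).
Observation = Tuple[Dict[str, str], float]


def normalize_label_schema(observations: List[Observation]) -> List[Observation]:
    """Pad every attribute dict to the union of keys seen for one metric."""
    all_keys = set()
    for attrs, _ in observations:
        all_keys.update(attrs)
    # Sorting is an assumption made here to get one stable key order.
    ordered_keys = sorted(all_keys)
    return [
        # Missing keys are filled with the empty string.
        ({key: attrs.get(key, "") for key in ordered_keys}, value)
        for attrs, value in observations
    ]


observations = [
    ({"SessionName": "session_1", "Component": "raylet"}, 1.0),
    ({"Component": "core_worker"}, 2.0),  # SessionName is missing here
]
normalized = normalize_label_schema(observations)
for attrs, value in normalized:
    print(attrs, value)
```

After normalization both observations carry the keys `Component` and `SessionName`, with `SessionName=""` on the second one. Running the function on its own output returns the same list, so a second normalization pass downstream would not change the result.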

This does not depend on upgrading opentelemetry-exporter-prometheus. It is also compatible with newer exporter versions that perform similar normalization internally; in that case Ray provides already-normalized observations and the exporter-side normalization is effectively idempotent.

Related issues

Fixes #63499.

Additional information

This PR intentionally keeps the fix in the Python export path. The issue is caused by heterogeneous label key sets, not by nondeterministic tag ordering, so this avoids changing the metric record path or upgrading OpenTelemetry dependencies.
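
A small sketch of why deterministic ordering alone would not help: sorting the tag keys fixes their order within each data point, but the two points still carry different key sets. The sample points are hypothetical.

```python
points = [
    {"operator": "Map", "SessionName": "session_1", "dataset": "ds_0"},
    {"operator": "Map", "dataset": "ds_0"},  # SessionName is missing
]

# Sorting makes the key order deterministic within each point...
sorted_keys = [tuple(sorted(p)) for p in points]
print(sorted_keys)

# ...but the two points still disagree on which keys exist.
missing = set(sorted_keys[0]) - set(sorted_keys[1])
print(sorted_keys[0] == sorted_keys[1], missing)
```

The second point still lacks `SessionName` after sorting, so only padding to a shared key set gives every data point the same label schema.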

Tests:

python -m py_compile python/ray/_private/telemetry/open_telemetry_metric_recorder.py python/ray/dashboard/modules/reporter/reporter_agent.py python/ray/dashboard/modules/reporter/tests/test_reporter.py python/ray/tests/test_open_telemetry_metric_recorder.py
git diff --check
@OneSizeFitsQuorum OneSizeFitsQuorum requested a review from a team as a code owner May 31, 2026 04:10

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request normalizes the label schema for OpenTelemetry metrics and histograms before they are exported. It ensures that all observations and data points in a batch share a consistent set of attribute keys by padding missing keys with empty strings, preventing issues with mixed attribute sets. Unit tests have been added to verify this normalization behavior for histograms, gauges, counters, and sums. There are no review comments, so I have no feedback to provide.

@OneSizeFitsQuorum OneSizeFitsQuorum force-pushed the txy/fix-otel-prometheus-label-normalization branch from b398012 to e393b87 Compare May 31, 2026 04:10
@ray-gardener ray-gardener Bot added the core, observability, and community-contribution labels May 31, 2026
@dancingactor

Contributor
Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
@OneSizeFitsQuorum OneSizeFitsQuorum force-pushed the txy/fix-otel-prometheus-label-normalization branch from e393b87 to 08806ea Compare June 1, 2026 01:59
@OneSizeFitsQuorum

Contributor Author

@dancingactor Thanks a lot for reviewing this! Already fixed this!

@OneSizeFitsQuorum

OneSizeFitsQuorum commented Jun 1, 2026

Contributor Author

The current microcheck failure appears unrelated to this PR.

Failing job:

ml: train v2 torch trainer tests (torchft image)
//python/ray/train/v2:test_torch_trainer

Failure:

python/ray/train/v2/tests/test_torch_trainer.py::test_torchft_linear_replica_failure
AttributeError: 'Future' object has no attribute '_fut'

The exception comes from torchft/ddp.py. This PR only changes OTel/dashboard metrics export code and related tests.

@dancingactor

Contributor

Yes, I think it's unrelated too. Maybe you can rebase onto the latest master branch to trigger the CI again. Thanks!

@OneSizeFitsQuorum

Contributor Author

@dancingactor All passed now!

@edoakes edoakes added the go label (add ONLY when ready to merge, run all tests) Jun 3, 2026
@OneSizeFitsQuorum

Contributor Author

@dancingactor @edoakes Is there anything I should do before merging?

@edoakes edoakes left a comment

Collaborator


LGTM, thanks

@edoakes edoakes merged commit b9f8c9c into ray-project:master Jun 6, 2026
8 checks passed
@OneSizeFitsQuorum OneSizeFitsQuorum deleted the txy/fix-otel-prometheus-label-normalization branch June 8, 2026 01:22
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
…export (ray-project#63744)


Labels

community-contribution (Contributed by the community)
core (Issues that should be addressed in Ray Core)
go (add ONLY when ready to merge, run all tests)
observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling)

3 participants