[core][observability] Normalize OTel metric labels before Prometheus export#63744
Code Review
This pull request normalizes the label schema for OpenTelemetry metrics and histograms before they are exported. It ensures that all observations and data points in a batch share a consistent set of attribute keys by padding missing keys with empty strings, preventing issues with mixed attribute sets. Unit tests have been added to verify this normalization behavior for histograms, gauges, counters, and sums. There are no review comments, so I have no feedback to provide.
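The padding described in this summary can be sketched in a few lines. This is a minimal illustration that assumes attribute sets are plain `dict[str, str]` values; `normalize_attribute_sets` is a hypothetical helper name and is not taken from the PR's diff.

```python
from typing import Dict, Iterable, List

Attributes = Dict[str, str]


def normalize_attribute_sets(attribute_sets: Iterable[Attributes]) -> List[Attributes]:
    """Pad every attribute dict to the union of keys seen in the batch.

    Hypothetical helper for illustration only; the real change lives in
    Ray's metric recorder and reporter agent.
    """
    attribute_sets = list(attribute_sets)
    # Union of all attribute keys across the batch, in a stable order.
    all_keys = sorted({key for attrs in attribute_sets for key in attrs})
    # Missing keys are filled with the empty string.
    return [{key: attrs.get(key, "") for key in all_keys} for attrs in attribute_sets]


# Two data points for the same metric, one without SessionName.
batch = [
    {"SessionName": "session_1", "Component": "raylet"},
    {"Component": "raylet"},
]
normalized = normalize_attribute_sets(batch)
```

After normalization both dicts carry the keys `Component` and `SessionName`, and the second one has `SessionName` set to `""`. Running the helper again on its own output returns the same list, which is the sense in which a second normalization pass would be idempotent.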
b398012 to e393b87
Hi @OneSizeFitsQuorum, thanks for the contribution! Could you please fix the linter errors? Thanks!
Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
e393b87 to 08806ea
@dancingactor Thanks a lot for reviewing this! I've already fixed it.
The current Failing job: Failure: The exception comes from
Yes, I think it's unrelated too. Maybe you can rebase onto the latest master branch to trigger the CI again. Thanks!
@dancingactor All passed now!
@dancingactor @edoakes Is there anything I should do before merging? |
…export (ray-project#63744)

## Description

This PR normalizes OpenTelemetry metric attribute sets before handing observations to the Prometheus exporter.

Some Ray components can emit the same metric with heterogeneous attribute sets, for example when one data point includes `SessionName` and another data point for the same metric does not. With older `opentelemetry-exporter-prometheus` versions used by Ray's default compiled dependencies, metrics can reach Prometheus export with mixed label key sets. This can produce misaligned Prometheus label values, such as `dataset="core_worker"` or `operator="<node ip>"`, making Ray Data dashboards misleading.

This change makes Ray enforce a stable label schema at the export boundary:

- For observable gauge, counter, and sum callbacks, collect the union of attribute keys for each metric and fill missing values with `""`.
- For reconstructed histogram batches in the dashboard reporter, normalize all batch data points to the union of tag keys before recording them.
- Add regression coverage for mixed attribute sets in observable metric callbacks and histogram export.

This does not depend on upgrading `opentelemetry-exporter-prometheus`. It is also compatible with newer exporter versions that perform similar normalization internally; in that case Ray provides already-normalized observations and the exporter-side normalization is effectively idempotent.

## Related issues

Fixes ray-project#63499.

## Additional information

This PR intentionally keeps the fix in the Python export path. The issue is caused by heterogeneous label key sets, not by nondeterministic tag ordering, so this avoids changing the metric record path or upgrading OpenTelemetry dependencies.

Tests:

```bash
python -m py_compile \
  python/ray/_private/telemetry/open_telemetry_metric_recorder.py \
  python/ray/dashboard/modules/reporter/reporter_agent.py \
  python/ray/dashboard/modules/reporter/tests/test_reporter.py \
  python/ray/tests/test_open_telemetry_metric_recorder.py
git diff --check
```

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
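To make the misalignment described above concrete, here is a toy model of how mixed label key sets can end up with values under the wrong names. It assumes an exporter that takes label names from one data point and pairs them positionally with each point's values; that is a deliberate simplification for illustration, not the actual code of `opentelemetry-exporter-prometheus`. The function name and the sample label values are hypothetical.

```python
from typing import Dict, List


def toy_positional_export(points: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Toy exporter: label names come from the first point only.

    Each point's values are then matched to those names by position,
    which is only correct when every point has the same key set.
    """
    label_names = sorted(points[0])
    exported = []
    for attrs in points:
        values = [attrs[key] for key in sorted(attrs)]
        exported.append(dict(zip(label_names, values)))
    return exported


# Mixed key sets: the second point's values land under the wrong names.
mixed = [
    {"dataset": "ds_1", "operator": "Map"},
    {"Component": "core_worker", "NodeAddress": "10.0.0.1"},
]
misaligned = toy_positional_export(mixed)

# The same points padded to the union of keys, with "" for missing values.
padded = [
    {"Component": "", "NodeAddress": "", "dataset": "ds_1", "operator": "Map"},
    {"Component": "core_worker", "NodeAddress": "10.0.0.1", "dataset": "", "operator": ""},
]
aligned = toy_positional_export(padded)
```

With the mixed input, the second exported series carries `dataset="core_worker"` and `operator="10.0.0.1"`, the same shape of symptom the description reports. With the padded input every value stays under its own name, so the toy exporter returns the points unchanged.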