[serve,data][llm] LLM telemetry bugfixes by eicherseiji · Pull Request #63782 · ray-project/ray

eicherseiji · 2026-06-01T23:11:57Z

Why are these changes needed?

An audit of the LLM usage telemetry found the per-model statistics are structurally wrong, independent of any particular deployment. The detached telemetry actor appends one entry to its model list per replica engine-start (and again on every restart / autoscale / redeploy) and is never reset, so each comma-joined tag:

grows without bound over a long-lived cluster's lifetime (risking silent truncation at the usage-tag value size limit), and
duplicates each model by its replica/restart count, inflating every per-model list and count.

Two values were also wrong by construction: num_replicas was hardcoded to 1 for non-autoscaling deployments (and num_replicas="auto" was dropped entirely, since the stored config keeps the literal string), and the JSON_MODE tags were built from a hardcoded use_json_mode=True for a dimension that has no deployment-time config, making them duplicates of the MODELS / NUM_REPLICAS tags. These are observable failure modes, not hypothetical.

Serve (`observability/usage_telemetry/usage.py`)

Dedup by model_id (last-write-wins dict) instead of an append-only list. Fixes the double-counting across replicas/restarts and the unbounded memory growth on the head-node actor; tags now carry one entry per distinct model. model_id is identity only and is never recorded as a value.
Report the configured replica count: fixed num_replicas instead of hardcoded 1, and resolve num_replicas="auto" to its autoscaling config (reported as autoscaling) rather than dropping it.
Drop the JSON_MODE tags (proto 602/603 reserved); they carried no signal.
Telemetry can no longer break engine start: _multiple_apps never raises, and per-model reporting is wrapped so a failure is logged, not propagated.
Atomic agent creation via get_if_exists; removed the shadowed retry args.

Batch (`batch/observability/usage_telemetry/usage.py` + processors)

Same dedup fix (key by telemetry identity) plus a _reset hook.
Guarded, atomic agent creation and a non-raising push path so telemetry cannot break processor construction.
HTTP processor now passes batch_size; vLLM processor now sources data_parallel_size (both previously reported 0).
Fixed the LLM_BATCH_CONCURRENCY proto comment.

Tests

Updated existing expectations and added regression tests: replica/restart dedup, fixed num_replicas, and num_replicas="auto" reported as autoscaling.

Note

This reserves two released usage tags (602/603) and stops recording them. Per the proto header, data-collection changes need sign-off from @pcmoritz / @thomasdesr.

Checks

Added regression tests.
Signed off with DCO.

The LLM usage telemetry accumulated one entry per replica engine-start (and per restart) in a detached actor that was never reset, so every comma-joined tag was inflated by replica count and grew unbounded over a cluster's lifetime. Several values were also wrong or meaningless. Serve: - Key the telemetry agent's models by model_id (last-write-wins) instead of an append-only list. Fixes double-counting across replicas/restarts and the unbounded memory growth; tags now carry one entry per distinct model. - Report the configured fixed num_replicas instead of hardcoding 1 for non-autoscaling deployments. - Drop the JSON_MODE tags: use_json_mode was hardcoded True and there is no deployment-time JSON/structured-output config, so they duplicated MODELS/NUM_REPLICAS. Proto entries 602/603 reserved. - Make telemetry unable to break engine start: _multiple_apps never raises, and per-model reporting is wrapped so a failure is logged, not propagated. - Use get_if_exists for atomic agent creation; drop the shadowed retry args. Batch: - Same dedup fix (key tracking by telemetry identity) plus a reset hook. - Atomic agent creation and a non-raising push path. - Pass batch_size from the HTTP processor and source data_parallel_size for the vLLM processor (both previously reported 0). - Fix the LLM_BATCH_CONCURRENCY proto comment. Adds regression tests for dedup and fixed-replica reporting. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

gemini-code-assist

Code Review

This pull request refactors the LLM batch and serve telemetry agents to deduplicate reported models and replicas by keying tracking collections on model/telemetry identity instead of accumulating them in lists. It also improves robustness by wrapping telemetry reporting in try-except blocks to prevent telemetry failures from breaking main execution paths, and updates actor creation to be atomic using get_if_exists=True. The feedback suggests further improving robustness by wrapping the remote actor creation in TelemetryAgent in a try-except block to avoid breaking processor construction, and defensively handling non-integer values (such as 'auto' or None) for num_replicas in Ray Serve to prevent Pydantic validation errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

… agent init - Serve: num_replicas="auto" is stored verbatim by LLMConfig (the validator discards its resolved autoscaling config), so feeding it to the int field dropped that model's telemetry. Resolve "auto" via handle_num_replicas_auto and report it as autoscaling. Default fixed num_replicas to 1 on None. - Batch: wrap telemetry agent actor creation so a failed/uninitialized Ray cluster can't break processor construction; no-op the push/update paths when the agent handle is unavailable. - Add a regression test for num_replicas="auto". Signed-off-by: Seiji Eicher <seiji@anyscale.com>

cursor · 2026-06-01T23:45:58Z

+  // output is a per-request feature with no deployment-time toggle, so these
+  // duplicated MODELS/NUM_REPLICAS and are no longer recorded.
+  reserved 602, 603;
+  reserved "LLM_SERVE_JSON_MODE_MODELS", "LLM_SERVE_JSON_MODE_NUM_REPLICAS";


Proto file modified requires RPC fault-tolerance review

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

^{Triggered by project rule: Bugbot Rules}

^{Reviewed by Cursor Bugbot for commit c507d8d. Configure here.}

thomasdesr · 2026-06-03T00:50:17Z

+        # Keyed by model_id so repeated reports from replicas/restarts of the
+        # same model overwrite rather than accumulate.
+        self.models: Dict[str, TelemetryModel] = {}


Maybe overly-cute, but what if we captured the id(model) when you're storing this instead of the actual model value? That way we're just capturing the memory address of the string instead of its original value? It wouldn't be comparable across anything, but would maybe still be semantically meaningful?

Like it! I think the string will be recreated with each remote call/deserialization though, right? Used a hash which I think will be equivalent, but lmk if I missed sth

Hash'll work!

A note for the future though: hashes are guessable even when they aren't reversible, and they're comparable across unrelated domains (same input, same hash, everywhere).

Breaking those two properties is over-engineering here, since these values literally don't go anywhere. But if we ever need to, I'd fold two things into a keyed hash (HMAC, or BLAKE3's derive-key mode):

A domain separator: a static key that semi-uniquely identifies this particular use, e.g. ray_telemetry_model_id_masking. That puts the values in their own namespace so they "will never collide with anything else at random."

A source of randomness somewhere (per-process, per-pipeline, per-object) that prevents direct comparison even within the domain.

Key the telemetry dedup dict on a sha256 of model_id rather than the cleartext name, so the (possibly proprietary) model name never reaches the detached head-node actor while dedup stays deterministic across replicas/restarts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Seiji Eicher <seiji@anyscale.com>

jeffreywang88 · 2026-06-03T05:46:46Z

-        model_id=model.model_id,
+        # Hash so the cleartext model name (possibly proprietary) never reaches
+        # the head-node actor; deterministic across replicas/restarts so dedup holds.
+        model_id_hash=hashlib.sha256(model.model_id.encode("utf-8")).hexdigest(),


are we okay with not knowing the exact model from our telemetry? do you know what we use our telemetry for?

The model_id is high cardinality (user specified) and may contain identifying information, so it's just used as a local dedupe key. Instead we log model arch since it's lower cardinality while retaining some signal about the models used.

Could be interesting to log some more information about the model structure though (e.g. total/active params)

thomasdesr

Stamping for telemetry review, defer rest to others :D

The record_tag_func closure fired recorder.record.remote() fire-and-forget, so the telemetry agent's record() returned before the writes landed. The agent and the test driver are different callers, so Ray gives no ordering between those writes and the driver's subsequent telemetry read, causing intermittent KeyError on the recorded tags. ray.get the write so it lands before the read. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

jeffreywang88

Just leaving nits.

Address review nit: hoist HEAD_NODE_RESOURCE_NAME and PlacementGroupSchedulingStrategy from inline function imports to the top of batch and serve usage_telemetry modules. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{Reviewed by Cursor Bugbot for commit 9484d36. Configure here.}

…tity - serve: report an explicit num_replicas=0 instead of coercing it to 1 (NonNegativeInt allows 0; distinguish unset from zero). - batch: add a non-recorded model_id_hash to BatchModelTelemetry so the dedup key separates distinct models/endpoints that share an architecture and config, instead of collapsing them. Populated from model_source (vLLM/SGLang) and target url (HTTP). - Add regression tests for both. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@pcmoritz

## Why are these changes needed? An audit of the LLM usage telemetry found the per-model statistics are structurally wrong, independent of any particular deployment. The detached telemetry actor appends one entry to its model list **per replica engine-start** (and again on every restart / autoscale / redeploy) and is never reset, so each comma-joined tag: - grows **without bound** over a long-lived cluster's lifetime (risking silent truncation at the usage-tag value size limit), and - duplicates each model by its replica/restart count, inflating every per-model list and count. Two values were also wrong by construction: `num_replicas` was hardcoded to `1` for non-autoscaling deployments (and `num_replicas="auto"` was dropped entirely, since the stored config keeps the literal string), and the JSON_MODE tags were built from a hardcoded `use_json_mode=True` for a dimension that has no deployment-time config, making them duplicates of the MODELS / NUM_REPLICAS tags. These are observable failure modes, not hypothetical. ### Serve (`observability/usage_telemetry/usage.py`) - **Dedup by `model_id`** (last-write-wins dict) instead of an append-only list. Fixes the double-counting across replicas/restarts and the unbounded memory growth on the head-node actor; tags now carry one entry per distinct model. `model_id` is identity only and is never recorded as a value. - **Report the configured replica count**: fixed `num_replicas` instead of hardcoded `1`, and resolve `num_replicas="auto"` to its autoscaling config (reported as autoscaling) rather than dropping it. - **Drop the JSON_MODE tags** (proto 602/603 `reserved`); they carried no signal. - **Telemetry can no longer break engine start**: `_multiple_apps` never raises, and per-model reporting is wrapped so a failure is logged, not propagated. - **Atomic agent creation** via `get_if_exists`; removed the shadowed retry args. ### Batch (`batch/observability/usage_telemetry/usage.py` + processors) - Same dedup fix (key by telemetry identity) plus a `_reset` hook. - Guarded, atomic agent creation and a non-raising push path so telemetry cannot break processor construction. - HTTP processor now passes `batch_size`; vLLM processor now sources `data_parallel_size` (both previously reported 0). - Fixed the `LLM_BATCH_CONCURRENCY` proto comment. ### Tests Updated existing expectations and added regression tests: replica/restart dedup, fixed `num_replicas`, and `num_replicas="auto"` reported as autoscaling. > [!NOTE] > This reserves two released usage tags (602/603) and stops recording them. Per the proto header, data-collection changes need sign-off from @pcmoritz / @thomasdesr. ## Checks - [x] Added regression tests. - [x] Signed off with DCO. --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/batch/observability/usage_telemetry/usage.py

Comment thread python/ray/llm/_internal/serve/observability/usage_telemetry/usage.py Outdated

eicherseiji added the go add ONLY when ready to merge, run all tests label Jun 1, 2026

eicherseiji changed the title ~~[serve][llm] Make LLM telemetry trustworthy (serve + batch)~~ Jun 1, 2026

eicherseiji changed the title ~~[serve][llm] LLM telemetry bugfixes (serve + batch)~~ Jun 1, 2026

eicherseiji marked this pull request as ready for review June 1, 2026 23:28

eicherseiji requested review from a team, pcmoritz and thomasdesr as code owners June 1, 2026 23:28

cursor Bot reviewed Jun 1, 2026

View reviewed changes

ray-gardener Bot added serve Ray Serve Related Issue observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Jun 2, 2026

thomasdesr reviewed Jun 3, 2026

View reviewed changes

eicherseiji self-assigned this Jun 3, 2026

jeffreywang88 reviewed Jun 3, 2026

View reviewed changes

eicherseiji requested a review from thomasdesr June 3, 2026 16:38

thomasdesr approved these changes Jun 3, 2026

View reviewed changes

jeffreywang88 approved these changes Jun 4, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/batch/observability/usage_telemetry/usage.py Outdated

Comment thread python/ray/llm/_internal/serve/observability/usage_telemetry/usage.py Outdated

[serve,data][llm] Move telemetry agent imports to module top

9484d36

Address review nit: hoist HEAD_NODE_RESOURCE_NAME and PlacementGroupSchedulingStrategy from inline function imports to the top of batch and serve usage_telemetry modules. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

cursor Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread python/ray/llm/_internal/batch/observability/usage_telemetry/usage.py

eicherseiji merged commit 9a6c58e into ray-project:master Jun 5, 2026
6 checks passed

eicherseiji mentioned this pull request Jun 9, 2026

[serve][llm] Expand LLM serve telemetry: deploy outcome, serving pattern, engine config #63978

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[serve,data][llm] LLM telemetry bugfixes#63782

[serve,data][llm] LLM telemetry bugfixes#63782
eicherseiji merged 6 commits into
ray-project:masterfrom
eicherseiji:seiji/llm-telemetry-overhaul

eicherseiji commented Jun 1, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

cursor Bot Jun 1, 2026

Uh oh!

thomasdesr Jun 3, 2026

eicherseiji Jun 3, 2026

thomasdesr Jun 3, 2026 •

edited

Loading

jeffreywang88 Jun 3, 2026

eicherseiji Jun 3, 2026

thomasdesr left a comment

jeffreywang88 left a comment

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Labels

3 participants

Uh oh!

Conversation

eicherseiji commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Serve (observability/usage_telemetry/usage.py)

Batch (batch/observability/usage_telemetry/usage.py + processors)

Tests

Checks

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

cursor Bot Jun 1, 2026

Choose a reason for hiding this comment

Proto file modified requires RPC fault-tolerance review

Uh oh!

thomasdesr Jun 3, 2026

Choose a reason for hiding this comment

eicherseiji Jun 3, 2026

Choose a reason for hiding this comment

thomasdesr Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

jeffreywang88 Jun 3, 2026

Choose a reason for hiding this comment

eicherseiji Jun 3, 2026

Choose a reason for hiding this comment

thomasdesr left a comment

Choose a reason for hiding this comment

jeffreywang88 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Labels

3 participants

eicherseiji commented Jun 1, 2026 •

edited

Loading

Serve (`observability/usage_telemetry/usage.py`)

Batch (`batch/observability/usage_telemetry/usage.py` + processors)

thomasdesr Jun 3, 2026 •

edited

Loading