[serve,data][llm] LLM telemetry bugfixes#63782
Conversation
The LLM usage telemetry accumulated one entry per replica engine-start (and per restart) in a detached actor that was never reset, so every comma-joined tag was inflated by replica count and grew unbounded over a cluster's lifetime. Several values were also wrong or meaningless. Serve: - Key the telemetry agent's models by model_id (last-write-wins) instead of an append-only list. Fixes double-counting across replicas/restarts and the unbounded memory growth; tags now carry one entry per distinct model. - Report the configured fixed num_replicas instead of hardcoding 1 for non-autoscaling deployments. - Drop the JSON_MODE tags: use_json_mode was hardcoded True and there is no deployment-time JSON/structured-output config, so they duplicated MODELS/NUM_REPLICAS. Proto entries 602/603 reserved. - Make telemetry unable to break engine start: _multiple_apps never raises, and per-model reporting is wrapped so a failure is logged, not propagated. - Use get_if_exists for atomic agent creation; drop the shadowed retry args. Batch: - Same dedup fix (key tracking by telemetry identity) plus a reset hook. - Atomic agent creation and a non-raising push path. - Pass batch_size from the HTTP processor and source data_parallel_size for the vLLM processor (both previously reported 0). - Fix the LLM_BATCH_CONCURRENCY proto comment. Adds regression tests for dedup and fixed-replica reporting. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the LLM batch and serve telemetry agents to deduplicate reported models and replicas by keying tracking collections on model/telemetry identity instead of accumulating them in lists. It also improves robustness by wrapping telemetry reporting in try-except blocks to prevent telemetry failures from breaking main execution paths, and updates actor creation to be atomic using get_if_exists=True. The feedback suggests further improving robustness by wrapping the remote actor creation in TelemetryAgent in a try-except block to avoid breaking processor construction, and defensively handling non-integer values (such as 'auto' or None) for num_replicas in Ray Serve to prevent Pydantic validation errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
… agent init - Serve: num_replicas="auto" is stored verbatim by LLMConfig (the validator discards its resolved autoscaling config), so feeding it to the int field dropped that model's telemetry. Resolve "auto" via handle_num_replicas_auto and report it as autoscaling. Default fixed num_replicas to 1 on None. - Batch: wrap telemetry agent actor creation so a failed/uninitialized Ray cluster can't break processor construction; no-op the push/update paths when the agent handle is unavailable. - Add a regression test for num_replicas="auto". Signed-off-by: Seiji Eicher <seiji@anyscale.com>
| // output is a per-request feature with no deployment-time toggle, so these | ||
| // duplicated MODELS/NUM_REPLICAS and are no longer recorded. | ||
| reserved 602, 603; | ||
| reserved "LLM_SERVE_JSON_MODE_MODELS", "LLM_SERVE_JSON_MODE_NUM_REPLICAS"; |
There was a problem hiding this comment.
Proto file modified requires RPC fault-tolerance review
Low Severity
.proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst
Triggered by project rule: Bugbot Rules
Reviewed by Cursor Bugbot for commit c507d8d. Configure here.
| # Keyed by model_id so repeated reports from replicas/restarts of the | ||
| # same model overwrite rather than accumulate. | ||
| self.models: Dict[str, TelemetryModel] = {} |
There was a problem hiding this comment.
Maybe overly-cute, but what if we captured the id(model) when you're storing this instead of the actual model value? That way we're just capturing the memory address of the string instead of its original value? It wouldn't be comparable across anything, but would maybe still be semantically meaningful?
There was a problem hiding this comment.
Like it! I think the string will be recreated with each remote call/deserialization though, right? Used a hash which I think will be equivalent, but lmk if I missed sth
There was a problem hiding this comment.
Hash'll work!
A note for the future though: hashes are guessable even when they aren't reversible, and they're comparable across unrelated domains (same input, same hash, everywhere).
Breaking those two properties is over-engineering here, since these values literally don't go anywhere. But if we ever need to, I'd fold two things into a keyed hash (HMAC, or BLAKE3's derive-key mode):
- A domain separator: a static key that semi-uniquely identifies this particular use, e.g.
ray_telemetry_model_id_masking. That puts the values in their own namespace so they "will never collide with anything else at random." - A source of randomness somewhere (per-process, per-pipeline, per-object) that prevents direct comparison even within the domain.
Key the telemetry dedup dict on a sha256 of model_id rather than the cleartext name, so the (possibly proprietary) model name never reaches the detached head-node actor while dedup stays deterministic across replicas/restarts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Seiji Eicher <seiji@anyscale.com>
| model_id=model.model_id, | ||
| # Hash so the cleartext model name (possibly proprietary) never reaches | ||
| # the head-node actor; deterministic across replicas/restarts so dedup holds. | ||
| model_id_hash=hashlib.sha256(model.model_id.encode("utf-8")).hexdigest(), |
There was a problem hiding this comment.
are we okay with not knowing the exact model from our telemetry? do you know what we use our telemetry for?
There was a problem hiding this comment.
The model_id is high cardinality (user specified) and may contain identifying information, so it's just used as a local dedupe key. Instead we log model arch since it's lower cardinality while retaining some signal about the models used.
Could be interesting to log some more information about the model structure though (e.g. total/active params)
thomasdesr
left a comment
There was a problem hiding this comment.
Stamping for telemetry review, defer rest to others :D
The record_tag_func closure fired recorder.record.remote() fire-and-forget, so the telemetry agent's record() returned before the writes landed. The agent and the test driver are different callers, so Ray gives no ordering between those writes and the driver's subsequent telemetry read, causing intermittent KeyError on the recorded tags. ray.get the write so it lands before the read. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
jeffreywang88
left a comment
There was a problem hiding this comment.
Just leaving nits.
Address review nit: hoist HEAD_NODE_RESOURCE_NAME and PlacementGroupSchedulingStrategy from inline function imports to the top of batch and serve usage_telemetry modules. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 9484d36. Configure here.
…tity - serve: report an explicit num_replicas=0 instead of coercing it to 1 (NonNegativeInt allows 0; distinguish unset from zero). - batch: add a non-recorded model_id_hash to BatchModelTelemetry so the dedup key separates distinct models/endpoints that share an architecture and config, instead of collapsing them. Populated from model_source (vLLM/SGLang) and target url (HTTP). - Add regression tests for both. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Why are these changes needed? An audit of the LLM usage telemetry found the per-model statistics are structurally wrong, independent of any particular deployment. The detached telemetry actor appends one entry to its model list **per replica engine-start** (and again on every restart / autoscale / redeploy) and is never reset, so each comma-joined tag: - grows **without bound** over a long-lived cluster's lifetime (risking silent truncation at the usage-tag value size limit), and - duplicates each model by its replica/restart count, inflating every per-model list and count. Two values were also wrong by construction: `num_replicas` was hardcoded to `1` for non-autoscaling deployments (and `num_replicas="auto"` was dropped entirely, since the stored config keeps the literal string), and the JSON_MODE tags were built from a hardcoded `use_json_mode=True` for a dimension that has no deployment-time config, making them duplicates of the MODELS / NUM_REPLICAS tags. These are observable failure modes, not hypothetical. ### Serve (`observability/usage_telemetry/usage.py`) - **Dedup by `model_id`** (last-write-wins dict) instead of an append-only list. Fixes the double-counting across replicas/restarts and the unbounded memory growth on the head-node actor; tags now carry one entry per distinct model. `model_id` is identity only and is never recorded as a value. - **Report the configured replica count**: fixed `num_replicas` instead of hardcoded `1`, and resolve `num_replicas="auto"` to its autoscaling config (reported as autoscaling) rather than dropping it. - **Drop the JSON_MODE tags** (proto 602/603 `reserved`); they carried no signal. - **Telemetry can no longer break engine start**: `_multiple_apps` never raises, and per-model reporting is wrapped so a failure is logged, not propagated. - **Atomic agent creation** via `get_if_exists`; removed the shadowed retry args. ### Batch (`batch/observability/usage_telemetry/usage.py` + processors) - Same dedup fix (key by telemetry identity) plus a `_reset` hook. - Guarded, atomic agent creation and a non-raising push path so telemetry cannot break processor construction. - HTTP processor now passes `batch_size`; vLLM processor now sources `data_parallel_size` (both previously reported 0). - Fixed the `LLM_BATCH_CONCURRENCY` proto comment. ### Tests Updated existing expectations and added regression tests: replica/restart dedup, fixed `num_replicas`, and `num_replicas="auto"` reported as autoscaling. > [!NOTE] > This reserves two released usage tags (602/603) and stops recording them. Per the proto header, data-collection changes need sign-off from @pcmoritz / @thomasdesr. ## Checks - [x] Added regression tests. - [x] Signed off with DCO. --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Why are these changes needed?
An audit of the LLM usage telemetry found the per-model statistics are structurally wrong, independent of any particular deployment. The detached telemetry actor appends one entry to its model list per replica engine-start (and again on every restart / autoscale / redeploy) and is never reset, so each comma-joined tag:
Two values were also wrong by construction:
num_replicaswas hardcoded to1for non-autoscaling deployments (andnum_replicas="auto"was dropped entirely, since the stored config keeps the literal string), and the JSON_MODE tags were built from a hardcodeduse_json_mode=Truefor a dimension that has no deployment-time config, making them duplicates of the MODELS / NUM_REPLICAS tags. These are observable failure modes, not hypothetical.Serve (
observability/usage_telemetry/usage.py)model_id(last-write-wins dict) instead of an append-only list. Fixes the double-counting across replicas/restarts and the unbounded memory growth on the head-node actor; tags now carry one entry per distinct model.model_idis identity only and is never recorded as a value.num_replicasinstead of hardcoded1, and resolvenum_replicas="auto"to its autoscaling config (reported as autoscaling) rather than dropping it.reserved); they carried no signal._multiple_appsnever raises, and per-model reporting is wrapped so a failure is logged, not propagated.get_if_exists; removed the shadowed retry args.Batch (
batch/observability/usage_telemetry/usage.py+ processors)_resethook.batch_size; vLLM processor now sourcesdata_parallel_size(both previously reported 0).LLM_BATCH_CONCURRENCYproto comment.Tests
Updated existing expectations and added regression tests: replica/restart dedup, fixed
num_replicas, andnum_replicas="auto"reported as autoscaling.Note
This reserves two released usage tags (602/603) and stops recording them. Per the proto header, data-collection changes need sign-off from @pcmoritz / @thomasdesr.
Checks