Skip to content

[serve,data][llm] LLM telemetry bugfixes#63782

Merged
eicherseiji merged 6 commits into
ray-project:masterfrom
eicherseiji:seiji/llm-telemetry-overhaul
Jun 5, 2026
Merged

[serve,data][llm] LLM telemetry bugfixes#63782
eicherseiji merged 6 commits into
ray-project:masterfrom
eicherseiji:seiji/llm-telemetry-overhaul

Conversation

@eicherseiji

@eicherseiji eicherseiji commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Why are these changes needed?

An audit of the LLM usage telemetry found the per-model statistics are structurally wrong, independent of any particular deployment. The detached telemetry actor appends one entry to its model list per replica engine-start (and again on every restart / autoscale / redeploy) and is never reset, so each comma-joined tag:

  • grows without bound over a long-lived cluster's lifetime (risking silent truncation at the usage-tag value size limit), and
  • duplicates each model by its replica/restart count, inflating every per-model list and count.

Two values were also wrong by construction: num_replicas was hardcoded to 1 for non-autoscaling deployments (and num_replicas="auto" was dropped entirely, since the stored config keeps the literal string), and the JSON_MODE tags were built from a hardcoded use_json_mode=True for a dimension that has no deployment-time config, making them duplicates of the MODELS / NUM_REPLICAS tags. These are observable failure modes, not hypothetical.

Serve (observability/usage_telemetry/usage.py)

  • Dedup by model_id (last-write-wins dict) instead of an append-only list. Fixes the double-counting across replicas/restarts and the unbounded memory growth on the head-node actor; tags now carry one entry per distinct model. model_id is identity only and is never recorded as a value.
  • Report the configured replica count: fixed num_replicas instead of hardcoded 1, and resolve num_replicas="auto" to its autoscaling config (reported as autoscaling) rather than dropping it.
  • Drop the JSON_MODE tags (proto 602/603 reserved); they carried no signal.
  • Telemetry can no longer break engine start: _multiple_apps never raises, and per-model reporting is wrapped so a failure is logged, not propagated.
  • Atomic agent creation via get_if_exists; removed the shadowed retry args.

Batch (batch/observability/usage_telemetry/usage.py + processors)

  • Same dedup fix (key by telemetry identity) plus a _reset hook.
  • Guarded, atomic agent creation and a non-raising push path so telemetry cannot break processor construction.
  • HTTP processor now passes batch_size; vLLM processor now sources data_parallel_size (both previously reported 0).
  • Fixed the LLM_BATCH_CONCURRENCY proto comment.

Tests

Updated existing expectations and added regression tests: replica/restart dedup, fixed num_replicas, and num_replicas="auto" reported as autoscaling.

Note

This reserves two released usage tags (602/603) and stops recording them. Per the proto header, data-collection changes need sign-off from @pcmoritz / @thomasdesr.

Checks

  • Added regression tests.
  • Signed off with DCO.
The LLM usage telemetry accumulated one entry per replica engine-start (and per
restart) in a detached actor that was never reset, so every comma-joined tag
was inflated by replica count and grew unbounded over a cluster's lifetime.
Several values were also wrong or meaningless.

Serve:
- Key the telemetry agent's models by model_id (last-write-wins) instead of an
  append-only list. Fixes double-counting across replicas/restarts and the
  unbounded memory growth; tags now carry one entry per distinct model.
- Report the configured fixed num_replicas instead of hardcoding 1 for
  non-autoscaling deployments.
- Drop the JSON_MODE tags: use_json_mode was hardcoded True and there is no
  deployment-time JSON/structured-output config, so they duplicated
  MODELS/NUM_REPLICAS. Proto entries 602/603 reserved.
- Make telemetry unable to break engine start: _multiple_apps never raises, and
  per-model reporting is wrapped so a failure is logged, not propagated.
- Use get_if_exists for atomic agent creation; drop the shadowed retry args.

Batch:
- Same dedup fix (key tracking by telemetry identity) plus a reset hook.
- Atomic agent creation and a non-raising push path.
- Pass batch_size from the HTTP processor and source data_parallel_size for the
  vLLM processor (both previously reported 0).
- Fix the LLM_BATCH_CONCURRENCY proto comment.

Adds regression tests for dedup and fixed-replica reporting.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the LLM batch and serve telemetry agents to deduplicate reported models and replicas by keying tracking collections on model/telemetry identity instead of accumulating them in lists. It also improves robustness by wrapping telemetry reporting in try-except blocks to prevent telemetry failures from breaking main execution paths, and updates actor creation to be atomic using get_if_exists=True. The feedback suggests further improving robustness by wrapping the remote actor creation in TelemetryAgent in a try-except block to avoid breaking processor construction, and defensively handling non-integer values (such as 'auto' or None) for num_replicas in Ray Serve to prevent Pydantic validation errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/ray/llm/_internal/serve/observability/usage_telemetry/usage.py Outdated
@eicherseiji eicherseiji added the go add ONLY when ready to merge, run all tests label Jun 1, 2026
@eicherseiji eicherseiji changed the title [serve][llm] Make LLM telemetry trustworthy (serve + batch) Jun 1, 2026
@eicherseiji eicherseiji changed the title [serve][llm] LLM telemetry bugfixes (serve + batch) Jun 1, 2026
… agent init

- Serve: num_replicas="auto" is stored verbatim by LLMConfig (the validator
  discards its resolved autoscaling config), so feeding it to the int field
  dropped that model's telemetry. Resolve "auto" via handle_num_replicas_auto
  and report it as autoscaling. Default fixed num_replicas to 1 on None.
- Batch: wrap telemetry agent actor creation so a failed/uninitialized Ray
  cluster can't break processor construction; no-op the push/update paths when
  the agent handle is unavailable.
- Add a regression test for num_replicas="auto".

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji marked this pull request as ready for review June 1, 2026 23:28
@eicherseiji eicherseiji requested review from a team, pcmoritz and thomasdesr as code owners June 1, 2026 23:28
// output is a per-request feature with no deployment-time toggle, so these
// duplicated MODELS/NUM_REPLICAS and are no longer recorded.
reserved 602, 603;
reserved "LLM_SERVE_JSON_MODE_MODELS", "LLM_SERVE_JSON_MODE_NUM_REPLICAS";

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proto file modified requires RPC fault-tolerance review

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

Fix in Cursor Fix in Web

Triggered by project rule: Bugbot Rules

Reviewed by Cursor Bugbot for commit c507d8d. Configure here.

Comment thread python/ray/llm/_internal/serve/observability/usage_telemetry/usage.py Outdated
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Jun 2, 2026
Comment on lines +73 to +75
# Keyed by model_id so repeated reports from replicas/restarts of the
# same model overwrite rather than accumulate.
self.models: Dict[str, TelemetryModel] = {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe overly-cute, but what if we captured the id(model) when you're storing this instead of the actual model value? That way we're just capturing the memory address of the string instead of its original value? It wouldn't be comparable across anything, but would maybe still be semantically meaningful?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like it! I think the string will be recreated with each remote call/deserialization though, right? Used a hash which I think will be equivalent, but lmk if I missed sth

@thomasdesr thomasdesr Jun 3, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hash'll work!

A note for the future though: hashes are guessable even when they aren't reversible, and they're comparable across unrelated domains (same input, same hash, everywhere).

Breaking those two properties is over-engineering here, since these values literally don't go anywhere. But if we ever need to, I'd fold two things into a keyed hash (HMAC, or BLAKE3's derive-key mode):

  1. A domain separator: a static key that semi-uniquely identifies this particular use, e.g. ray_telemetry_model_id_masking. That puts the values in their own namespace so they "will never collide with anything else at random."
  2. A source of randomness somewhere (per-process, per-pipeline, per-object) that prevents direct comparison even within the domain.
Key the telemetry dedup dict on a sha256 of model_id rather than the
cleartext name, so the (possibly proprietary) model name never reaches
the detached head-node actor while dedup stays deterministic across
replicas/restarts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji self-assigned this Jun 3, 2026
model_id=model.model_id,
# Hash so the cleartext model name (possibly proprietary) never reaches
# the head-node actor; deterministic across replicas/restarts so dedup holds.
model_id_hash=hashlib.sha256(model.model_id.encode("utf-8")).hexdigest(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we okay with not knowing the exact model from our telemetry? do you know what we use our telemetry for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The model_id is high cardinality (user specified) and may contain identifying information, so it's just used as a local dedupe key. Instead we log model arch since it's lower cardinality while retaining some signal about the models used.

Could be interesting to log some more information about the model structure though (e.g. total/active params)

@eicherseiji eicherseiji requested a review from thomasdesr June 3, 2026 16:38

@thomasdesr thomasdesr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamping for telemetry review, defer rest to others :D

The record_tag_func closure fired recorder.record.remote() fire-and-forget,
so the telemetry agent's record() returned before the writes landed. The
agent and the test driver are different callers, so Ray gives no ordering
between those writes and the driver's subsequent telemetry read, causing
intermittent KeyError on the recorded tags. ray.get the write so it lands
before the read.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@jeffreywang88 jeffreywang88 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving nits.

Comment thread python/ray/llm/_internal/batch/observability/usage_telemetry/usage.py Outdated
Comment thread python/ray/llm/_internal/serve/observability/usage_telemetry/usage.py Outdated
Address review nit: hoist HEAD_NODE_RESOURCE_NAME and
PlacementGroupSchedulingStrategy from inline function imports to the top
of batch and serve usage_telemetry modules.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 9484d36. Configure here.

…tity

- serve: report an explicit num_replicas=0 instead of coercing it to 1
  (NonNegativeInt allows 0; distinguish unset from zero).
- batch: add a non-recorded model_id_hash to BatchModelTelemetry so the
  dedup key separates distinct models/endpoints that share an
  architecture and config, instead of collapsing them. Populated from
  model_source (vLLM/SGLang) and target url (HTTP).
- Add regression tests for both.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji merged commit 9a6c58e into ray-project:master Jun 5, 2026
6 checks passed
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jun 30, 2026
## Why are these changes needed?

An audit of the LLM usage telemetry found the per-model statistics are
structurally wrong, independent of any particular deployment. The
detached telemetry actor appends one entry to its model list **per
replica engine-start** (and again on every restart / autoscale /
redeploy) and is never reset, so each comma-joined tag:

- grows **without bound** over a long-lived cluster's lifetime (risking
silent truncation at the usage-tag value size limit), and
- duplicates each model by its replica/restart count, inflating every
per-model list and count.

Two values were also wrong by construction: `num_replicas` was hardcoded
to `1` for non-autoscaling deployments (and `num_replicas="auto"` was
dropped entirely, since the stored config keeps the literal string), and
the JSON_MODE tags were built from a hardcoded `use_json_mode=True` for
a dimension that has no deployment-time config, making them duplicates
of the MODELS / NUM_REPLICAS tags. These are observable failure modes,
not hypothetical.

### Serve (`observability/usage_telemetry/usage.py`)
- **Dedup by `model_id`** (last-write-wins dict) instead of an
append-only list. Fixes the double-counting across replicas/restarts and
the unbounded memory growth on the head-node actor; tags now carry one
entry per distinct model. `model_id` is identity only and is never
recorded as a value.
- **Report the configured replica count**: fixed `num_replicas` instead
of hardcoded `1`, and resolve `num_replicas="auto"` to its autoscaling
config (reported as autoscaling) rather than dropping it.
- **Drop the JSON_MODE tags** (proto 602/603 `reserved`); they carried
no signal.
- **Telemetry can no longer break engine start**: `_multiple_apps` never
raises, and per-model reporting is wrapped so a failure is logged, not
propagated.
- **Atomic agent creation** via `get_if_exists`; removed the shadowed
retry args.

### Batch (`batch/observability/usage_telemetry/usage.py` + processors)
- Same dedup fix (key by telemetry identity) plus a `_reset` hook.
- Guarded, atomic agent creation and a non-raising push path so
telemetry cannot break processor construction.
- HTTP processor now passes `batch_size`; vLLM processor now sources
`data_parallel_size` (both previously reported 0).
- Fixed the `LLM_BATCH_CONCURRENCY` proto comment.

### Tests
Updated existing expectations and added regression tests:
replica/restart dedup, fixed `num_replicas`, and `num_replicas="auto"`
reported as autoscaling.

> [!NOTE]
> This reserves two released usage tags (602/603) and stops recording
them. Per the proto header, data-collection changes need sign-off from
@pcmoritz / @thomasdesr.

## Checks
- [x] Added regression tests.
- [x] Signed off with DCO.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling serve Ray Serve Related Issue

3 participants