[serve][llm] Add telemetry for direct streaming feature by eicherseiji · Pull Request #63779 · ray-project/ray

eicherseiji · 2026-06-01T21:43:06Z

Why are these changes needed?

Adds an LLM_SERVE_DIRECT_STREAMING_ENABLED usage tag so we can track adoption of LLM direct streaming (RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING).

Recorded inline in LLMServer._start_engine (gated on the env var), next to the existing per-model telemetry push. DPServer/PDDecodeServer inherit _start_engine, so the OpenAI, DP, and PD patterns are all covered. Recording replica-side (vs at build time) matches every other usage tag and guarantees GCS is available.
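A minimal sketch of this recording site, under stated assumptions: `fake_record_usage_tag` and the `RECORDED_TAGS` dict are hypothetical stand-ins for Ray's `record_extra_usage_tag` and the GCS, the class bodies are stripped down to the telemetry logic, both subclasses are assumed to derive directly from `LLMServer`, and treating the value "1" as "enabled" is an assumption about the flag parsing. It shows why one gated call in the base class's `_start_engine` covers the DP and PD servers through inheritance.

```python
import os
from typing import Dict

# Hypothetical stand-in for Ray's usage-tag recorder and the GCS table it
# writes to (the real call is record_extra_usage_tag with a TagKey member).
RECORDED_TAGS: Dict[str, str] = {}


def fake_record_usage_tag(key: str, value: str) -> None:
    # Re-recording the same key simply overwrites the previous value.
    RECORDED_TAGS[key] = value


ENV_VAR = "RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING"


def direct_streaming_enabled() -> bool:
    # Assumption: "1" means enabled; the real flag parsing may differ.
    return os.environ.get(ENV_VAR, "0") == "1"


class LLMServer:
    def _start_engine(self) -> None:
        # Engine startup and the existing per-model telemetry push are elided.
        if direct_streaming_enabled():
            fake_record_usage_tag("LLM_SERVE_DIRECT_STREAMING_ENABLED", "1")


# Simplified hierarchy: both subclasses reuse LLMServer._start_engine as-is,
# so neither needs its own telemetry call.
class DPServer(LLMServer):
    pass


class PDDecodeServer(LLMServer):
    pass
```

With the flag set, starting any of the three server types writes the same single tag; with it unset, `_start_engine` records nothing.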

Checks

Signed off with DCO.

Record an LLM_SERVE_DIRECT_STREAMING_ENABLED usage tag when an app is built with RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING. The tag is recorded once at app build time in _build_direct_streaming_llm_deployment, the single chokepoint shared by the OpenAI, data-parallel, and prefill/decode builders, so all direct-streaming serving patterns are covered. Direct streaming is an app-level opt-in rather than a per-model property, so it is recorded directly via record_extra_usage_tag instead of going through the per-model TelemetryAgent.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
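The build-time design in this first commit can be sketched as a shared helper that every builder calls. Only the name `_build_direct_streaming_llm_deployment` comes from the commit message; `build_openai_app`, `build_dp_app`, `build_pd_app`, and the `RECORD_CALLS` list are hypothetical, and the builders return placeholder dicts rather than Serve applications. Because all three builders funnel through the one helper, a single record call there fires once per app build for every pattern. Later commits in this PR moved the call to the replica side.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the usage-tag store: one entry per record call.
RECORD_CALLS: List[str] = []


def _build_direct_streaming_llm_deployment(llm_config: Dict[str, Any]) -> Dict[str, Any]:
    # Single chokepoint: every direct-streaming builder ends up here,
    # so recording here covers all serving patterns.
    RECORD_CALLS.append("LLM_SERVE_DIRECT_STREAMING_ENABLED=1")
    return {"deployment": "direct_streaming", "config": llm_config}


# Hypothetical builders for the OpenAI, data-parallel, and prefill/decode patterns.
def build_openai_app(llm_config: Dict[str, Any]) -> Dict[str, Any]:
    return _build_direct_streaming_llm_deployment(llm_config)


def build_dp_app(llm_config: Dict[str, Any]) -> Dict[str, Any]:
    return _build_direct_streaming_llm_deployment({**llm_config, "pattern": "dp"})


def build_pd_app(llm_config: Dict[str, Any]) -> Dict[str, Any]:
    return _build_direct_streaming_llm_deployment({**llm_config, "pattern": "pd"})
```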

gemini-code-assist

Code Review

This pull request introduces telemetry tracking for whether LLM direct streaming (engine-native ASGI ingress) is enabled. It adds the LLM_SERVE_DIRECT_STREAMING_ENABLED tag to the usage protobuf and registers it during the application build process. The review feedback suggests catching a broader Exception instead of only ValueError when recording the telemetry tag to ensure that unexpected telemetry errors do not crash the deployment or build process.


record_extra_usage_tag is already best-effort (no-ops before ray init, swallows GCS write errors internally), so the only thing that can escape into this code is TagKey.Value() raising on a not-yet-regenerated usage proto. Keep the catch narrow to that ValueError so genuine bugs in the recording call still surface, and log the benign skip at debug instead of swallowing silently. Responds to review feedback on ray-project#63779.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Move the LLM_SERVE_DIRECT_STREAMING_ENABLED record call out of the builder and into LLMServer._start_engine, gated on RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING, next to the existing per-model telemetry push. Usage tags are last-write-wins state snapshots, so build-time vs replica-time report the same value; recording on the replica matches every other usage tag (core Serve and LLM), guarantees GCS is available (build-only paths like serve build would silently no-op), and reflects direct streaming actually running rather than merely configured. DPServer/PDDecodeServer inherit _start_engine, so all three serving patterns stay covered.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
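The last-write-wins reasoning can be illustrated with a toy store standing in for the GCS. `ToyUsageStore` is hypothetical; it mirrors only the two behaviors these commit messages rely on: recording is a no-op without a GCS connection (as on a build-only path such as `serve build`), and each write to a key overwrites the previous value. However many replicas start, the store ends up holding one value for the tag.

```python
from typing import Dict, Optional


class ToyUsageStore:
    """Hypothetical stand-in for the GCS-backed usage-tag table."""

    def __init__(self) -> None:
        self.connected = False
        self._tags: Dict[str, str] = {}

    def record(self, key: str, value: str) -> None:
        # Best-effort: silently drop the write when there is no connection.
        if not self.connected:
            return
        self._tags[key] = value  # last write wins

    def get(self, key: str) -> Optional[str]:
        return self._tags.get(key)


KEY = "LLM_SERVE_DIRECT_STREAMING_ENABLED"

store = ToyUsageStore()
store.record(KEY, "1")  # build-only path: not connected, so nothing is stored
store.connected = True
for _replica in range(4):  # four replicas start their engines
    store.record(KEY, "1")  # same key each time: still one stored value
```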

Condense the docstring and inline comment to terse one/two-liners matching the rest of the usage module; the detailed rationale lives in the commit history and PR description.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Drop the try/except and the TelemetryTags indirection; reference the proto TagKey member directly like ServeUsageTag.record, which removes the only failure mode the catch guarded.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Drop the one-line record_direct_streaming_enabled helper and record the tag directly in LLMServer._start_engine, matching core Serve's ServeUsageTag one-liner style. Removes the now-orphaned helper and its unit test.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

The tag is a cluster-wide signal: written per replica on engine start but last-write-wins, so it reports one value per cluster.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f215435.

cursor · 2026-06-01T22:52:58Z

  LLM_SERVE_NUM_GPUS = 613;
+  // Whether LLM direct streaming (engine-native ASGI ingress) is enabled via
+  // RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING. "1" when enabled.
+  LLM_SERVE_DIRECT_STREAMING_ENABLED = 623;


Proto file modified — review RPC fault-tolerance guide

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

Triggered by project rule: Bugbot Rules



gemini-code-assist Bot reviewed Jun 1, 2026


Comment thread python/ray/llm/_internal/serve/observability/usage_telemetry/usage.py Outdated

eicherseiji added 4 commits June 1, 2026 22:15

[serve][llm] Simplify direct streaming telemetry helper

393ee6d


eicherseiji added the go label (add ONLY when ready to merge, run all tests) Jun 1, 2026

eicherseiji added 2 commits June 1, 2026 22:47

[serve][llm] Clarify direct streaming telemetry comment

f215435


eicherseiji marked this pull request as ready for review June 1, 2026 22:50

eicherseiji requested review from a team, pcmoritz and thomasdesr as code owners June 1, 2026 22:50

cursor Bot reviewed Jun 1, 2026


jeffreywang88 approved these changes Jun 1, 2026


ray-gardener Bot added the serve (Ray Serve Related Issue) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Jun 2, 2026

thomasdesr approved these changes Jun 3, 2026


eicherseiji merged commit 66ac79d into ray-project:master Jun 3, 2026
8 checks passed

eicherseiji mentioned this pull request Jun 5, 2026

[doc][serve][llm] Add direct streaming user guide #63891 (Closed, 6 tasks)

eicherseiji merged 7 commits into ray-project:master from eicherseiji:seiji/llm-direct-streaming-telemetry

eicherseiji commented Jun 1, 2026 •

edited

Loading

gemini-code-assist Bot left a comment

Uh oh!

cursor Bot left a comment

cursor Bot Jun 1, 2026

Uh oh!

Labels

3 participants

Uh oh!

Conversation

eicherseiji commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Checks

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

cursor Bot Jun 1, 2026

Choose a reason for hiding this comment

Proto file modified — review RPC fault-tolerance guide

Uh oh!

Labels

3 participants

eicherseiji commented Jun 1, 2026 •

edited

Loading