[serve][llm] Add telemetry for direct streaming feature #63779
Record an LLM_SERVE_DIRECT_STREAMING_ENABLED usage tag when an app is built with RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING. The tag is recorded once at app build time in _build_direct_streaming_llm_deployment, the single chokepoint shared by the OpenAI, data-parallel, and prefill/decode builders, so all direct-streaming serving patterns are covered. Direct streaming is an app-level opt-in rather than a per-model property, so it is recorded directly via record_extra_usage_tag instead of going through the per-model TelemetryAgent. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Code Review
This pull request introduces telemetry tracking for whether LLM direct streaming (engine-native ASGI ingress) is enabled. It adds the LLM_SERVE_DIRECT_STREAMING_ENABLED tag to the usage protobuf and registers it during the application build process. The review feedback suggests catching a broader Exception instead of only ValueError when recording the telemetry tag to ensure that unexpected telemetry errors do not crash the deployment or build process.
record_extra_usage_tag is already best-effort (no-ops before ray init, swallows GCS write errors internally), so the only thing that can escape into this code is TagKey.Value() raising on a not-yet-regenerated usage proto. Keep the catch narrow to that ValueError so genuine bugs in the recording call still surface, and log the benign skip at debug instead of swallowing silently. Responds to review feedback on ray-project#63779. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
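A minimal sketch of the narrow-catch pattern this commit describes. It does not import Ray: `FakeTagKey`, `RegeneratedTagKey`, `record_tag`, and the `recorded` dict are hypothetical stand-ins for the generated proto enum and `record_extra_usage_tag`. The one behavior borrowed from real protobuf is that an enum wrapper's `Value()` raises `ValueError` for an unknown name.

```python
import logging

logger = logging.getLogger(__name__)

TAG_NAME = "LLM_SERVE_DIRECT_STREAMING_ENABLED"


class FakeTagKey:
    """Hypothetical stand-in for a stale (not-yet-regenerated) proto enum."""

    _values = {"LLM_SERVE_NUM_GPUS": 613}

    @classmethod
    def Value(cls, name: str) -> int:
        # Mirrors protobuf enum wrappers: unknown names raise ValueError.
        try:
            return cls._values[name]
        except KeyError:
            raise ValueError(f"Enum has no value defined for name {name!r}")


class RegeneratedTagKey(FakeTagKey):
    """Hypothetical stand-in for the enum after the proto is regenerated."""

    _values = {"LLM_SERVE_NUM_GPUS": 613, TAG_NAME: 623}


recorded = {}  # stand-in for the cluster's usage-tag storage


def record_tag(key: int, value: str) -> None:
    """Hypothetical stand-in for the best-effort record call."""
    recorded[key] = value


def record_direct_streaming_enabled(tag_key=FakeTagKey) -> bool:
    # Only the enum lookup sits inside the try block, so a genuine bug in
    # the recording call below still propagates to the caller.
    try:
        key = tag_key.Value(TAG_NAME)
    except ValueError:
        logger.debug("Usage proto lacks %s; skipping telemetry.", TAG_NAME)
        return False
    record_tag(key, "1")
    return True
```

With the stale enum the helper logs at debug level and returns `False`; with the regenerated enum it records `"1"` under key 623.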
Move the LLM_SERVE_DIRECT_STREAMING_ENABLED record call out of the builder and into LLMServer._start_engine, gated on RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING, next to the existing per-model telemetry push. Usage tags are last-write-wins state snapshots, so build-time vs replica-time report the same value; recording on the replica matches every other usage tag (core Serve and LLM), guarantees GCS is available (build-only paths like serve build would silently no-op), and reflects direct streaming actually running rather than merely configured. DPServer/PDDecodeServer inherit _start_engine, so all three serving patterns stay covered. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
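The placement described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: the class bodies are empty stand-ins, `_start_engine` is shown as a plain synchronous method, `usage_tags` is a hypothetical in-memory substitute for the real recording call, and treating exactly `"1"` as "enabled" is an assumption about how the env var is parsed.

```python
import os

ENV_VAR = "RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING"
TAG_NAME = "LLM_SERVE_DIRECT_STREAMING_ENABLED"

usage_tags = {}  # hypothetical stand-in for the usage-tag recorder


def direct_streaming_enabled() -> bool:
    # Assumption: the flag counts as enabled only when set to "1".
    return os.environ.get(ENV_VAR, "0") == "1"


class LLMServer:
    def _start_engine(self) -> None:
        # Engine startup and the per-model telemetry push would happen here.
        if direct_streaming_enabled():
            usage_tags[TAG_NAME] = "1"


class DPServer(LLMServer):
    """Data-parallel variant; inherits _start_engine unchanged."""


class PDDecodeServer(LLMServer):
    """Prefill/decode variant; inherits _start_engine unchanged."""
```

Because the subclasses do not override `_start_engine`, one gated call in the base class covers all three serving patterns.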
Condense the docstring and inline comment to terse one/two-liners matching the rest of the usage module; the detailed rationale lives in the commit history and PR description. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Drop the try/except and the TelemetryTags indirection; reference the proto TagKey member directly like ServeUsageTag.record, which removes the only failure mode the catch guarded. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Drop the one-line record_direct_streaming_enabled helper and record the tag directly in LLMServer._start_engine, matching core Serve's ServeUsageTag one-liner style. Removes the now-orphaned helper and its unit test. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
The tag is a cluster-wide signal: written per replica on engine start but last-write-wins, so it reports one value per cluster. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
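To illustrate the last-write-wins behavior, here is a toy model in which `UsageTagStore` is a hypothetical in-memory stand-in for the cluster-wide tag storage. Several replicas write the same key, and the store still holds a single value.

```python
class UsageTagStore:
    """Hypothetical key-value store with last-write-wins semantics."""

    def __init__(self) -> None:
        self._tags = {}

    def record(self, key: str, value: str) -> None:
        # A later write to the same key overwrites the earlier one.
        self._tags[key] = value

    def snapshot(self) -> dict:
        return dict(self._tags)


store = UsageTagStore()

# Three replicas each record the tag when their engine starts.
for _replica_id in range(3):
    store.record("LLM_SERVE_DIRECT_STREAMING_ENABLED", "1")

tags = store.snapshot()
```

After all three writes, `tags` contains exactly one entry for the key, which is why the tag reads as a per-cluster signal rather than a per-replica count.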
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f215435.
    LLM_SERVE_NUM_GPUS = 613;
    // Whether LLM direct streaming (engine-native ASGI ingress) is enabled via
    // RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING. "1" when enabled.
    LLM_SERVE_DIRECT_STREAMING_ENABLED = 623;
Proto file modified — review RPC fault-tolerance guide
Low Severity
This PR modifies .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst
Triggered by project rule: Bugbot Rules
…63779) ## Why are these changes needed? Adds a `LLM_SERVE_DIRECT_STREAMING_ENABLED` usage tag so we can track adoption of LLM direct streaming (`RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING`). Recorded inline in `LLMServer._start_engine` (gated on the env var), next to the existing per-model telemetry push. `DPServer`/`PDDecodeServer` inherit `_start_engine`, so the OpenAI, DP, and PD patterns are all covered. Recording replica-side (vs at build time) matches every other usage tag and guarantees GCS is available. ## Checks - [x] Signed off with DCO. --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com>


## Why are these changes needed?
Adds a `LLM_SERVE_DIRECT_STREAMING_ENABLED` usage tag so we can track adoption of LLM direct streaming (`RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING`).
Recorded inline in `LLMServer._start_engine` (gated on the env var), next to the existing per-model telemetry push. `DPServer`/`PDDecodeServer` inherit `_start_engine`, so the OpenAI, DP, and PD patterns are all covered. Recording replica-side (vs at build time) matches every other usage tag and guarantees GCS is available.
## Checks
- [x] Signed off with DCO.