Skip to content

[Serve] Add tracing support in Ray Serve#61230

Merged
abrarsheikh merged 3 commits into
masterfrom
abrar-tracking-serve
Feb 25, 2026
Merged

[Serve] Add tracing support in Ray Serve#61230
abrarsheikh merged 3 commits into
masterfrom
abrar-tracking-serve

Conversation

@abrarsheikh

Copy link
Copy Markdown
Contributor
  • Adds distributed tracing to Ray Serve using OpenTelemetry, enabling end-to-end visibility across proxy, router, replica, and user application spans.
  • Tracing is opt-in via the RAY_SERVE_TRACING_EXPORTER_IMPORT_PATH environment variable (disabled by default).
  • Exposes ray.serve.get_trace_context() as a public API so users can create child spans linked to the Serve request trace from within their deployments.

Details

Core tracing infrastructure (ray/serve/_private/tracing_utils.py):

  • TraceContextManager and BatchTraceContextManager for span lifecycle management
  • tracing_decorator_factory for instrumenting functions with spans
  • Trace context propagation helpers (create_propagated_context, extract_propagated_context)
  • Configurable sampling via RAY_SERVE_TRACING_SAMPLING_RATIO and pluggable exporters

Instrumented components:

  • Proxy (proxy.py, proxy_request_response.py): Extracts traceparent/tracestate from incoming HTTP headers and gRPC metadata; creates proxy-level spans with HTTP/RPC semantic attributes
  • Router (router.py): Creates routing spans and propagates context to replicas
  • Replica (replica.py): Creates replica-level spans, records error attributes and status codes
  • Batching (batching.py): Links batched execution spans to the first request's trace context

Public API (ray.serve.get_trace_context()):

from opentelemetry import trace
from ray import serve

tracer = trace.get_tracer(__name__)

@serve.deployment
class MyDeployment:
    def __call__(self, request):
        with tracer.start_as_current_span("my_span", context=serve.get_trace_context()):
            ...

Test plan

  • Added test_tracing_utils.py with unit tests for span lifecycle, status setting, multithreaded trace stacks, and custom exporters
  • E2E tests covering HTTP, streaming, and gRPC protocols validating span parent-child associations and semantic attributes
  • E2E error path tests verifying error status propagation across all protocols
  • Batching trace context test verifying spans attach to the correct request's trace
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner February 21, 2026 18:17

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces distributed tracing support to Ray Serve using OpenTelemetry. This is a significant enhancement for observability. However, the current implementation has several critical issues related to OpenTelemetry context management and internal state consistency. Specifically, the TraceContextManager fails to detach OpenTelemetry contexts, which will lead to context leakage across requests. Additionally, the internal TRACE_STACK uses a mutable list in a ContextVar, which is not thread-safe for inherited contexts and can cause stack corruption. There are also logic errors in how non-sampled spans are handled, potentially leading to parent spans being incorrectly popped from the stack. Addressing these issues is essential for the stability and correctness of the tracing infrastructure.

Comment on lines +101 to +102
if not self.span.get_span_context().trace_flags.sampled:
return self

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Returning early when a span is not sampled causes a mismatch between __enter__ and __exit__. Since self.span is already started at line 96, __exit__ will still execute and call _pop_trace_stack(). Because _append_trace_stack was skipped here, _pop_trace_stack will incorrectly pop the parent span (or whatever is currently on top of the stack), leading to stack corruption for all subsequent spans in this task.

if not self.span.get_span_context().trace_flags.sampled:
return self
new_ctx = set_span_in_context(self.span)
set_trace_context(new_ctx)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The OpenTelemetry context attached via set_trace_context (which calls opentelemetry.context.attach) must be detached in __exit__ using the returned token. Failing to do so causes the context to leak, meaning subsequent requests handled by the same worker thread/task might incorrectly inherit this request's trace context as their parent.

Suggested change
set_trace_context(new_ctx)
self._token = set_trace_context(new_ctx)
Comment on lines +117 to +118
self.span.end()
_pop_trace_stack()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

As noted in the __enter__ feedback, the OTEL context must be detached. Also, the logic must ensure that _pop_trace_stack is only called if a span was actually pushed.

Suggested change
self.span.end()
_pop_trace_stack()
if getattr(self, "_token", None) is not None:
detach_trace_context(self._token)
self.span.end()
_pop_trace_stack()
Comment on lines +428 to +433
def _append_trace_stack(span):
"""Append span to global trace stack."""
trace_stack = TRACE_STACK.get([])
trace_stack.append(span)
TRACE_STACK.set(trace_stack)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using a mutable list inside a ContextVar is dangerous when tasks are nested or concurrent. When a child task is created, it inherits the same list instance. If the child task appends to it, it modifies the parent's list. This leads to race conditions and incorrect stack states. Use an immutable structure like a tuple to ensure proper isolation.

Suggested change
def _append_trace_stack(span):
"""Append span to global trace stack."""
trace_stack = TRACE_STACK.get([])
trace_stack.append(span)
TRACE_STACK.set(trace_stack)
def _append_trace_stack(span):
"""Append span to global trace stack."""
trace_stack = TRACE_STACK.get(())
TRACE_STACK.set(trace_stack + (span,))
Comment on lines +435 to +441
def _pop_trace_stack():
"""Pop span to global trace stack."""
trace_stack = TRACE_STACK.get([])
if trace_stack:
trace_stack.pop()
TRACE_STACK.set(trace_stack)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to _append_trace_stack, this should use immutable operations to avoid cross-task pollution of the trace stack.

Suggested change
def _pop_trace_stack():
"""Pop span to global trace stack."""
trace_stack = TRACE_STACK.get([])
if trace_stack:
trace_stack.pop()
TRACE_STACK.set(trace_stack)
def _pop_trace_stack():
"""Pop span to global trace stack."""
trace_stack = TRACE_STACK.get(())
if trace_stack:
TRACE_STACK.set(trace_stack[:-1])
if is_span_recording():
set_http_span_attributes(
method=proxy_request.method,
status_code=int(status_code),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Calling int(status_code) here is risky because status_code might be None if an exception occurred before the response started, or it might be a non-numeric string in edge cases. int(None) will raise a TypeError, crashing the proxy's request handling loop.

Suggested change
status_code=int(status_code),
status_code=int(status_code) if status_code is not None else 500,
os.makedirs(spans_dir, exist_ok=True)
spans_file = os.path.join(spans_dir, tracing_file_name)

return [SimpleSpanProcessor(ConsoleSpanExporter(out=open(spans_file, "a")))]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The file opened here is never closed. While this is a factory function called during actor initialization, it still results in a leaked file descriptor per actor. Consider using a managed stream or ensuring the file is closed when the tracer provider is shut down.

Signed-off-by: abrar <abrar@anyscale.com>
self.span.end()
_pop_trace_stack()

return False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsampled spans incorrectly pop from trace stack

Medium Severity

TraceContextManager.__exit__ unconditionally calls _pop_trace_stack() when self.span is not None, but __enter__ skips _append_trace_stack() for unsampled spans (early return at the trace_flags.sampled check). This creates a push/pop mismatch — __exit__ pops a span that was never pushed. If a sampled parent span exists on the stack, the unsampled child's exit would incorrectly pop the parent, corrupting the trace stack for subsequent attribute/status calls on the parent span.

Additional Locations (1)

Fix in Cursor Fix in Web

trace_context = extract_propagated_context(
{key.decode(): value.decode()}
)
set_trace_context(trace_context)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proxy tracing omits tracestate header extraction

Low Severity

ASGIProxyRequest.populate_tracing_context and gRPCProxyRequest.populate_tracing_context only extract the traceparent header, discarding tracestate. In contrast, the direct ingress path in Replica.get_asgi_tracing_context and Replica.get_grpc_tracing_context correctly extracts both traceparent and tracestate. This inconsistency causes vendor-specific trace state information to be lost when requests go through the proxy.

Additional Locations (1)

Fix in Cursor Fix in Web

method=request_metadata._http_method,
status_code=status_code,
route=request_metadata.route,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replica passes string status_code instead of integer

Low Severity

In Replica._record_errors_and_metrics, status_code (an Optional[str]) is passed directly to set_http_span_attributes without casting to int. The proxy correctly calls int(status_code) at the same call site. Per OpenTelemetry semantic conventions, http.status_code expects an integer. This inconsistency causes the replica to emit a string-typed status code attribute (e.g., "200") while the proxy emits an integer (200), which may confuse tracing backends.

Fix in Cursor Fix in Web

@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Feb 21, 2026
Signed-off-by: abrar <abrar@anyscale.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

os.makedirs(spans_dir, exist_ok=True)
spans_file = os.path.join(spans_dir, tracing_file_name)

return [SimpleSpanProcessor(ConsoleSpanExporter(out=open(spans_file, "a")))]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File handle leaked in default tracing exporter

Low Severity

default_tracing_exporter calls open(spans_file, "a") and passes the file object directly to ConsoleSpanExporter without ever closing it. The file handle remains open for the lifetime of the process with no cleanup mechanism, leaking a file descriptor per component.

Fix in Cursor Fix in Web

)
from ray.serve._private.proxy_response_generator import ProxyResponseGenerator
from ray.serve._private.proxy_router import ProxyRouter
from ray.serve._private.tracing_utils import (

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parity except import path being different


from ray.serve._private.common import StreamingHTTPRequest, gRPCRequest
from ray.serve._private.constants import SERVE_LOGGER_NAME
from ray.serve._private.tracing_utils import (

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parity except import path being different

extract_route_patterns,
get_asgi_route_name,
)
from ray.serve._private.tracing_utils import (

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parity

PowerOfTwoChoicesRequestRouter,
)
from ray.serve._private.request_router.replica_wrapper import RunningReplica
from ray.serve._private.tracing_utils import (

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parity except import

@@ -0,0 +1,501 @@
import inspect

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parity except import, env vars, typing and docstirng

@@ -0,0 +1,1031 @@
import json

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parity except imports

BATCH_WAIT_TIME_BUCKETS_MS,
SERVE_LOGGER_NAME,
)
from ray.serve._private.tracing_utils import (

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parity except imports

@harshit-anyscale

harshit-anyscale commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

@abrarsheikh can we break it into smaller PRs? It will help in the review process

@eicherseiji eicherseiji self-requested a review February 24, 2026 20:48

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixtures are also parity, verified myself

@abrarsheikh abrarsheikh merged commit 50b981d into master Feb 25, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the abrar-tracking-serve branch February 25, 2026 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

3 participants