[Serve] Add tracing support in Ray Serve#61230
Conversation
Signed-off-by: abrar <abrar@anyscale.com>
There was a problem hiding this comment.
Code Review
The pull request introduces distributed tracing support to Ray Serve using OpenTelemetry. This is a significant enhancement for observability. However, the current implementation has several critical issues related to OpenTelemetry context management and internal state consistency. Specifically, the TraceContextManager fails to detach OpenTelemetry contexts, which will lead to context leakage across requests. Additionally, the internal TRACE_STACK uses a mutable list in a ContextVar, which is not thread-safe for inherited contexts and can cause stack corruption. There are also logic errors in how non-sampled spans are handled, potentially leading to parent spans being incorrectly popped from the stack. Addressing these issues is essential for the stability and correctness of the tracing infrastructure.
| if not self.span.get_span_context().trace_flags.sampled: | ||
| return self |
There was a problem hiding this comment.
Returning early when a span is not sampled causes a mismatch between __enter__ and __exit__. Since self.span is already started at line 96, __exit__ will still execute and call _pop_trace_stack(). Because _append_trace_stack was skipped here, _pop_trace_stack will incorrectly pop the parent span (or whatever is currently on top of the stack), leading to stack corruption for all subsequent spans in this task.
| if not self.span.get_span_context().trace_flags.sampled: | ||
| return self | ||
| new_ctx = set_span_in_context(self.span) | ||
| set_trace_context(new_ctx) |
There was a problem hiding this comment.
The OpenTelemetry context attached via set_trace_context (which calls opentelemetry.context.attach) must be detached in __exit__ using the returned token. Failing to do so causes the context to leak, meaning subsequent requests handled by the same worker thread/task might incorrectly inherit this request's trace context as their parent.
| set_trace_context(new_ctx) | |
| self._token = set_trace_context(new_ctx) |
| self.span.end() | ||
| _pop_trace_stack() |
There was a problem hiding this comment.
As noted in the __enter__ feedback, the OTEL context must be detached. Also, the logic must ensure that _pop_trace_stack is only called if a span was actually pushed.
| self.span.end() | |
| _pop_trace_stack() | |
| if getattr(self, "_token", None) is not None: | |
| detach_trace_context(self._token) | |
| self.span.end() | |
| _pop_trace_stack() |
| def _append_trace_stack(span): | ||
| """Append span to global trace stack.""" | ||
| trace_stack = TRACE_STACK.get([]) | ||
| trace_stack.append(span) | ||
| TRACE_STACK.set(trace_stack) | ||
|
|
There was a problem hiding this comment.
Using a mutable list inside a ContextVar is dangerous when tasks are nested or concurrent. When a child task is created, it inherits the same list instance. If the child task appends to it, it modifies the parent's list. This leads to race conditions and incorrect stack states. Use an immutable structure like a tuple to ensure proper isolation.
| def _append_trace_stack(span): | |
| """Append span to global trace stack.""" | |
| trace_stack = TRACE_STACK.get([]) | |
| trace_stack.append(span) | |
| TRACE_STACK.set(trace_stack) | |
| def _append_trace_stack(span): | |
| """Append span to global trace stack.""" | |
| trace_stack = TRACE_STACK.get(()) | |
| TRACE_STACK.set(trace_stack + (span,)) | |
| def _pop_trace_stack(): | ||
| """Pop span to global trace stack.""" | ||
| trace_stack = TRACE_STACK.get([]) | ||
| if trace_stack: | ||
| trace_stack.pop() | ||
| TRACE_STACK.set(trace_stack) | ||
|
|
There was a problem hiding this comment.
Similar to _append_trace_stack, this should use immutable operations to avoid cross-task pollution of the trace stack.
| def _pop_trace_stack(): | |
| """Pop span to global trace stack.""" | |
| trace_stack = TRACE_STACK.get([]) | |
| if trace_stack: | |
| trace_stack.pop() | |
| TRACE_STACK.set(trace_stack) | |
| def _pop_trace_stack(): | |
| """Pop span to global trace stack.""" | |
| trace_stack = TRACE_STACK.get(()) | |
| if trace_stack: | |
| TRACE_STACK.set(trace_stack[:-1]) | |
| if is_span_recording(): | ||
| set_http_span_attributes( | ||
| method=proxy_request.method, | ||
| status_code=int(status_code), |
There was a problem hiding this comment.
Calling int(status_code) here is risky because status_code might be None if an exception occurred before the response started, or it might be a non-numeric string in edge cases. int(None) will raise a TypeError, crashing the proxy's request handling loop.
| status_code=int(status_code), | |
| status_code=int(status_code) if status_code is not None else 500, |
| os.makedirs(spans_dir, exist_ok=True) | ||
| spans_file = os.path.join(spans_dir, tracing_file_name) | ||
|
|
||
| return [SimpleSpanProcessor(ConsoleSpanExporter(out=open(spans_file, "a")))] |
There was a problem hiding this comment.
Signed-off-by: abrar <abrar@anyscale.com>
| self.span.end() | ||
| _pop_trace_stack() | ||
|
|
||
| return False |
There was a problem hiding this comment.
Unsampled spans incorrectly pop from trace stack
Medium Severity
TraceContextManager.__exit__ unconditionally calls _pop_trace_stack() when self.span is not None, but __enter__ skips _append_trace_stack() for unsampled spans (early return at the trace_flags.sampled check). This creates a push/pop mismatch — __exit__ pops a span that was never pushed. If a sampled parent span exists on the stack, the unsampled child's exit would incorrectly pop the parent, corrupting the trace stack for subsequent attribute/status calls on the parent span.
Additional Locations (1)
| trace_context = extract_propagated_context( | ||
| {key.decode(): value.decode()} | ||
| ) | ||
| set_trace_context(trace_context) |
There was a problem hiding this comment.
Proxy tracing omits tracestate header extraction
Low Severity
ASGIProxyRequest.populate_tracing_context and gRPCProxyRequest.populate_tracing_context only extract the traceparent header, discarding tracestate. In contrast, the direct ingress path in Replica.get_asgi_tracing_context and Replica.get_grpc_tracing_context correctly extracts both traceparent and tracestate. This inconsistency causes vendor-specific trace state information to be lost when requests go through the proxy.
Additional Locations (1)
| method=request_metadata._http_method, | ||
| status_code=status_code, | ||
| route=request_metadata.route, | ||
| ) |
There was a problem hiding this comment.
Replica passes string status_code instead of integer
Low Severity
In Replica._record_errors_and_metrics, status_code (an Optional[str]) is passed directly to set_http_span_attributes without casting to int. The proxy correctly calls int(status_code) at the same call site. Per OpenTelemetry semantic conventions, http.status_code expects an integer. This inconsistency causes the replica to emit a string-typed status code attribute (e.g., "200") while the proxy emits an integer (200), which may confuse tracing backends.
Signed-off-by: abrar <abrar@anyscale.com>
| os.makedirs(spans_dir, exist_ok=True) | ||
| spans_file = os.path.join(spans_dir, tracing_file_name) | ||
|
|
||
| return [SimpleSpanProcessor(ConsoleSpanExporter(out=open(spans_file, "a")))] |
There was a problem hiding this comment.
File handle leaked in default tracing exporter
Low Severity
default_tracing_exporter calls open(spans_file, "a") and passes the file object directly to ConsoleSpanExporter without ever closing it. The file handle remains open for the lifetime of the process with no cleanup mechanism, leaking a file descriptor per component.
| ) | ||
| from ray.serve._private.proxy_response_generator import ProxyResponseGenerator | ||
| from ray.serve._private.proxy_router import ProxyRouter | ||
| from ray.serve._private.tracing_utils import ( |
There was a problem hiding this comment.
Parity except import path being different
|
|
||
| from ray.serve._private.common import StreamingHTTPRequest, gRPCRequest | ||
| from ray.serve._private.constants import SERVE_LOGGER_NAME | ||
| from ray.serve._private.tracing_utils import ( |
There was a problem hiding this comment.
parity except import path being different
| extract_route_patterns, | ||
| get_asgi_route_name, | ||
| ) | ||
| from ray.serve._private.tracing_utils import ( |
| PowerOfTwoChoicesRequestRouter, | ||
| ) | ||
| from ray.serve._private.request_router.replica_wrapper import RunningReplica | ||
| from ray.serve._private.tracing_utils import ( |
There was a problem hiding this comment.
parity except import
| @@ -0,0 +1,501 @@ | |||
| import inspect | |||
There was a problem hiding this comment.
parity except import, env vars, typing and docstirng
| @@ -0,0 +1,1031 @@ | |||
| import json | |||
There was a problem hiding this comment.
parity except imports
| BATCH_WAIT_TIME_BUCKETS_MS, | ||
| SERVE_LOGGER_NAME, | ||
| ) | ||
| from ray.serve._private.tracing_utils import ( |
There was a problem hiding this comment.
parity except imports
|
@abrarsheikh can we break it into smaller PRs? It will help in the review process |
There was a problem hiding this comment.
The fixtures are also parity, verified myself


RAY_SERVE_TRACING_EXPORTER_IMPORT_PATHenvironment variable (disabled by default).ray.serve.get_trace_context()as a public API so users can create child spans linked to the Serve request trace from within their deployments.Details
Core tracing infrastructure (
ray/serve/_private/tracing_utils.py):TraceContextManagerandBatchTraceContextManagerfor span lifecycle managementtracing_decorator_factoryfor instrumenting functions with spanscreate_propagated_context,extract_propagated_context)RAY_SERVE_TRACING_SAMPLING_RATIOand pluggable exportersInstrumented components:
proxy.py,proxy_request_response.py): Extractstraceparent/tracestatefrom incoming HTTP headers and gRPC metadata; creates proxy-level spans with HTTP/RPC semantic attributesrouter.py): Creates routing spans and propagates context to replicasreplica.py): Creates replica-level spans, records error attributes and status codesbatching.py): Links batched execution spans to the first request's trace contextPublic API (
ray.serve.get_trace_context()):Test plan
test_tracing_utils.pywith unit tests for span lifecycle, status setting, multithreaded trace stacks, and custom exporters