Skip to content

[serve] haproxy ingress request router metrics#63356

Merged
kouroshHakha merged 22 commits into
ray-project:masterfrom
akyang-anyscale:alexyang/ingress-rr-o11y
May 20, 2026
Merged

[serve] haproxy ingress request router metrics#63356
kouroshHakha merged 22 commits into
ray-project:masterfrom
akyang-anyscale:alexyang/ingress-rr-o11y

Conversation

@akyang-anyscale

@akyang-anyscale akyang-anyscale commented May 15, 2026

Copy link
Copy Markdown
Contributor

Adds observability into HAProxy ingress request router metrics. Adds the following:

Metrics:

  • serve_haproxy_ingress_router_latency_ms | Histogram | Wall-clock time HAProxy spent consulting the ingress request router (measured around the Lua socket call) for all consultations only. Buckets cover 0.5 ms to 1 s. Categorized by outcome "success" and "failure".
  • serve_haproxy_ingress_router_truncations | Counter | Number of requests whose body was clipped by HAProxy's tune.bufsize before being forwarded to the router. The router still gets a prefix plus an X-Body-Truncated: <have>/<full> header. Non-zero values indicate a body-aware policy may be missing context; consider raising RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE.
  • serve_haproxy_ingress_router_server_mismatch | Counter | Number of requests where HAProxy ultimately routed to a different replica than the router returned. This happens when the named replica is DOWN and option redispatch falls through to load balancing. Non-zero values indicate the router's view of replica health is stale, or replicas are flapping.
  • serve_haproxy_ingress_router_failures | Counter | Number of router consultations that failed to pin a replica. Each failure causes HAProxy to return 503 to the client. The reason tag is one of router_unreachable (socket connect/send/recv failed), router_non_200 (router returned a non-200 status), unparseable_replica_id (router 200 but body didn't contain a string replica_id), or unknown_replica_id (router returned a replica_id not in HAProxy's current replica map).
  • serve_haproxy_ingress_router_requests_total Total number of requests handled by the ingress router.

For metrics, haproxy is emitting logs with request + ingress router metadata (target server, actual server, router latency, failure reason, etc.) to a metrics socket. This socket is scraped by HAProxyManager, which parses the log line, and ultimately increments/adds an observation to the appropriate counters.
Screenshot 2026-05-18 at 6 44 40 PM
Screenshot 2026-05-18 at 6 44 55 PM

Logs:

  • controller log saying if ingress request router exists for an ingress deployment
  • ingress request router targets are added to haproxy's long poll receive log
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces per-request metrics for the HAProxy ingress request router by adding configuration options, updating Lua and HAProxy templates to capture latency and truncation data, and implementing a socket-based collection mechanism in the HAProxyManager. Feedback highlights a missing module import that would lead to runtime errors and suggests explicitly cancelling the metrics attachment task during shutdown to prevent dangling event loop tasks.

Comment thread python/ray/serve/_private/haproxy.py
Comment thread python/ray/serve/_private/haproxy.py Outdated
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
@akyang-anyscale akyang-anyscale marked this pull request as ready for review May 15, 2026 21:10
@akyang-anyscale akyang-anyscale requested a review from a team as a code owner May 15, 2026 21:10
Comment thread python/ray/serve/_private/haproxy_metrics.py
Comment thread python/ray/serve/_private/haproxy_templates.py
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
@akyang-anyscale akyang-anyscale added the go add ONLY when ready to merge, run all tests label May 15, 2026
Comment thread python/ray/serve/_private/constants.py
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels May 16, 2026
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Comment thread python/ray/serve/_private/haproxy.py Outdated
Comment thread python/ray/serve/_private/haproxy.py Outdated
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

@eicherseiji eicherseiji left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some metric naming convention nits. Could we also add serve_haproxy_ingress_router_requests_total tagged with {application} to get an easy denominator for rate?

Comment thread doc/source/serve/monitoring.md Outdated

| Metric | Type | Tags | Description |
|--------|------|------|-------------|
| `serve_haproxy_ingress_router_latency_ms` | Histogram | `application` | Wall-clock time HAProxy spent consulting the ingress request router (measured around the Lua socket call) for **successful** consultations only. Buckets cover 0.5 ms to 1 s. Use to detect router-side slowdowns before they show up in end-to-end p99. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `serve_haproxy_ingress_router_latency_ms` | Histogram | `application` | Wall-clock time HAProxy spent consulting the ingress request router (measured around the Lua socket call) for **successful** consultations only. Buckets cover 0.5 ms to 1 s. Use to detect router-side slowdowns before they show up in end-to-end p99. |
| `serve_haproxy_ingress_router_duration_seconds` | Histogram | `application` | Wall-clock time HAProxy spent consulting the ingress request router (measured around the Lua socket call) for **successful** consultations only. Buckets cover 0.5 ms to 1 s. Use to detect router-side slowdowns before they show up in end-to-end p99. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also include failures in this histogram with the option to split by outcome="success|failure"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the latency_ms suffix fits better with the naming convention of other metrics, but I can apply the suggestion if you feel strongly about it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah not blocking. The _seconds part is the most relevant best practice https://prometheus.io/docs/practices/naming/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2026-05-19 at 9 57 28 PM latency by outcome
histogram_quantile(0.9, sum(rate(ray_serve_haproxy_ingress_router_latency_ms_bucket{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs", outcome=~".+"}[5m])) by (application, outcome, le))
Comment thread doc/source/serve/monitoring.md Outdated
Comment thread doc/source/serve/monitoring.md Outdated
Comment thread doc/source/serve/monitoring.md Outdated
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit ded3226. Configure here.

Comment thread python/ray/serve/_private/haproxy_metrics.py
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

@eicherseiji eicherseiji left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
@akyang-anyscale

akyang-anyscale commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

total request counter + other metrics

Screenshot 2026-05-19 at 10 02 36 PM
sum(rate(ray_serve_haproxy_ingress_router_requests_total{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs"}[5m])) by (application, deployment, replica)

sum(rate(ray_serve_haproxy_ingress_router_truncations_total{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs"}[5m])) by (application, deployment, replica)

sum(rate(ray_serve_haproxy_ingress_router_failures_total{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs"}[5m])) by (application, reason, replica)

sum(rate(ray_serve_haproxy_ingress_router_server_mismatch_total{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs"}[5m])) by (application, replica)

@kouroshHakha kouroshHakha left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is clean work — the syslog-over-dgram approach is lightweight, the collector cleanly separates socket lifecycle from metric ownership, and the metrics-disabled path genuinely has zero overhead (no log target, no Lua timing calls, no socket). Two things below worth addressing before merge.

Note

This review was co-written with AI assistance (Claude Code).

# Per-request metrics for the ingress request router. Goes only to the
# rfc5424 target below; the inherited rfc3164 targets do not include the
# SD section, so their byte stream is unchanged.
log {{ metrics_socket_path }} len 8192 format rfc5424 local1 info

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HAProxy log directive in a section overrides inherited log global, silently dropping standard access logs when metrics are enabled.

HAProxy docs: "If at least one log directive appears in a proxy section, it takes precedence over and replaces all inherited log directives." The defaults section has log global, but once frontend http_frontend gains its own log {{ metrics_socket_path }} ... line, it no longer inherits log global — all standard syslog logging (to /dev/log and 127.0.0.1:{{ config.syslog_port }}) stops for this frontend.

The comment above says "the inherited rfc3164 targets do not include the SD section, so their byte stream is unchanged" — but the problem is the rfc3164 targets won't be reached at all when metrics are enabled.

Fix: add log global in the metrics block so both targets are active:

    {%- if ingress_request_router_metrics_enabled and has_ingress_request_router %}
+   log global
    log {{ metrics_socket_path }} len 8192 format rfc5424 local1 info
    log-format-sd "..."
    {%- endif %}
extra={"log_to_stderr": False},
)

await self._haproxy.stop()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_metrics_collector.close() is skipped when _haproxy.stop() raises.

If stop() throws, the collector's dgram transport and socket file are leaked. Moving collector cleanup to a finally block fixes it:

try:
    await self._haproxy.stop()
finally:
    if self._metrics_collector is not None:
        self._metrics_collector.close()
        self._metrics_collector = None
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
@kouroshHakha kouroshHakha enabled auto-merge (squash) May 20, 2026 20:59
@kouroshHakha kouroshHakha merged commit a3a0d26 into ray-project:master May 20, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling serve Ray Serve Related Issue

3 participants