[serve] haproxy ingress request router metrics by akyang-anyscale · Pull Request #63356 · ray-project/ray

akyang-anyscale · 2026-05-15T01:04:52Z

Adds observability for the HAProxy ingress request router through the following metrics and logs:

Metrics:

| Metric | Type | Description |
|--------|------|-------------|
| `serve_haproxy_ingress_router_latency_ms` | Histogram | Wall-clock time HAProxy spent consulting the ingress request router (measured around the Lua socket call) for all consultations, both successful and failed. Buckets cover 0.5 ms to 1 s. Tagged by outcome: `success` or `failure`. |
| `serve_haproxy_ingress_router_truncations` | Counter | Number of requests whose body was clipped by HAProxy's `tune.bufsize` before being forwarded to the router. The router still gets a prefix plus an `X-Body-Truncated: <have>/<full>` header. Non-zero values indicate a body-aware policy may be missing context; consider raising `RAY_SERVE_HAPROXY_INGRESS_REQUEST_ROUTER_BUFSIZE`. |
| `serve_haproxy_ingress_router_server_mismatch` | Counter | Number of requests where HAProxy ultimately routed to a different replica than the router returned. This happens when the named replica is DOWN and `option redispatch` falls through to load balancing. Non-zero values indicate the router's view of replica health is stale, or replicas are flapping. |
| `serve_haproxy_ingress_router_failures` | Counter | Number of router consultations that failed to pin a replica. Each failure causes HAProxy to return 503 to the client. The `reason` tag is one of `router_unreachable` (socket connect/send/recv failed), `router_non_200` (router returned a non-200 status), `unparseable_replica_id` (router returned 200 but the body didn't contain a string `replica_id`), or `unknown_replica_id` (router returned a `replica_id` not in HAProxy's current replica map). |
| `serve_haproxy_ingress_router_requests_total` | Counter | Total number of requests handled by the ingress router. |
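
To make the four `reason` values concrete, here is a minimal Python sketch of the decision order described for `serve_haproxy_ingress_router_failures`. It is only an illustration: the real check runs in HAProxy's Lua code, the function name and arguments are hypothetical, and treating the router body as JSON is an assumption not stated in this PR.

```python
import json
from typing import Mapping, Optional


def classify_router_failure(
    reachable: bool,
    status: Optional[int],
    body: Optional[str],
    replica_map: Mapping[str, str],
) -> Optional[str]:
    """Return the failure reason tag, or None when a replica was pinned."""
    if not reachable:
        # Socket connect/send/recv failed.
        return "router_unreachable"
    if status != 200:
        return "router_non_200"
    try:
        # Assumption: the router answers with a JSON object.
        payload = json.loads(body or "")
    except ValueError:
        return "unparseable_replica_id"
    replica_id = payload.get("replica_id") if isinstance(payload, dict) else None
    if not isinstance(replica_id, str):
        return "unparseable_replica_id"
    if replica_id not in replica_map:
        # The router named a replica HAProxy does not currently know about.
        return "unknown_replica_id"
    return None
```

Every non-`None` result corresponds to one increment of the failures counter and a 503 returned to the client.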

To produce these metrics, HAProxy emits one log line per request to a metrics socket. Each line carries request and ingress router metadata (target server, actual server, router latency, failure reason, etc.). HAProxyManager reads from this socket, parses each log line, and increments the matching counters or records a histogram observation.
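
The parse-and-record step can be sketched in a few lines of Python. This is not the code in `haproxy_metrics.py`; the field names (`application`, `target_server`, `actual_server`, `latency_ms`, `failure_reason`, `truncated`), the `router@0` SD-ID, and the in-memory `Counter` are all hypothetical stand-ins chosen for illustration. Only the RFC 5424 `name="value"` escaping rules are taken from the standard.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# RFC 5424 SD-PARAM: name="value", where ", \ and ] are backslash-escaped.
_SD_PARAM = re.compile(r'([A-Za-z_][\w.-]*)="((?:[^"\\\]]|\\.)*)"')
_SD_ESCAPE = re.compile(r'\\(["\\\]])')


def parse_sd_params(line: str) -> Dict[str, str]:
    """Extract name="value" pairs from a log line and unescape the values."""
    return {k: _SD_ESCAPE.sub(r"\1", v) for k, v in _SD_PARAM.findall(line)}


def record(
    fields: Dict[str, str],
    counters: Counter,
    latencies_ms: List[Tuple[str, float]],
) -> None:
    """Update counters and latency observations from one parsed log line."""
    app = fields.get("application", "")
    counters[("requests", app)] += 1

    reason = fields.get("failure_reason", "")
    outcome = "failure" if reason else "success"
    if reason:
        counters[("failures", app, reason)] += 1
    if fields.get("truncated") == "1":
        counters[("truncations", app)] += 1

    target = fields.get("target_server", "")
    actual = fields.get("actual_server", "")
    if target and actual and target != actual:
        counters[("server_mismatch", app)] += 1

    try:
        latencies_ms.append((outcome, float(fields["latency_ms"])))
    except (KeyError, ValueError):
        pass  # Skip lines without a usable latency value.


# Hypothetical log line; the real format is defined by the HAProxy template.
LINE = (
    '<142>1 2026-05-20T00:00:00Z host haproxy 1 - '
    '[router@0 application="app1" target_server="r1" actual_server="r2" '
    'latency_ms="1.5" failure_reason="" truncated="0"] -'
)
```

Calling `record(parse_sd_params(LINE), counters, latencies)` on the sample line counts one request and one server mismatch for `app1` and records a 1.5 ms observation with outcome `success`.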

Logs:

The controller logs whether an ingress request router exists for each ingress deployment.
Ingress request router targets are included in HAProxy's long poll receive log.

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces per-request metrics for the HAProxy ingress request router by adding configuration options, updating Lua and HAProxy templates to capture latency and truncation data, and implementing a socket-based collection mechanism in the HAProxyManager. Feedback highlights a missing module import that would lead to runtime errors and suggests explicitly cancelling the metrics attachment task during shutdown to prevent dangling event loop tasks.
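
For readers unfamiliar with the second point, explicitly cancelling a background task on shutdown usually looks like the sketch below. The coroutine and function names are hypothetical and do not reflect the actual `HAProxyManager` code; the pattern itself uses only standard `asyncio` calls.

```python
import asyncio
import contextlib


async def attach_metrics() -> None:
    # Stand-in for a long-running metrics attachment coroutine (hypothetical).
    await asyncio.sleep(3600)


async def shutdown(task: asyncio.Task) -> None:
    # Request cancellation, then wait for the task to actually finish so it
    # does not linger on the event loop after shutdown returns.
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task


async def main() -> bool:
    task = asyncio.create_task(attach_metrics())
    await asyncio.sleep(0)  # Give the task a chance to start.
    await shutdown(task)
    return task.cancelled()
```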

eicherseiji

Some metric naming convention nits. Could we also add serve_haproxy_ingress_router_requests_total tagged with {application} to get an easy denominator for rate?

eicherseiji · 2026-05-19T20:59:52Z

+
+| Metric | Type | Tags | Description |
+|--------|------|------|-------------|
+| `serve_haproxy_ingress_router_latency_ms` | Histogram | `application` | Wall-clock time HAProxy spent consulting the ingress request router (measured around the Lua socket call) for **successful** consultations only. Buckets cover 0.5 ms to 1 s. Use to detect router-side slowdowns before they show up in end-to-end p99. |


Suggested change

| `serve_haproxy_ingress_router_latency_ms` | Histogram | `application` | Wall-clock time HAProxy spent consulting the ingress request router (measured around the Lua socket call) for **successful** consultations only. Buckets cover 0.5 ms to 1 s. Use to detect router-side slowdowns before they show up in end-to-end p99. |

| `serve_haproxy_ingress_router_duration_seconds` | Histogram | `application` | Wall-clock time HAProxy spent consulting the ingress request router (measured around the Lua socket call) for **successful** consultations only. Buckets cover 0.5 ms to 1 s. Use to detect router-side slowdowns before they show up in end-to-end p99. |

Could we also include failures in this histogram with the option to split by outcome="success|failure"?

I think the latency_ms suffix fits better with the naming convention of other metrics, but I can apply the suggestion if you feel strongly about it.

Nah not blocking. The _seconds part is the most relevant best practice https://prometheus.io/docs/practices/naming/

latency by outcome
histogram_quantile(0.9, sum(rate(ray_serve_haproxy_ingress_router_latency_ms_bucket{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs", outcome=~".+"}[5m])) by (application, outcome, le))

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit ded3226.

eicherseiji

🙌

akyang-anyscale · 2026-05-20T04:59:44Z

total request counter + other metrics

sum(rate(ray_serve_haproxy_ingress_router_requests_total{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs"}[5m])) by (application, deployment, replica)

sum(rate(ray_serve_haproxy_ingress_router_truncations_total{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs"}[5m])) by (application, deployment, replica)

sum(rate(ray_serve_haproxy_ingress_router_failures_total{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs"}[5m])) by (application, reason, replica)

sum(rate(ray_serve_haproxy_ingress_router_server_mismatch_total{route=~".*",route!~"/-/.*",application=~".*",deployment=~".*",replica=~".*",ray_io_cluster=~".*",ClusterId=~"ses_fpdiazhqrzpeakwf1aw4zmrdrs"}[5m])) by (application, replica)
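
As a small illustration of why the total request counter is useful as a denominator, the sketch below computes a failure ratio from two snapshots of the failure and request counters. The function names are hypothetical, and the reset handling is a simplification of what a `rate()` query does, not a reimplementation of it.

```python
from typing import Optional


def counter_delta(previous: float, current: float) -> float:
    """Increase between two samples, treating a drop as a counter reset."""
    return current if current < previous else current - previous


def failure_ratio(
    prev_failures: float,
    curr_failures: float,
    prev_requests: float,
    curr_requests: float,
) -> Optional[float]:
    """Fraction of requests that failed in the window, or None if idle."""
    requests = counter_delta(prev_requests, curr_requests)
    if requests == 0:
        return None  # No traffic in the window, so the ratio is undefined.
    return counter_delta(prev_failures, curr_failures) / requests
```

For example, 5 new failures over 100 new requests gives a ratio of 0.05.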

kouroshHakha

Overall this is clean work — the syslog-over-dgram approach is lightweight, the collector cleanly separates socket lifecycle from metric ownership, and the metrics-disabled path genuinely has zero overhead (no log target, no Lua timing calls, no socket). Two things below worth addressing before merge.

Note

This review was co-written with AI assistance (Claude Code).

kouroshHakha · 2026-05-20T20:29:44Z

+    # Per-request metrics for the ingress request router. Goes only to the
+    # rfc5424 target below; the inherited rfc3164 targets do not include the
+    # SD section, so their byte stream is unchanged.
+    log {{ metrics_socket_path }} len 8192 format rfc5424 local1 info


HAProxy log directive in a section overrides inherited log global, silently dropping standard access logs when metrics are enabled.

HAProxy docs: "If at least one log directive appears in a proxy section, it takes precedence over and replaces all inherited log directives." The defaults section has log global, but once frontend http_frontend gains its own log {{ metrics_socket_path }} ... line, it no longer inherits log global — all standard syslog logging (to /dev/log and 127.0.0.1:{{ config.syslog_port }}) stops for this frontend.

The comment above says "the inherited rfc3164 targets do not include the SD section, so their byte stream is unchanged" — but the problem is the rfc3164 targets won't be reached at all when metrics are enabled.

Fix: add log global in the metrics block so both targets are active:

 {%- if ingress_request_router_metrics_enabled and has_ingress_request_router %}
+    log global
     log {{ metrics_socket_path }} len 8192 format rfc5424 local1 info
     log-format-sd "..."
 {%- endif %}

kouroshHakha · 2026-05-20T20:29:44Z

                extra={"log_to_stderr": False},
            )

            await self._haproxy.stop()


_metrics_collector.close() is skipped when _haproxy.stop() raises.

If stop() throws, the collector's dgram transport and socket file are leaked. Moving collector cleanup to a finally block fixes it:

            try:
                await self._haproxy.stop()
            finally:
                if self._metrics_collector is not None:
                    self._metrics_collector.close()
                    self._metrics_collector = None
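
The effect of the `finally` block can be checked in isolation with stand-in classes. Everything below (`FakeCollector`, `FailingHAProxy`, `Manager`) is hypothetical scaffolding for illustration, not the real Ray Serve classes. It only demonstrates that the collector is closed even when `stop()` raises.

```python
import asyncio
from typing import Optional


class FakeCollector:
    """Stand-in for the metrics collector; only tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class FailingHAProxy:
    """Stand-in whose stop() always raises, to exercise the failure path."""

    async def stop(self) -> None:
        raise RuntimeError("stop failed")


class Manager:
    def __init__(self) -> None:
        self._haproxy = FailingHAProxy()
        self._metrics_collector: Optional[FakeCollector] = FakeCollector()

    async def shutdown(self) -> None:
        try:
            await self._haproxy.stop()
        finally:
            # Runs whether or not stop() raised, so the collector is not leaked.
            if self._metrics_collector is not None:
                self._metrics_collector.close()
                self._metrics_collector = None


async def demo() -> bool:
    manager = Manager()
    collector = manager._metrics_collector
    try:
        await manager.shutdown()
    except RuntimeError:
        pass  # The error from stop() still propagates to the caller.
    return collector.closed and manager._metrics_collector is None
```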

metrics

dc6a695

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

gemini-code-assist Bot reviewed May 15, 2026

Comment thread python/ray/serve/_private/haproxy.py

Comment thread python/ray/serve/_private/haproxy.py Outdated

akyang-anyscale added 5 commits May 15, 2026 19:14

refactor collector

3f4220b

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

add collector + lint

0cea156

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

udpate metrics

ef71eb4

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

test and doc

5f37562

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

lint

685a327

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

akyang-anyscale marked this pull request as ready for review May 15, 2026 21:10

akyang-anyscale requested a review from a team as a code owner May 15, 2026 21:10

cursor Bot reviewed May 15, 2026

Comment thread python/ray/serve/_private/haproxy_metrics.py

Comment thread python/ray/serve/_private/haproxy_templates.py

rename var

0171971

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

akyang-anyscale added the "go" label (add ONLY when ready to merge, run all tests) May 15, 2026

cursor Bot reviewed May 15, 2026

Comment thread python/ray/serve/_private/constants.py

akyang-anyscale added 5 commits May 15, 2026 21:51

spacing

9373dbc

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

logs

342669c

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

env bool

942dd72

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

bazel

f5c34e7

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

Merge branch 'master' into alexyang/ingress-rr-o11y

601acef

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

ray-gardener Bot added the "serve" (Ray Serve Related Issue) and "observability" (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels May 16, 2026

fix sentinel

f48c94c

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

cursor Bot reviewed May 18, 2026

Comment thread python/ray/serve/_private/haproxy.py Outdated

Comment thread python/ray/serve/_private/haproxy.py Outdated

akyang-anyscale added 2 commits May 18, 2026 07:38

try/catch

564badd

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

log-format

7d0c2f1

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

akyang-anyscale requested review from abrarsheikh, eicherseiji and kouroshHakha May 19, 2026 01:32

akyang-anyscale added 2 commits May 19, 2026 01:53

disable by default

e5464c2

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

disable by default

c84fb12

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

eicherseiji reviewed May 19, 2026

metric names + latency outcome

ded3226

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

cursor Bot reviewed May 20, 2026

Comment thread python/ray/serve/_private/haproxy_metrics.py

akyang-anyscale added 2 commits May 20, 2026 00:32

total requests counter

22ffdb5

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

latency outcome

7a5ac39

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

eicherseiji approved these changes May 20, 2026

Merge branch 'master' into alexyang/ingress-rr-o11y

937cb3f

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

kouroshHakha reviewed May 20, 2026

comments

d7d4e4f

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

kouroshHakha approved these changes May 20, 2026

kouroshHakha enabled auto-merge (squash) May 20, 2026 20:59

kouroshHakha merged commit a3a0d26 into ray-project:master May 20, 2026
7 checks passed

kouroshHakha merged 22 commits into ray-project:master from akyang-anyscale:alexyang/ingress-rr-o11y
