[Data][LLM] Add vLLM metrics export and Data LLM Grafana dashboard #60385
Code Review
This pull request introduces vLLM metrics exporting for Ray Data LLM batch inference and adds a corresponding Grafana dashboard for visualization. The changes are well-structured, adding a log_engine_metrics configuration option and integrating with vLLM's RayPrometheusStatLogger. The new dashboard provides valuable insights into vLLM engine performance. My review includes a couple of suggestions to enhance the new Grafana dashboard for better clarity and consistency.
Panel(
    id=8,
    title="vLLM: Queue Time",
    description="Time requests spend waiting in the queue before processing.",
    unit="s",
    targets=[
        Target(
            expr='sum by(model_name, WorkerId) (rate(ray_vllm_request_queue_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
            legend="{{model_name}} - {{WorkerId}}",
        ),
    ],
    fill=1,
    linewidth=2,
    stack=False,
    grid_pos=GridPos(12, 24, 12, 8),
),
The "vLLM: Queue Time" panel currently plots the rate of the histogram's _sum series (rate(..._sum)), that is, queue seconds accumulated per second. That value scales with request throughput, so it is hard to interpret. The panel's description, "Time requests spend waiting in the queue before processing," suggests that a per-request latency metric such as the mean or a percentile of queue time would be more appropriate and easier to read.
Since ray_vllm_request_queue_time_seconds is a histogram, you can create a more informative panel that is consistent with the other latency panels in this dashboard (e.g., TTFT, E2E Latency) by showing P50, P90, P95, P99, and Mean queue times. This would provide a clearer and more comprehensive view of queueing performance.
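To make the distinction concrete, here is a minimal Python sketch of the arithmetic behind the three kinds of series: the rate of `_sum`, the mean (`_sum` divided by `_count`), and a bucket-based quantile. The bucket boundaries and counts are made-up numbers, and `histogram_quantile` below is a simplified re-implementation of the linear interpolation Prometheus performs, written for illustration only. It is not vLLM or Ray code.

```python
import math


def mean_latency(sum_delta, count_delta):
    """Mean per-request latency over a window: increase(_sum) / increase(_count)."""
    return sum_delta / count_delta if count_delta else math.nan


def histogram_quantile(q, buckets):
    """Simplified quantile estimate from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound being +Inf (as in a Prometheus histogram).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: cap at the last finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical observations over a 60 s window: 100 requests, 30 s of total queue time.
window_seconds = 60.0
queue_seconds = 30.0
buckets = [(0.1, 50), (0.5, 80), (1.0, 95), (math.inf, 100)]

rate_of_sum = queue_seconds / window_seconds   # 0.5 queue-seconds per second
mean = mean_latency(queue_seconds, 100)        # 0.3 s per request
p50 = histogram_quantile(0.50, buckets)
p90 = histogram_quantile(0.90, buckets)
p99 = histogram_quantile(0.99, buckets)
```

With these numbers the rate of `_sum` is 0.5, which would double if twice as many requests arrived with the same per-request queue time, while the mean (0.3 s) and the quantiles would stay the same. That is the reason the per-request views are easier to read on a latency panel.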
Panel(
    id=8,
    title="vLLM: Queue Time",
    description="P50, P90, P95, P99, and Mean time requests spend waiting in the queue before processing.",
    unit="s",
    targets=[
        Target(
            expr='histogram_quantile(0.99, sum by(le, model_name, WorkerId) (rate(ray_vllm_request_queue_time_seconds_bucket{{model_name=~\"$vllm_model_name\", WorkerId=~\"$workerid\", {global_filters}}}[$interval])))',
            legend="P99 - {{model_name}} - {{WorkerId}}",
        ),
        Target(
            expr='histogram_quantile(0.95, sum by(le, model_name, WorkerId) (rate(ray_vllm_request_queue_time_seconds_bucket{{model_name=~\"$vllm_model_name\", WorkerId=~\"$workerid\", {global_filters}}}[$interval])))',
            legend="P95 - {{model_name}} - {{WorkerId}}",
        ),
        Target(
            expr='histogram_quantile(0.90, sum by(le, model_name, WorkerId) (rate(ray_vllm_request_queue_time_seconds_bucket{{model_name=~\"$vllm_model_name\", WorkerId=~\"$workerid\", {global_filters}}}[$interval])))',
            legend="P90 - {{model_name}} - {{WorkerId}}",
        ),
        Target(
            expr='histogram_quantile(0.50, sum by(le, model_name, WorkerId) (rate(ray_vllm_request_queue_time_seconds_bucket{{model_name=~\"$vllm_model_name\", WorkerId=~\"$workerid\", {global_filters}}}[$interval])))',
            legend="P50 - {{model_name}} - {{WorkerId}}",
        ),
        Target(
            expr='(sum by(model_name, WorkerId) (rate(ray_vllm_request_queue_time_seconds_sum{{model_name=~\"$vllm_model_name\", WorkerId=~\"$workerid\", {global_filters}}}[$interval]))) / (sum by(model_name, WorkerId) (rate(ray_vllm_request_queue_time_seconds_count{{model_name=~\"$vllm_model_name\", WorkerId=~\"$workerid\", {global_filters}}}[$interval])))',
            legend="Mean - {{model_name}} - {{WorkerId}}",
        ),
    ],
    fill=1,
    linewidth=2,
    stack=False,
    grid_pos=GridPos(12, 24, 12, 8),
),
"current": {
  "selected": true,
  "text": "5m",
  "value": "5m"
}
There is an inconsistency in the default value for the interval template variable. The options array has 30s marked as "selected": true, but the current value is set to 5m. While Grafana prioritizes the current value for initialization, this discrepancy can be confusing.
To ensure consistency and set a more common default for near-real-time monitoring, I suggest updating the current value to 30s to match the selected option.
- "current": {
-   "selected": true,
-   "text": "5m",
-   "value": "5m"
- }
+ "current": {
+   "selected": true,
+   "text": "30s",
+   "value": "30s"
+ }
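A mismatch like this can be caught mechanically. The following is a small sketch, assuming the usual Grafana dashboard JSON layout in which template variables live under `templating.list` and each variable carries a `current` object plus an `options` array. The function name `find_default_mismatches` and the sample dashboard are hypothetical and are not part of this PR.

```python
def find_default_mismatches(dashboard):
    """Return (name, current_value, selected_values) for each template variable
    whose `current` value is not among the options marked as selected."""
    mismatches = []
    for var in dashboard.get("templating", {}).get("list", []):
        current = var.get("current", {}).get("value")
        selected = [
            opt.get("value") for opt in var.get("options", []) if opt.get("selected")
        ]
        if selected and current not in selected:
            mismatches.append((var.get("name"), current, selected))
    return mismatches


# Hypothetical fragment reproducing the inconsistency described above.
sample_dashboard = {
    "templating": {
        "list": [
            {
                "name": "interval",
                "current": {"selected": True, "text": "5m", "value": "5m"},
                "options": [
                    {"selected": True, "text": "30s", "value": "30s"},
                    {"selected": False, "text": "5m", "value": "5m"},
                ],
            }
        ]
    }
}

mismatches = find_default_mismatches(sample_dashboard)
```

A check along these lines could run against the generated dashboard JSON in a unit test, so that `current` and `options` cannot drift apart unnoticed.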
Thank you so much @nrghosh for driving this.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Add Prometheus metrics export for Ray Data LLM batch inference. When enabled, vLLM engine metrics (TTFT, TPOT, prefix cache hit rate, KV cache utilization, etc.) are exported to Ray's metrics endpoint.
- Add `log_engine_metrics` config option (default=True)
- Integrate vLLM's RayPrometheusStatLogger
- Add Data LLM Grafana dashboard
Addresses ray-project#58360. Thanks @anindya-saha for the approach.
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…afana Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…rve llm Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
bd33c5f to 4718409
Aren't the Data and Serve LLM dashboards supposed to be identical, at least for the vLLM panels? I am afraid that if we want to add a new vLLM metric (e.g. NIXL transfer metrics), we would have to do it in two places, and the two dashboards would start to diverge.
I am fine with renaming the current serve_llm dashboard to something like "LLM dashboard" and then having some separation inside the dashboard to distinguish engine metrics from orchestrator metrics.
In Serve, an orchestration metric is something like Ray Serve QPS, while in Ray Data we might be interested in something else.
If some fields differ (e.g. Replica ID), the question is how we can maximally share the engine panels between Serve and Data.
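One possible way to share engine panels is to generate their targets from a single function that takes the dashboard-specific label as a parameter. The sketch below is only an illustration of that idea. `EngineTarget` and `latency_targets` are hypothetical names, the `replica` label used for Serve is an assumption, and none of this is the refactor that actually landed in the PR.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EngineTarget:
    """Hypothetical stand-in for the dashboard's Target class."""
    expr: str
    legend: str


def latency_targets(
    metric: str,
    instance_label: str,
    instance_var: str,
    global_filters: str = "",
    quantiles: Sequence[float] = (0.5, 0.9, 0.95, 0.99),
) -> List[EngineTarget]:
    """Build percentile targets for a histogram metric, grouped by model and by
    whichever label identifies an engine instance in the given dashboard."""
    matchers = [
        'model_name=~"$vllm_model_name"',
        f'{instance_label}=~"${instance_var}"',
    ]
    if global_filters:
        matchers.append(global_filters)
    selector = "{" + ", ".join(matchers) + "}"
    group_by = f"le, model_name, {instance_label}"

    targets = []
    for q in quantiles:
        expr = (
            f"histogram_quantile({q}, sum by({group_by}) "
            f"(rate({metric}_bucket{selector}[$interval])))"
        )
        # %-formatting keeps Grafana's {{...}} legend placeholders literal.
        legend = "P%d - {{model_name}} - {{%s}}" % (round(q * 100), instance_label)
        targets.append(EngineTarget(expr=expr, legend=legend))
    return targets


# Data dashboard: engine instances are identified by WorkerId (as in this PR).
data_targets = latency_targets(
    "ray_vllm_request_queue_time_seconds", "WorkerId", "workerid"
)
# Serve dashboard: the "replica" label and variable are assumed for illustration.
serve_targets = latency_targets(
    "ray_vllm_request_queue_time_seconds", "replica", "replica"
)
```

With this shape, adding a new engine metric means one new call per dashboard rather than a hand-copied panel definition, and only the orchestrator-specific panels stay separate.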
Refactored the dashboard in the latest revision.
a043cf9 to a99f9c4
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
a99f9c4 to 44f53fa
MengjinYan left a comment
Looks good from Core side!
Dashboard-related changes will need the observability team to take a look. cc: @alanwguo
…ay-project#60385) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
…ay-project#60385) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com>





Summary
- Add `log_engine_metrics` config option to `vLLMEngineProcessorConfig` (default=True)
- Integrate vLLM's `RayPrometheusStatLogger` for metrics export
Addresses #58360. Thanks @anindya-saha for the approach.