[Frontend][Metrics] Add vllm:tool_call_parser_invocations_total Prometheus metric #44448

Merged
robertgshaw2-redhat merged 8 commits into vllm-project:main from yzong-rh:yzong-rh/tool_parser_metrics
Jun 10, 2026
Conversation

@yzong-rh yzong-rh commented Jun 3, 2026

Contributor

Purpose

Add a metric for tool parser activity so operators can see how often the parser runs and whether an invocation produced a tool call. This makes it easier to spot tool-calling regressions during model rollouts or runtime changes.

This PR adds the vllm:tool_call_parser_invocations_total counter and records it in DelegatingParser for both non-streaming and streaming tool parser calls, with labels for mode (streaming vs non-streaming), outcome (tool call vs no tool call), and request type (ChatCompletionRequest or ResponsesRequest).

Limitations

  • Covers the non-harmony path only. The harmony path does not yet go through DelegatingParser (refactoring harmony to use DelegatingParser is in progress).
  • Only reports how often the parser runs and whether it returned a tool call. It cannot distinguish "no tool invoked" from "parser error": parser exceptions are often caught and handled inside each tool parser, so they never reach the common DelegatingParser interface.
    • Example: vllm/tool_parsers/gemma4_tool_parser.py catches any internal exception in both its non-streaming and streaming paths.
    • To measure the parser error rate, we'd have to modify parsers on a case-by-case basis (or emit a shared failure signal).
  • The streaming parser is invoked once per delta, while the non-streaming parser is invoked once per request / choice, so the two counts are not directly comparable.

Test Plan

Serve a non-harmony model with --api-server-count 2

Test Result

After streaming Chat Completions request:

http_requests_total{handler="/v1/chat/completions",method="POST",status="2xx"} 2.0

# HELP vllm:tool_call_parser_invocations_total Total number of ToolParser invocations. Non-streaming increments once per choice; streaming increments once per delta.
# TYPE vllm:tool_call_parser_invocations_total counter
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 4.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 148.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="other"} 0.0

Note that the parser is invoked multiple times for a single request.
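
Because the streaming path increments once per delta, the raw counts are easiest to read as a ratio rather than as a number of requests. The following is a minimal sketch, assuming the exposition text is available as a string; the helper names `parse_samples` and `tool_call_share` and the regular expressions are illustrative and not part of vLLM.

```python
import re

# Two sample lines copied from the streaming Chat Completions output above.
METRICS_TEXT = '''
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 4.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 148.0
'''

SAMPLE_RE = re.compile(r'^(?P<name>[^{\s]+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (labels_dict, value) pairs for each labelled sample line."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # skip blanks, "# HELP" / "# TYPE" lines, unlabelled samples
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((labels, float(match.group("value"))))
    return samples


def tool_call_share(samples, mode, request_type):
    """Fraction of parser invocations for one mode that produced a tool call."""
    counts = {"tool_call": 0.0, "no_tool_call": 0.0}
    for labels, value in samples:
        if labels.get("mode") == mode and labels.get("request_type") == request_type:
            counts[labels["outcome"]] += value
    total = sum(counts.values())
    return counts["tool_call"] / total if total else 0.0


share = tool_call_share(parse_samples(METRICS_TEXT), "streaming", "chat_completions")
print(f"{share:.3f}")  # 4 / (4 + 148), i.e. 0.026
```

For the streaming path this is a share of deltas, not of requests, so it is only meaningful when compared against the same mode over time.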

After streaming Responses request:

http_requests_total{handler="/v1/responses",method="POST",status="2xx"} 2.0

# HELP vllm:tool_call_parser_invocations_total Total number of ToolParser invocations. Non-streaming increments once per choice; streaming increments once per delta.
# TYPE vllm:tool_call_parser_invocations_total counter
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 4.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 136.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 1.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 1.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="other"} 0.0

Note that the parser is invoked multiple times for a single request. Both the streaming and non-streaming paths are hit because the Responses API reparses and sends the entire output after streaming.

Made with Cursor


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
Signed-off-by: Yifan Zong <yzong@redhat.com>
@mergify mergify Bot added the frontend label Jun 3, 2026
@yzong-rh yzong-rh marked this pull request as ready for review June 3, 2026 20:35
Comment thread vllm/envs.py Outdated
Signed-off-by: Yifan Zong <yzong@redhat.com>
Comment thread vllm/parser/abstract_parser.py
Signed-off-by: Yifan Zong <yzong@redhat.com>
Comment thread vllm/parser/metrics.py
Comment thread vllm/parser/metrics.py
Comment thread vllm/parser/metrics.py
Comment thread vllm/parser/metrics.py Outdated
    request: object,
) -> None:
    """Increment the tool-call parser invocation counter when registered."""
    if _tool_call_parser_invocations is not None:

Member

This should never be None. An assertion would be appropriate, unless we can just define it at module import.

@yzong-rh yzong-rh Jun 4, 2026

Contributor Author

Tests such as tests/parser/test_parse.py would fail because they use DelegatingParser but never register those metrics.

We could call init_parser_metrics inside record_tool_parser_invocation or within the tests instead.

Comment thread vllm/parser/metrics.py Outdated
        _tool_call_parser_invocations.labels(
            mode=mode,
            outcome="tool_call" if tools_called else "no_tool_call",
            request_type=request.__class__.__name__,

Member

It's very important to bear in mind that every unique combination of label values in a Prometheus metric creates a separate time series that Prometheus tracks in memory, writes to disk, and queries independently.

If a metric has labels A, B, C with 10, 1000, and 20 possible values respectively, that's 10 × 1000 × 20 = 200,000 time series - each consuming ~1-2 KB of RAM in Prometheus. The effect is multiplicative, not additive.
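
The multiplication can be checked directly. This is a small sketch using only the standard library; the second half counts the series for the label values that appear in the test output in the PR description, per model_name.

```python
from itertools import product
from math import prod

# Label sizes from the example: labels A, B, C with 10, 1000 and 20 values.
label_sizes = {"A": 10, "B": 1000, "C": 20}
print(prod(label_sizes.values()))  # 200000

# Label values as they appear in the PR's test output (per model_name).
labels = {
    "mode": ["streaming", "non_streaming"],
    "outcome": ["tool_call", "no_tool_call"],
    "request_type": ["chat_completions", "responses", "other"],
}
series = list(product(*labels.values()))
print(len(series))  # 12
```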

The key rule: every label value must come from a small, bounded, known-in-advance set

Now, that is true in this case - 2 modes, tools_called=true|false, and request = (ChatCompletionRequest, ResponsesRequest) ... but it's very easy to imagine a future developer making a change which may not even directly touch the metrics code which causes an explosion of time series

So, some suggestions:

  1. Let's instantiate all the possible labelled children when we create the counter and then look up the one we need in record_tool_parser_invocation()
  2. Define a static list of request types, so we can make people pause to think about the time series implications before adding more
  3. Put a request: ChatCompletionRequest | ResponsesRequest type hint on this function
  4. Use a name for the request type that is less likely to be changed with refactoring, because it needs to be a stable, public API - e.g. request_type=chat_completion

Member

Might be good to add a note/comment about how you expect an error rate to be modelled in future

Member

Also, most of our metrics have model_name and engine_id labels. I'm not sure engine_id makes sense, but model_name does seem sensible to include?

@yzong-rh yzong-rh Jun 4, 2026

Contributor Author

Thanks for the explanation. I applied your suggestions to ensure the labels have low cardinality.

Regarding model_name, AFAIK each vLLM instance can serve only 1 model at a time? Would it make sense to include the model name? It would be hard to register model name labels ahead of time, and they may not have low cardinality.

Member

model_name would be useful to allow aggregating (in PromQL) this metric across vllm instances hosting the same model

Contributor Author

Noted. Now includes model_name in the label.

yzong-rh added 2 commits June 4, 2026 13:48
Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
@yzong-rh yzong-rh requested a review from markmc June 4, 2026 18:56
@markmc markmc moved this from Backlog to Ready in Prometheus Metrics Jun 8, 2026
@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 8, 2026
@markmc markmc enabled auto-merge (squash) June 8, 2026 12:59
Signed-off-by: Yifan Zong <yzong@redhat.com>
auto-merge was automatically disabled June 8, 2026 19:31

Head branch was pushed to by a user without write access

@mergify

mergify Bot commented Jun 8, 2026

Contributor

Hi @yzong-rh, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@markmc markmc changed the title [Metrics] Add rudimentary tool call Prometheus metrics. Jun 10, 2026
@markmc markmc enabled auto-merge (squash) June 10, 2026 14:14
@robertgshaw2-redhat robertgshaw2-redhat merged commit 6ec7dcd into vllm-project:main Jun 10, 2026
55 of 56 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in Prometheus Metrics Jun 10, 2026
wcynb1023 pushed a commit to wcynb1023/vllm that referenced this pull request Jun 11, 2026
…metheus metric (vllm-project#44448)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…metheus metric (vllm-project#44448)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
…metheus metric (vllm-project#44448)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…metheus metric (vllm-project#44448)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…metheus metric (vllm-project#44448)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…metheus metric (vllm-project#44448)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

Labels

frontend ready ONLY add when PR is ready to merge/full CI is needed

4 participants