[Frontend][Metrics] Add `vllm:tool_call_parser_invocations_total` Prometheus metric by yzong-rh · Pull Request #44448 · vllm-project/vllm

yzong-rh · 2026-06-03T20:34:20Z

Purpose

Add a metric for tool parser activity so operators can see how often the parser runs and whether an invocation produced a tool call. This makes it easier to spot tool-calling regressions during model rollouts or runtime changes.

This PR adds the vllm:tool_call_parser_invocations_total counter and records it in DelegatingParser for both non-streaming and streaming tool parser calls, with labels for mode (streaming vs non-streaming), outcome (tool call vs no tool call), and request type (ChatCompletionRequest or ResponsesRequest).

Limitations

- Covers the non-harmony path only. The harmony path does not yet go through DelegatingParser (a refactor to route harmony through DelegatingParser is in progress).
- Only reports how often the parser runs and whether it returned a tool call. It cannot distinguish "no tool invoked" from "parser error": parser exceptions are often caught and handled inside each tool parser, so they never reach the common DelegatingParser interface.
  - Example: both the non-streaming and streaming paths in vllm/tool_parsers/gemma4_tool_parser.py catch any internal exception.
  - To measure the parser error rate, we'd have to modify parsers on a case-by-case basis (or emit a shared failure signal).
- The streaming parser is invoked once per delta, while the non-streaming parser is invoked once per request / choice, so the two counts are not directly comparable.

Test Plan

Serve a non-harmony model with --api-server-count 2.

Test Result

After streaming Chat Completions request:

http_requests_total{handler="/v1/chat/completions",method="POST",status="2xx"} 2.0

# HELP vllm:tool_call_parser_invocations_total Total number of ToolParser invocations. Non-streaming increments once per choice; streaming increments once per delta.
# TYPE vllm:tool_call_parser_invocations_total counter
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 4.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 148.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="other"} 0.0

Note that the parser is invoked multiple times for a single request.

After streaming Responses request:

http_requests_total{handler="/v1/responses",method="POST",status="2xx"} 2.0

# HELP vllm:tool_call_parser_invocations_total Total number of ToolParser invocations. Non-streaming increments once per choice; streaming increments once per delta.
# TYPE vllm:tool_call_parser_invocations_total counter
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 4.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 136.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 1.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="other"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="chat_completions"} 0.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 1.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="other"} 0.0

Note that the parser is invoked multiple times for a single request. Both the streaming and non-streaming paths are hit because the Responses API reparses and sends the entire output after streaming.
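Since streaming counts are per delta and non-streaming counts are per choice, a ratio within each (mode, request_type) pair is easier to interpret than the raw numbers. The following is a minimal sketch using only the standard library, fed with the non-zero samples from the Responses run above. The function name tool_call_share is hypothetical, and the regex handles only simple lines like these; it is not a full parser for the Prometheus text format.

```python
import re
from collections import defaultdict

# Non-zero samples copied from the Responses run above.
SAMPLE = """\
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 4.0
vllm:tool_call_parser_invocations_total{mode="streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 136.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="tool_call",request_type="responses"} 1.0
vllm:tool_call_parser_invocations_total{mode="non_streaming",model_name="Qwen/Qwen3.6-35B-A3B-FP8",outcome="no_tool_call",request_type="responses"} 1.0
"""

# Simplified matchers: metric name, label block, value.
_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{([^}]*)\}\s+(\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def tool_call_share(text, metric="vllm:tool_call_parser_invocations_total"):
    """Return {(mode, request_type): fraction of invocations with a tool call}."""
    counts = defaultdict(lambda: defaultdict(float))
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match is None or match.group(1) != metric:
            continue  # skips HELP/TYPE comments and unrelated metrics
        labels = dict(_LABEL.findall(match.group(2)))
        key = (labels["mode"], labels["request_type"])
        counts[key][labels["outcome"]] += float(match.group(3))

    shares = {}
    for key, by_outcome in counts.items():
        total = sum(by_outcome.values())
        if total > 0:  # avoid dividing by zero for idle label sets
            shares[key] = by_outcome["tool_call"] / total
    return shares


shares = tool_call_share(SAMPLE)
for (mode, request_type), share in sorted(shares.items()):
    print(f"{mode:14s} {request_type:10s} tool_call share = {share:.3f}")
```

The two resulting shares still should not be compared with each other, because one is a fraction of deltas and the other a fraction of choices.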

Made with Cursor

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Yifan Zong <yzong@redhat.com>

yzong-rh · 2026-06-03T20:34:58Z

cc @markmc @robertgshaw2-redhat

Signed-off-by: Yifan Zong <yzong@redhat.com>

markmc · 2026-06-04T13:45:27Z

+    request: object,
+) -> None:
+    """Increment the tool-call parser invocation counter when registered."""
+    if _tool_call_parser_invocations is not None:


This should never be None. An assertion would be appropriate, unless we can just define it at module import.

Tests such as tests/parser/test_parse.py would fail because they use DelegatingParser but never register those metrics.

We could call init_parser_metrics inside record_tool_parser_invocation or within the tests instead.

markmc · 2026-06-04T13:58:46Z

+        _tool_call_parser_invocations.labels(
+            mode=mode,
+            outcome="tool_call" if tools_called else "no_tool_call",
+            request_type=request.__class__.__name__,


It's very important to bear in mind that every unique combination of label values in a Prometheus metric creates a separate time series that Prometheus tracks in memory, writes to disk, and queries independently.

If a metric has labels A, B, C with 10, 1000, and 20 possible values respectively, that's 10 × 1000 × 20 = 200,000 time series - each consuming ~1-2 KB of RAM in Prometheus. The effect is multiplicative, not additive.
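The arithmetic can be checked in a few lines. This sketch uses the hypothetical label counts from the example and the rough 1-2 KB per series estimate quoted above; the variable names are illustrative only.

```python
from math import prod

# Number of distinct values per label, from the example above.
label_value_counts = {"A": 10, "B": 1000, "C": 20}

# Every combination of label values is its own time series.
series = prod(label_value_counts.values())

# Rough memory bounds, using the ~1-2 KB per series estimate.
low_kb = series * 1
high_kb = series * 2

print(f"{series:,} time series")  # 200,000 time series
print(f"roughly {low_kb / 1024:.0f} to {high_kb / 1024:.0f} MiB of RAM")

# Adding one more label with only 5 values multiplies the total again.
with_extra_label = series * 5
print(f"{with_extra_label:,} time series after adding a 5-value label")
```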

The key rule: every label value must come from a small, bounded, known-in-advance set

Now, that is true in this case - 2 modes, tools_called=true|false, and request = (ChatCompletionRequest, ResponsesRequest) ... but it's very easy to imagine a future developer making a change which may not even directly touch the metrics code which causes an explosion of time series

So, some suggestions:

- Let's instantiate all the possible labelled children when we create the counter, and then look up the one we need in record_tool_parser_invocation().
- Define a static list of request types, so that people pause to think about the time series implications before adding more.
- Put a request: ChatCompletionRequest | ResponsesRequest type hint on this function.
- Use a name for the request type that is less likely to change with refactoring, because it needs to be a stable, public API, e.g. request_type=chat_completion.
- It might be good to add a note/comment about how you expect an error rate to be modelled in future.

Also, most of our metrics have model_name and engine_id labels. I'm not sure engine_id makes sense, but model_name does seem sensible to include?

Thanks for the explanation. I applied your suggestions to ensure the labels have low cardinality.

Regarding model_name, AFAIK each vLLM instance can serve only 1 model at a time? Would it make sense to include the model name? ~~It would be hard to register model name labels ahead of time and they may not have low cardinalities.~~

model_name would be useful to allow aggregating (in PromQL) this metric across vLLM instances hosting the same model.

Noted. model_name is now included as a label.
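The outcome of this thread can be sketched roughly as follows, assuming prometheus_client is available. The label values match those in the test output in the PR description; the helper names build_children and record, the class-name mapping, and the stand-in request class are assumptions for illustration and are not taken from the merged code.

```python
from itertools import product

from prometheus_client import CollectorRegistry, Counter

# Static, reviewed label values: extending these is a deliberate decision.
MODES = ("streaming", "non_streaming")
OUTCOMES = ("tool_call", "no_tool_call")
REQUEST_TYPES = ("chat_completions", "responses", "other")

# Hypothetical mapping from request class name to a stable public label.
_REQUEST_TYPE_BY_CLASS = {
    "ChatCompletionRequest": "chat_completions",
    "ResponsesRequest": "responses",
}


def build_children(model_name, registry):
    """Create the counter and pre-instantiate every labelled child."""
    counter = Counter(
        "vllm:tool_call_parser_invocations_total",
        "Total number of ToolParser invocations.",
        labelnames=["mode", "model_name", "outcome", "request_type"],
        registry=registry,
    )
    return {
        (mode, outcome, request_type): counter.labels(
            mode=mode,
            model_name=model_name,
            outcome=outcome,
            request_type=request_type,
        )
        for mode, outcome, request_type in product(MODES, OUTCOMES, REQUEST_TYPES)
    }


def record(children, mode, tools_called, request):
    """Look up a pre-built child; an unknown mode raises KeyError."""
    request_type = _REQUEST_TYPE_BY_CLASS.get(type(request).__name__, "other")
    outcome = "tool_call" if tools_called else "no_tool_call"
    children[(mode, outcome, request_type)].inc()


class ChatCompletionRequest:  # stand-in for the real vLLM request class
    pass


registry = CollectorRegistry()
children = build_children("example-model", registry)
record(children, "streaming", True, ChatCompletionRequest())
```

Because every child is created up front, the number of time series per model is fixed at 2 x 2 x 3 = 12, and unrelated code changes cannot add label values without editing the static tuples.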

Signed-off-by: Yifan Zong <yzong@redhat.com>

mergify · 2026-06-08T19:36:09Z

Hi @yzong-rh, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

…metheus metric (vllm-project#44448) Signed-off-by: Yifan Zong <yzong@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

…metheus metric (vllm-project#44448) Signed-off-by: Yifan Zong <yzong@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: divineearthly <divineearthly@gmail.com>


Add mvp

f93575b

Signed-off-by: Yifan Zong <yzong@redhat.com>

mergify Bot added the frontend label Jun 3, 2026

yzong-rh marked this pull request as ready for review June 3, 2026 20:35

yzong-rh requested review from DarkLight1337, aarnphm, bbrowning, chaunceyjiang, russellb and sfeng33 as code owners June 3, 2026 20:35

robertgshaw2-redhat reviewed Jun 3, 2026

View reviewed changes

Comment thread vllm/envs.py Outdated

Remove env variable

cdb3f42

Signed-off-by: Yifan Zong <yzong@redhat.com>

yzong-rh requested a review from robertgshaw2-redhat June 4, 2026 03:49

chaunceyjiang reviewed Jun 4, 2026

View reviewed changes

Comment thread vllm/parser/abstract_parser.py

Addr comments.

ba2c82a

Signed-off-by: Yifan Zong <yzong@redhat.com>

chaunceyjiang reviewed Jun 4, 2026

View reviewed changes

Comment thread vllm/parser/metrics.py

markmc added this to Prometheus Metrics Jun 4, 2026

github-project-automation Bot moved this to Backlog in Prometheus Metrics Jun 4, 2026

markmc requested changes Jun 4, 2026

View reviewed changes

yzong-rh added 2 commits June 4, 2026 13:48

Addr comments

a1627c0

Signed-off-by: Yifan Zong <yzong@redhat.com>

Fix typo

0853b19

Signed-off-by: Yifan Zong <yzong@redhat.com>

yzong-rh requested a review from markmc June 4, 2026 18:56

markmc moved this from Backlog to Ready in Prometheus Metrics Jun 8, 2026

markmc added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 8, 2026

markmc approved these changes Jun 8, 2026

View reviewed changes

Merge branch 'main' into yzong-rh/tool_parser_metrics

0c0418e

markmc enabled auto-merge (squash) June 8, 2026 12:59

Log model name

c5cc39f

Signed-off-by: Yifan Zong <yzong@redhat.com>

auto-merge was automatically disabled June 8, 2026 19:31
Head branch was pushed to by a user without write access

Merge branch 'main' into yzong-rh/tool_parser_metrics

2688650

markmc approved these changes Jun 10, 2026

View reviewed changes

markmc changed the title ~~[Metrics] Add rudimentary tool call Prometheus metrics.~~ Jun 10, 2026

markmc enabled auto-merge (squash) June 10, 2026 14:14

robertgshaw2-redhat disabled auto-merge June 10, 2026 14:29

robertgshaw2-redhat merged commit 6ec7dcd into vllm-project:main Jun 10, 2026
55 of 56 checks passed

github-project-automation Bot moved this from Ready to Done in Prometheus Metrics Jun 10, 2026

robertgshaw2-redhat merged 8 commits into vllm-project:main from yzong-rh:yzong-rh/tool_parser_metrics
