[Frontend][Metrics] Add vllm:tool_call_parser_invocations_total Prometheus metric#44448
Conversation
Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
| request: object, | ||
| ) -> None: | ||
| """Increment the tool-call parser invocation counter when registered.""" | ||
| if _tool_call_parser_invocations is not None: |
There was a problem hiding this comment.
This should never be None. An assertion would be appropriate, unless we can just fine it at module-import
There was a problem hiding this comment.
Tests such as tests/parser/test_parse.py would fail because they use DelegatingParser but never register those metrics.
We could call init_parser_metrics inside record_tool_parser_invocation or within the tests instead.
| _tool_call_parser_invocations.labels( | ||
| mode=mode, | ||
| outcome="tool_call" if tools_called else "no_tool_call", | ||
| request_type=request.__class__.__name__, |
There was a problem hiding this comment.
It's very important to bear in mind that every unique combination of label values in a Prometheus metric creates a separate time series that Prometheus tracks in memory, writes to disk, and queries independently.
If a metric has labels A, B, C with 10, 1000, and 20 possible values respectively, that's 10 × 1000 × 20 = 200,000 time series - each consuming ~1-2 KB of RAM in Prometheus. The effect is multiplicative, not additive.
The key rule: every label value must come from a small, bounded, known-in-advance set
Now, that is true in this case - 2 modes, tools_called=true|false, and request = (ChatCompletionRequest, ResponsesRequest) ... but it's very easy to imagine a future developer making a change which may not even directly touch the metrics code which causes an explosion of time series
So, some suggestions:
- Let's instantiate all these possible labelled children when we create the counter and then lookup the one we need in
record_tool_parser_invocation() - Define a static list of request types, so we can make people pause to think about the time series implications before adding more
- Put a
request: ChatCompletionRequest | ResponsesRequesttype hint on this function - Use a name for the request type that is less likely to be changed with refactoring, because it needs to be a stable, public API - e.g.
request_type=chat_completion
There was a problem hiding this comment.
Might be good to add a note/comment about how you expect an error rate to be modelled in future
There was a problem hiding this comment.
Also, most of our metrics have model_name and engine_id labels. I'm not sure engine_id makes sense, but model_name does seem sensible to include?
There was a problem hiding this comment.
Thanks for explanation. I applied your suggestions to ensure labels have low cardinalities.
Regarding model_name, AFAIK each vLLM instance can serve only 1 model at a time? Would it make sense to include model name? It would be hard to register model name labels ahead of time and they may not have low cardinalities.
There was a problem hiding this comment.
model_name would be useful to allow aggregating (in PromQL) this metric across vllm instances hosting the name model
There was a problem hiding this comment.
Noted. Now includes model_name in the label.
Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
Head branch was pushed to by a user without write access
|
Hi @yzong-rh, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
6ec7dcd
into
vllm-project:main
…metheus metric (vllm-project#44448) Signed-off-by: Yifan Zong <yzong@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…metheus metric (vllm-project#44448) Signed-off-by: Yifan Zong <yzong@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…metheus metric (vllm-project#44448) Signed-off-by: Yifan Zong <yzong@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…metheus metric (vllm-project#44448) Signed-off-by: Yifan Zong <yzong@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
…metheus metric (vllm-project#44448) Signed-off-by: Yifan Zong <yzong@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
…metheus metric (vllm-project#44448) Signed-off-by: Yifan Zong <yzong@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Purpose
Add a metric for tool parser activity so operators can see how often the parser runs and whether an invocation produced a tool call. This makes it easier to spot tool-calling regressions during model rollouts or runtime changes.
This PR adds the
vllm:tool_call_parser_invocations_totalcounter and records it inDelegatingParserfor both non-streaming and streaming tool parser calls, with labels for mode (streaming vs non-streaming), outcome (tool call vs no tool call), and request type (ChatCompletionRequestorResponsesRequest).Limitations
DelegatingParser. (Working on refactoring harmony to useDelegatingParseras well).DelegatingParserinterface.vllm/tool_parsers/gemma4_tool_parser.pynon-streaming and streaming both catch any internal exception.Test Plan
Serve non-harmony model with
--api-server-count 2Test Result
After streaming Chat Completions request:
Note that the parser is invoked multiple times for a single request.
After streaming Responses request:
Note that the parser is invoked multiple times for a single request. Both streaming and non-streaming paths are hit because Responses API reparses and sends the entire output after streaming.
Made with Cursor
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.