Skip to content

[Bugfix][Parser] Fix U+FFFD leak at reasoning-to-content transition in engine parsers#46159

Merged
sfeng33 merged 2 commits into
vllm-project:mainfrom
bbrowning:reasoning-content-ufffd-leak
Jun 19, 2026
Merged

[Bugfix][Parser] Fix U+FFFD leak at reasoning-to-content transition in engine parsers#46159
sfeng33 merged 2 commits into
vllm-project:mainfrom
bbrowning:reasoning-content-ufffd-leak

Conversation

@bbrowning

Copy link
Copy Markdown
Collaborator

Purpose

Flush the reasoning parser's engine lexer at the reasoning→content transition instead of decoding isolated token IDs, which produces U+FFFD when byte-fallback tokens span delta boundaries.

This fixes a source of unicode replacement characters getting streamed back to clients first identified in #45919 (comment)

Test Plan

Unit Tests

pytest -v \
  tests/parser/engine \
  tests/reasoning/test_gemma4_reasoning_parser.py \
  tests/reasoning/test_glm4_moe_reasoning_parser.py \
  tests/reasoning/test_minimax_m2_reasoning_parser.py \
  tests/reasoning/test_qwen3_reasoning_parser.py \
  tests/tool_parsers/test_gemma4_tool_parser.py \
  tests/tool_parsers/test_glm47_moe_tool_parser.py \
  tests/tool_parsers/test_glm4_moe_tool_parser.py \
  tests/tool_parsers/test_minimax_m2_tool_parser.py \
  tests/tool_parsers/test_qwen3coder_tool_parser.py

Manual Test with curl

Serve a GLM 4.7 or newer model:

vllm serve zai-org/GLM-4.7-Flash \
     --tensor-parallel-size 4 \
     --speculative-config.method mtp \
     --speculative-config.num_speculative_tokens 1 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm-4.7-flash

Send it multiple requests in a loop that are likely to trigger a multi-byte unicode character around the reasoning to content transition. I ran this many times in a loop, dumping the output to a file, and then grepped the file for the unicode replacement characters.

curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "삼성전자 주주총회는 어디에서 열려?"}], "stream": true}'

Test Result

Unit Tests

2384 passed, 6 warnings

Manual Test with curl

Before the fix, I would regularly (maybe 20-40% of the time) see U+FFFD leaked as content around the reasoning to content transition in the output stream.

After the fix, I did not observe it at all.

Flush the reasoning parser's engine lexer at the reasoning→content
transition instead of decoding isolated token IDs, which produces
U+FFFD when byte-fallback tokens span delta boundaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Ben Browning <bbrownin@redhat.com>

@sfeng33 sfeng33 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick fix!

@sfeng33 sfeng33 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 19, 2026
@bbrowning

Copy link
Copy Markdown
Collaborator Author

The tests I added failed at most but not all chunk sizes before the fix. This makes sense, as whether the bug triggers or not depends on the exact sequence of tokens we got in each delta around the reasoning to tool parser transition.

The native model_tokenizer.decode call I had in there before would generate these replacement characters on incomplete multi-byte token sequences. It was an oversight on my part to ever do that versus just flushing anything held back in the reasoning parser around the transition, since the delta_text flowing into our parsers is already handling multi-byte unicode characters properly.

I've been running the reproducer from the original issue in a tight loop for a while now and things look good:

while true; do curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "삼성전자 주주총회는 어디에서 열려?"}], "stream": true}' >> stream.txt; done

And, no replacement characters to be found in that stream.txt I'm dumping the output to after the fix.

@sfeng33

sfeng33 commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

if we ran the parser for this model in "unified" mode (ie if we just return a Glm47MoeParser from ParserManager.get_parser when we detect the supported combo of GLM tool call parser and reasoning parser args instead of falling through to a DelegatingParser)

I’m curious if you’ve explored this path, which seems to be a clean fix as well (after this PR merged of course).

@bbrowning

Copy link
Copy Markdown
Collaborator Author

if we ran the parser for this model in "unified" mode (ie if we just return a Glm47MoeParser from ParserManager.get_parser when we detect the supported combo of GLM tool call parser and reasoning parser args instead of falling through to a DelegatingParser)

I’m curious if you’ve explored this path, which seems to be a clean fix as well (after this PR merged of course).

I did not explore that yet. From looking at the code, running them unified doesn't even run through the problematic code path so it should avoid this issue. But, I haven't done as much real-world testing of that path since we don't wire it up in the code today by matching compatible tool/reasoning parser args to collapse into the unified one. It is covered by the same set of unit tests as our delegating path and is less complex overall than running them as split delegating parsers. But, I didn't want to swap without putting that path through its paces first.

@bbrowning

Copy link
Copy Markdown
Collaborator Author

Ugh these openai server tests look unrelated ImportError: cannot import name 'GenerationConfig' from 'schemathesis' (/usr/local/lib/python3.12/dist-packages/schemathesis/__init__.py). Did you mean: 'GenerationMode'? - that schemathesis library cut a new release 5 hours ago that must have broken some API these tests use.

@sfeng33

sfeng33 commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Ops the security fix from #45675 made the resolver to use schemathesis 4.x.
To unblock this PR, can migrate the tests/entrypoints/openai/test_openai_schema.py to schemathesis 4.x

@sfeng33 sfeng33 enabled auto-merge (squash) June 19, 2026 14:32
@bbrowning

Copy link
Copy Markdown
Collaborator Author

Ops the security fix from #45675 made the resolver to use schemathesis 4.x. To unblock this PR, can migrate the tests/entrypoints/openai/test_openai_schema.py to schemathesis 4.x

Hmm - looking at that, it's not exactly trivial. The mechanical code changes to adapt to the new APIs are one thing, but the new version also tests lots of things in our schema that now fail but passed under schemathesis 3.x.

@Stranger6667

Copy link
Copy Markdown

Ops the security fix from #45675 made the resolver to use schemathesis 4.x. To unblock this PR, can migrate the tests/entrypoints/openai/test_openai_schema.py to schemathesis 4.x

Hmm - looking at that, it's not exactly trivial. The mechanical code changes to adapt to the new APIs are one thing, but the new version also tests lots of things in our schema that now fail but passed under schemathesis 3.x.

Feel free to tag me if you need any support from the Schemathesis side. There is also a migration guide which could help with migration

@sfeng33

sfeng33 commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Posted on slack to get help to force merge

@bbrowning

Copy link
Copy Markdown
Collaborator Author

See #46173 to get the unrelated schema test green on main again.

@bbrowning

Copy link
Copy Markdown
Collaborator Author

Stress-testing a GLM 4.7 model I also caught an edge case where a stop token could leak into message content before this fix but does not after this fix. So, the fix will help more than just U+FFFD characters leaking but instead fixes some broader issues with the GLM models and how the tokenization works. The text passed in to our deltas is already tokenized properly, so reusing that as this PR does simplifies handling all the tokenization edge cases.

@sfeng33 sfeng33 merged commit 859e4d4 into vllm-project:main Jun 19, 2026
50 checks passed
@aoshen02

Copy link
Copy Markdown
Collaborator

nice cc @cjackal

@bbrowning bbrowning deleted the reasoning-content-ufffd-leak branch June 20, 2026 03:39
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…n engine parsers (vllm-project#46159)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…n engine parsers (vllm-project#46159)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…n engine parsers (vllm-project#46159)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…n engine parsers (vllm-project#46159)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
hickeyma added a commit to hickeyma/vllm that referenced this pull request Jul 1, 2026
…rving

Part 1 of 3 for RFC vllm-project#47161. Adds a streaming mode to the derender
endpoints, mirroring the producer's per chunk SSE shape. State is kept
client carried and stateless on the server. Responses are returned as plain
JSON rather than SSE, since the derender can't buffer a whole generation or
hold a connection open.

Detokenization rebuilds the decode window from prior token IDs each
chunk by replaying it through detokenize_incrementally, rather than a
simple rebuild. The multi-byte characters split across chunk boundaries
still decode correctly (avoids regressing vllm-project#46159).

The chat path's reasoning/tool call parsing is stubbed for now (501)
and lands in a follow up PR.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed tool-calling

4 participants