[Bugfix][Parser] Fix U+FFFD leak at reasoning-to-content transition in engine parsers by bbrowning · Pull Request #46159 · vllm-project/vllm

bbrowning · 2026-06-19T13:24:47Z

Purpose

Flush the reasoning parser's engine lexer at the reasoning→content transition instead of decoding isolated token IDs, which produces U+FFFD when byte-fallback tokens span delta boundaries.

This fixes a source of unicode replacement characters getting streamed back to clients first identified in #45919 (comment)

Test Plan

Unit Tests

pytest -v \
  tests/parser/engine \
  tests/reasoning/test_gemma4_reasoning_parser.py \
  tests/reasoning/test_glm4_moe_reasoning_parser.py \
  tests/reasoning/test_minimax_m2_reasoning_parser.py \
  tests/reasoning/test_qwen3_reasoning_parser.py \
  tests/tool_parsers/test_gemma4_tool_parser.py \
  tests/tool_parsers/test_glm47_moe_tool_parser.py \
  tests/tool_parsers/test_glm4_moe_tool_parser.py \
  tests/tool_parsers/test_minimax_m2_tool_parser.py \
  tests/tool_parsers/test_qwen3coder_tool_parser.py

Manual Test with curl

Serve a GLM 4.7 or newer model:

vllm serve zai-org/GLM-4.7-Flash \
     --tensor-parallel-size 4 \
     --speculative-config.method mtp \
     --speculative-config.num_speculative_tokens 1 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm-4.7-flash

Send it multiple requests in a loop that are likely to trigger a multi-byte unicode character around the reasoning to content transition. I ran this many times in a loop, dumping the output to a file, and then grepped the file for the unicode replacement characters.

curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "삼성전자 주주총회는 어디에서 열려?"}], "stream": true}'

Test Result

Unit Tests

2384 passed, 6 warnings

Manual Test with curl

Before the fix, I would regularly (maybe 20-40% of the time) see U+FFFD leaked as content around the reasoning to content transition in the output stream.

After the fix, I did not observe it at all.

Flush the reasoning parser's engine lexer at the reasoning→content transition instead of decoding isolated token IDs, which produces U+FFFD when byte-fallback tokens span delta boundaries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben Browning <bbrownin@redhat.com>

sfeng33

Thank you for the quick fix!

bbrowning · 2026-06-19T13:36:51Z

The tests I added failed at most but not all chunk sizes before the fix. This makes sense, as whether the bug triggers or not depends on the exact sequence of tokens we got in each delta around the reasoning to tool parser transition.

The native model_tokenizer.decode call I had in there before would generate these replacement characters on incomplete multi-byte token sequences. It was an oversight on my part to ever do that versus just flushing anything held back in the reasoning parser around the transition, since the delta_text flowing into our parsers is already handling multi-byte unicode characters properly.

I've been running the reproducer from the original issue in a tight loop for a while now and things look good:

while true; do curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "삼성전자 주주총회는 어디에서 열려?"}], "stream": true}' >> stream.txt; done

And, no replacement characters to be found in that stream.txt I'm dumping the output to after the fix.

sfeng33 · 2026-06-19T13:48:53Z

if we ran the parser for this model in "unified" mode (ie if we just return a Glm47MoeParser from ParserManager.get_parser when we detect the supported combo of GLM tool call parser and reasoning parser args instead of falling through to a DelegatingParser)

I’m curious if you’ve explored this path, which seems to be a clean fix as well (after this PR merged of course).

bbrowning · 2026-06-19T13:56:42Z

if we ran the parser for this model in "unified" mode (ie if we just return a Glm47MoeParser from ParserManager.get_parser when we detect the supported combo of GLM tool call parser and reasoning parser args instead of falling through to a DelegatingParser)

I’m curious if you’ve explored this path, which seems to be a clean fix as well (after this PR merged of course).

I did not explore that yet. From looking at the code, running them unified doesn't even run through the problematic code path so it should avoid this issue. But, I haven't done as much real-world testing of that path since we don't wire it up in the code today by matching compatible tool/reasoning parser args to collapse into the unified one. It is covered by the same set of unit tests as our delegating path and is less complex overall than running them as split delegating parsers. But, I didn't want to swap without putting that path through its paces first.

bbrowning · 2026-06-19T13:57:29Z

Ugh these openai server tests look unrelated ImportError: cannot import name 'GenerationConfig' from 'schemathesis' (/usr/local/lib/python3.12/dist-packages/schemathesis/__init__.py). Did you mean: 'GenerationMode'? - that schemathesis library cut a new release 5 hours ago that must have broken some API these tests use.

sfeng33 · 2026-06-19T14:17:39Z

Ops the security fix from #45675 made the resolver to use schemathesis 4.x.
To unblock this PR, can migrate the tests/entrypoints/openai/test_openai_schema.py to schemathesis 4.x

bbrowning · 2026-06-19T14:40:21Z

Ops the security fix from #45675 made the resolver to use schemathesis 4.x. To unblock this PR, can migrate the tests/entrypoints/openai/test_openai_schema.py to schemathesis 4.x

Hmm - looking at that, it's not exactly trivial. The mechanical code changes to adapt to the new APIs are one thing, but the new version also tests lots of things in our schema that now fail but passed under schemathesis 3.x.

Stranger6667 · 2026-06-19T15:23:15Z

Ops the security fix from #45675 made the resolver to use schemathesis 4.x. To unblock this PR, can migrate the tests/entrypoints/openai/test_openai_schema.py to schemathesis 4.x

Hmm - looking at that, it's not exactly trivial. The mechanical code changes to adapt to the new APIs are one thing, but the new version also tests lots of things in our schema that now fail but passed under schemathesis 3.x.

Feel free to tag me if you need any support from the Schemathesis side. There is also a migration guide which could help with migration

sfeng33 · 2026-06-19T15:46:20Z

Posted on slack to get help to force merge

bbrowning · 2026-06-19T16:25:04Z

See #46173 to get the unrelated schema test green on main again.

bbrowning · 2026-06-19T21:17:06Z

Stress-testing a GLM 4.7 model I also caught an edge case where a stop token could leak into message content before this fix but does not after this fix. So, the fix will help more than just U+FFFD characters leaking but instead fixes some broader issues with the GLM models and how the tokenization works. The text passed in to our deltas is already tokenized properly, so reusing that as this PR does simplifies handling all the tokenization edge cases.

aoshen02 · 2026-06-19T23:08:28Z

nice cc @cjackal

…n engine parsers (vllm-project#46159) Signed-off-by: Ben Browning <bbrownin@redhat.com>

…n engine parsers (vllm-project#46159) Signed-off-by: Ben Browning <bbrownin@redhat.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

…rving Part 1 of 3 for RFC vllm-project#47161. Adds a streaming mode to the derender endpoints, mirroring the producer's per chunk SSE shape. State is kept client carried and stateless on the server. Responses are returned as plain JSON rather than SSE, since the derender can't buffer a whole generation or hold a connection open. Detokenization rebuilds the decode window from prior token IDs each chunk by replaying it through detokenize_incrementally, rather than a simple rebuild. The multi-byte characters split across chunk boundaries still decode correctly (avoids regressing vllm-project#46159). The chat path's reasoning/tool call parsing is stubbed for now (501) and lands in a follow up PR. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>

bbrowning requested review from aarnphm, chaunceyjiang and sfeng33 as code owners June 19, 2026 13:24

mergify Bot added tool-calling bug Something isn't working labels Jun 19, 2026

github-project-automation Bot added this to Tool Calling Jun 19, 2026

sfeng33 approved these changes Jun 19, 2026

View reviewed changes

sfeng33 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 19, 2026

sfeng33 enabled auto-merge (squash) June 19, 2026 14:32

Merge branch 'main' into reasoning-content-ufffd-leak

bc36bbc

bbrowning mentioned this pull request Jun 19, 2026

[Frontend] Add Streaming Parser Engine and new Qwen3 Parser #45413

Merged

sfeng33 merged commit 859e4d4 into vllm-project:main Jun 19, 2026
50 checks passed

github-project-automation Bot moved this to Done in Tool Calling Jun 19, 2026

bbrowning deleted the reasoning-content-ufffd-leak branch June 20, 2026 03:39

fank mentioned this pull request Jun 20, 2026

[Bug]: Streaming parser engine leaks special/structural tokens (BOS/EOS/drop_tokens) in non-streaming responses #46224

Closed

1 task

voipmonitor mentioned this pull request Jun 20, 2026

Avoid streaming tool parser for requests without tools #46183

Closed

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026

[Bugfix][Parser] Fix U+FFFD leak at reasoning-to-content transition i…

efce875

…n engine parsers (vllm-project#46159) Signed-off-by: Ben Browning <bbrownin@redhat.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Bugfix][Parser] Fix U+FFFD leak at reasoning-to-content transition i…

df87d59

…n engine parsers (vllm-project#46159) Signed-off-by: Ben Browning <bbrownin@redhat.com>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Bugfix][Parser] Fix U+FFFD leak at reasoning-to-content transition i…

5386d15

…n engine parsers (vllm-project#46159) Signed-off-by: Ben Browning <bbrownin@redhat.com>

This was referenced Jun 30, 2026

[RFC]: Derender Endpoints to Provide Detokenization for Disaggregated Serving #42729

Open

[RFC]: Streaming Derender for Disaggregated Serving #47161

Open

[Frontend] Add detokenization streaming derender for disaggregated serving #47301

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bugfix][Parser] Fix U+FFFD leak at reasoning-to-content transition in engine parsers#46159

[Bugfix][Parser] Fix U+FFFD leak at reasoning-to-content transition in engine parsers#46159
sfeng33 merged 2 commits into
vllm-project:mainfrom
bbrowning:reasoning-content-ufffd-leak

bbrowning commented Jun 19, 2026

sfeng33 left a comment

bbrowning commented Jun 19, 2026

sfeng33 commented Jun 19, 2026

bbrowning commented Jun 19, 2026

bbrowning commented Jun 19, 2026

sfeng33 commented Jun 19, 2026

bbrowning commented Jun 19, 2026

Stranger6667 commented Jun 19, 2026

sfeng33 commented Jun 19, 2026

bbrowning commented Jun 19, 2026

bbrowning commented Jun 19, 2026

Uh oh!

aoshen02 commented Jun 19, 2026

Labels

4 participants

Uh oh!

Uh oh!

Conversation

bbrowning commented Jun 19, 2026

Purpose

Test Plan

Unit Tests

Manual Test with curl

Test Result

Unit Tests

Manual Test with curl

sfeng33 left a comment

Choose a reason for hiding this comment

bbrowning commented Jun 19, 2026

sfeng33 commented Jun 19, 2026

bbrowning commented Jun 19, 2026

bbrowning commented Jun 19, 2026

sfeng33 commented Jun 19, 2026

bbrowning commented Jun 19, 2026

Stranger6667 commented Jun 19, 2026

sfeng33 commented Jun 19, 2026

bbrowning commented Jun 19, 2026

bbrowning commented Jun 19, 2026

Uh oh!

aoshen02 commented Jun 19, 2026

Labels

4 participants