[Frontend] Add Streaming Parser Engine and new GLM4.7/GLM5.1/GLM5.2 Parser by chaunceyjiang · Pull Request #45915 · vllm-project/vllm

chaunceyjiang · 2026-06-17T08:46:19Z

Purpose

[Frontend] Add Streaming Parser Engine and new GLM4.7/GLM5.1/GLM5.2 Parser

Test Plan

I tested it locally with both enable_thinking=True and enable_thinking=False, as well as with stream=True and stream=False. In all cases, the output was parsed correctly.

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

bbrowning · 2026-06-18T01:51:26Z

This looks directionally correct, although I haven't tested it against a live version of these models yet.

One note - this isn't wired into tests/parser/engine/trace_builder.py so pytest -v tests/parser/engine/test_*replay.py currently fails with:

RuntimeError: Engine adapters in registered_adapters have no test builder in trace_builder._BUILDERS: Glm47MoeParser (config.name='glm47_moe'). Add a builder to _BUILDERS for each new parser.

We can relax this check if desired, but I added it for these initial parsers to ensure that new parsers get wired into the replay harnesses. Those harnesses share many parsing tests that vary the number of tokens per delta to fuzz out bad behavior across token boundaries.

Expand the details section immediately below to see a diff of what I think that would be. I didn't double-check that it follows the GLM 4.7 format exactly, but it does match the docstring in the parser file.

trace_builder.py changes

diff --git a/tests/parser/engine/trace_builder.py b/tests/parser/engine/trace_builder.py
index 128e511e6..c7a722bc4 100644
--- a/tests/parser/engine/trace_builder.py
+++ b/tests/parser/engine/trace_builder.py
@@ -30,6 +30,7 @@ from vllm.entrypoints.openai.chat_completion.protocol import (
 )
 from vllm.parser.engine.registered_adapters import (
     Gemma4Parser,
+    Glm47MoeParser,
     MinimaxM2Parser,
     NemotronV3Parser,
     Qwen3Parser,
@@ -571,6 +572,75 @@ def _build_nemotron_v3(scenario: Scenario, validate: bool = True) -> Sample:
     )
 
 
+# ── GLM-4.7 (XML arg_key/arg_value format, starts in REASONING) ────
+
+_GLM47_MOE_VOCAB: dict[str, int] = {
+    "<think>": 50,
+    "</think>": 51,
+    "<tool_call>": 60,
+    "</tool_call>": 61,
+}
+
+
+def _glm47_moe_arg_value(value: Any) -> str:
+    if isinstance(value, str):
+        return value
+    if isinstance(value, bool):
+        return "true" if value else "false"
+    if isinstance(value, (int, float)):
+        return str(value)
+    return json.dumps(value, ensure_ascii=False)
+
+
+def _glm47_moe_tool_segments(tc: ToolCallSpec) -> list[tuple[str, bool]]:
+    segs: list[tuple[str, bool]] = [("<tool_call>", True)]
+    parts = [tc.name]
+    for key, value in tc.arguments.items():
+        val_str = _glm47_moe_arg_value(value)
+        parts.append(
+            f"<arg_key>{key}</arg_key><arg_value>{val_str}</arg_value>"
+        )
+    segs.append(("".join(parts), False))
+    segs.append(("</tool_call>", True))
+    return segs
+
+
+def _glm47_moe_segments(scenario: Scenario) -> list[tuple[str, bool]]:
+    segs: list[tuple[str, bool]] = []
+    if scenario.reasoning is not None:
+        segs.append((scenario.reasoning, False))
+    if scenario.content is not None or scenario.tool_calls:
+        segs.append(("</think>", True))
+    if scenario.content is not None:
+        segs.append((scenario.content, False))
+    if scenario.tool_calls:
+        for tc in scenario.tool_calls:
+            segs.extend(_glm47_moe_tool_segments(tc))
+    return segs
+
+
+def _build_glm47_moe(scenario: Scenario, validate: bool = True) -> Sample:
+    expected_reasoning: str | None
+    if scenario.reasoning is not None:
+        expected_reasoning = scenario.reasoning.rstrip()
+    else:
+        expected_reasoning = ""
+
+    sample = _make_sample(
+        sample_id=f"glm47_moe-{scenario.id}",
+        description=scenario.description,
+        vocab=_GLM47_MOE_VOCAB,
+        segments=_glm47_moe_segments(scenario),
+        expected_reasoning=expected_reasoning,
+        expected_content=_qwen3_expected_content(scenario),
+        expected_tool_calls=_expected_tc(scenario),
+        tools=_expected_tools(scenario),
+    )
+    if validate:
+        _validate_sample(sample, Glm47MoeParser)
+    return sample
+
+
 # ── Registry and public API ──────────────────────────────────────────
 
 _BUILDERS: dict[str, Any] = {
@@ -578,6 +648,7 @@ _BUILDERS: dict[str, Any] = {
     "gemma4": _build_gemma4,
     "minimax_m2": _build_minimax_m2,
     "nemotron_v3": _build_nemotron_v3,
+    "glm47_moe": _build_glm47_moe,
 }

After adding this parser to trace_builder.py, you can confirm it works with pytest -v tests/parser/engine/ -k glm47 and you'll see it run a few hundred quick unit tests to test different boundary conditions at various chunk sizes.
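
For readers unfamiliar with this style of test, here is a minimal sketch of the chunk-size replay idea: feed the same text to a streaming parser at every delta size and check that the result never changes. `ToyThinkSplitter`, `chunked`, and `replay` are hypothetical names invented for this illustration. They are not the vLLM harness or the engine in this PR, and the sketch splits by characters rather than token ids.

```python
from typing import Iterator


def chunked(text: str, size: int) -> Iterator[str]:
    """Yield consecutive slices of `text`, each at most `size` characters."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


class ToyThinkSplitter:
    """Hypothetical splitter: text before </think> is reasoning, the rest is content."""

    END = "</think>"

    def __init__(self) -> None:
        self.buffer = ""
        self.reasoning = ""
        self.content = ""
        self.in_reasoning = True

    def feed(self, delta: str) -> None:
        if not self.in_reasoning:
            self.content += delta
            return
        self.buffer += delta
        idx = self.buffer.find(self.END)
        if idx != -1:
            self.reasoning += self.buffer[:idx]
            self.content += self.buffer[idx + len(self.END):]
            self.buffer = ""
            self.in_reasoning = False
            return
        # Hold back any suffix that could be the start of END split across deltas.
        keep = 0
        for k in range(min(len(self.END) - 1, len(self.buffer)), 0, -1):
            if self.END.startswith(self.buffer[-k:]):
                keep = k
                break
        emit_upto = len(self.buffer) - keep
        self.reasoning += self.buffer[:emit_upto]
        self.buffer = self.buffer[emit_upto:]

    def finish(self) -> None:
        # At end of stream, anything still held back is plain reasoning text.
        if self.in_reasoning:
            self.reasoning += self.buffer
            self.buffer = ""


def replay(text: str, size: int) -> tuple[str, str]:
    """Stream `text` through a fresh splitter in deltas of `size` characters."""
    parser = ToyThinkSplitter()
    for delta in chunked(text, size):
        parser.feed(delta)
    parser.finish()
    return parser.reasoning, parser.content


SAMPLE = "check the weather</think>It is sunny."
results = {size: replay(SAMPLE, size) for size in range(1, len(SAMPLE) + 1)}
```

Every entry in `results` should be identical. A parser that handles a tag split across two deltas incorrectly would show up as a size whose output differs from the others.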

sfeng33 · 2026-06-18T03:49:00Z

Is it feasible to migrate both GLM 4.5 and GLM 4.7 in this PR? The two tool parsers share 99% of their code.

chaunceyjiang · 2026-06-18T06:11:37Z

-class Glm47MoeModelToolParser(Glm4MoeModelToolParser):
+class Glm47MoeModelToolParser(Glm47MoeParserToolAdapter):  # type: ignore[valid-type, misc]
    supports_required_and_named = False
    structural_tag_model = "glm_4_7"


I can confirm that the glm_4_7 structural tag is also compatible with the glm_4_5 format.

bbrowning

This looks good to me. Your update to trace_builder.py does a better job of representing the streaming tokens for those tests than my quickly assembled version. And from what I saw when comparing the old glm45 and glm47 tool parsers, this looks safe to consolidate under the new version.

Approving this without running myself on a live model, as the tests give good confidence in the parsing behavior here for many cases. With the recent release of GLM 5.2 and its improved MTP, this will substantially clean up that path for streaming usage.

Thanks!

sfeng33 · 2026-06-18T16:13:57Z

✅ Reasoning parser swallows tool tokens (the central bug)

Issues: #42400 (GLM-5.1 Claude Code: stop_reason=tool_use but no tool block), #46040 (GLM-5.2 emits <tool_call> inside <think>, XML leaks into reasoning), framing of #46049. Competing PR: #40659.
Root cause: the old glm45 reasoning parser consumed everything up to </think> as reasoning, so a <tool_call> emitted before </think> was never handed to the tool parser.
Verdict: FIXED, structurally. With the unified grammar, reasoning ends at <tool_call> regardless of </think>. This supersedes the band-aid in #40659 (which special-cased the </think>-already-seen token). High confidence.
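
A minimal sketch of the "reasoning ends at whichever marker comes first" rule, written as a non-streaming function for clarity. `split_reasoning` is a hypothetical helper for illustration, not the engine's grammar. It also does not handle a stray `</think>` that appears after the tool call.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, remainder).

    Reasoning ends at the first </think> or the first <tool_call>,
    whichever appears earlier in the text.
    """
    think_end = text.find("</think>")
    tool_start = text.find("<tool_call>")
    candidates = [i for i in (think_end, tool_start) if i != -1]
    if not candidates:
        # No marker at all: everything is still reasoning.
        return text, ""
    cut = min(candidates)
    reasoning, remainder = text[:cut], text[cut:]
    # Drop the end-of-reasoning tag itself; keep <tool_call> for the tool parser.
    if remainder.startswith("</think>"):
        remainder = remainder[len("</think>"):]
    return reasoning, remainder


tool_inside_reasoning = split_reasoning("plan<tool_call>get_weather</tool_call>")
normal_close = split_reasoning("plan</think>hello")
```

In the first call the tool block is handed on intact even though `</think>` never appeared, which is the case the old parser swallowed.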

✅ Streaming tool name truncated / unstable

Issues: #39757 (run_in_terminal→run_in, get_weather→get). Competing PRs: #40071 (delay prefix names), #41654 (zero-arg names).
Root cause: name streamed as a prefix before the full name region closed; spec-decode/MTP amplified it.
Verdict: FIXED. The engine accumulates the entire TOOL_NAME region (until <arg_key> or </tool_call>) and emits the name once, whole, gated by validate_tool_names (find_tool_name against declared tools). A prefix that isn't a real tool name is never emitted. Supersedes #40071.
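
A minimal sketch of the buffering behavior described in the verdict, assuming the name region closes at `<arg_key>` or `</tool_call>`. `ToolNameBuffer` is a hypothetical class for illustration only. The real engine works on its own state machine and uses `find_tool_name` rather than a plain set lookup.

```python
from typing import Optional


class ToolNameBuffer:
    """Accumulate tool-name text and emit the name once, after the region closes."""

    TERMINATORS = ("<arg_key>", "</tool_call>")

    def __init__(self, declared_tools: set[str]) -> None:
        self.declared = declared_tools
        self.buffer = ""
        self.emitted: Optional[str] = None

    def feed(self, delta: str) -> Optional[str]:
        """Return the full name the first time the name region closes, else None."""
        if self.emitted is not None:
            return None
        self.buffer += delta
        cuts = [self.buffer.find(t) for t in self.TERMINATORS]
        cuts = [c for c in cuts if c != -1]
        if not cuts:
            # Region still open: never emit a prefix such as "run_in".
            return None
        name = self.buffer[:min(cuts)].strip()
        if name not in self.declared:
            # Not a declared tool: emit nothing.
            return None
        self.emitted = name
        return name


buf = ToolNameBuffer({"run_in_terminal", "get_weather"})
emitted = [buf.feed(d) for d in ["run_in", "_terminal", "<arg_key>"]]
```

Here `emitted` stays `None` for the first two deltas and carries the whole name only once `<arg_key>` arrives.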

✅ Zero-argument inline tool calls dropped in streaming

Issue: #44326. Competing PR: #41654.
Root cause: old streaming name-extraction returned None for <tool_call>get_current_time</tool_call> (no \n/<arg_key>), so the call vanished in streaming while non-streaming worked.
Verdict: FIXED. (TOOL_NAME, TOOL_END)→TOOL_CALL_END; _handle_tool_end emits the buffered name with arguments="". The PR adds explicit tests test_zero_arg_inline / test_no_args (streaming). High confidence.
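
The transition named in the verdict can be sketched as a small lookup table. `TOOL_NAME`, `TOOL_END`, and `TOOL_CALL_END` come from the comment above. `TOOL_ARGS`, `ARG_KEY`, and `handle_tool_end` are hypothetical names added so the example is complete, and may not match the engine.

```python
from enum import Enum, auto


class State(Enum):
    TOOL_NAME = auto()
    TOOL_ARGS = auto()       # hypothetical state name
    TOOL_CALL_END = auto()


class Event(Enum):
    ARG_KEY = auto()         # hypothetical event name
    TOOL_END = auto()


TRANSITIONS = {
    (State.TOOL_NAME, Event.ARG_KEY): State.TOOL_ARGS,
    # Zero-argument call: the name region is closed directly by </tool_call>.
    (State.TOOL_NAME, Event.TOOL_END): State.TOOL_CALL_END,
    (State.TOOL_ARGS, Event.TOOL_END): State.TOOL_CALL_END,
}


def handle_tool_end(
    state: State, name_buffer: str, args_buffer: str
) -> tuple[State, dict[str, str]]:
    """Close the current tool call and return the buffered name and arguments."""
    next_state = TRANSITIONS[(state, Event.TOOL_END)]
    call = {"name": name_buffer.strip(), "arguments": args_buffer}
    return next_state, call


state, call = handle_tool_end(State.TOOL_NAME, "get_current_time", "")
```

With an empty argument buffer the call is still emitted, with `arguments` set to the empty string as described above.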

✅ Streaming argument-JSON corruption

Issue: #40195 (Optional[str] → Smithh", arrays → [...]]). Competing PR: #40197.
Root cause: partial renders weren't byte-prefixes of the final render; length-slicing duplicated chars.
Verdict: FIXED. _safe_arg_prefix() only streams up to the last complete top-level value (never the in-flight one), with a hard startswith(prev) invariant; the unstable trailing value is flushed only at tool end. Plus schema-aware _fix_arg_types. This is the same fix #40197 proposed, now built into the engine.
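
A minimal sketch of the prefix invariant, assuming arguments are rendered with `json.dumps` and that key order is stable between renders. `safe_arg_prefix` and `next_delta` here are hypothetical stand-ins written for this illustration. They are not the PR's `_safe_arg_prefix()`.

```python
import json
from typing import Any, Optional


def safe_arg_prefix(args: dict[str, Any], in_flight_key: Optional[str]) -> str:
    """Render only completed top-level pairs, as an unterminated JSON object."""
    completed = {k: v for k, v in args.items() if k != in_flight_key}
    if not completed:
        return "{"
    # Drop the closing brace so later pairs can still be appended.
    return json.dumps(completed, ensure_ascii=False)[:-1]


def next_delta(prev: str, current: str) -> str:
    """Return the newly streamable text, enforcing the startswith(prev) invariant."""
    if not current.startswith(prev):
        raise ValueError("new render does not extend what was already streamed")
    return current[len(prev):]


# Each snapshot: (arguments parsed so far, key whose value is still arriving).
snapshots = [
    ({"name": "Smi"}, "name"),
    ({"name": "Smith", "tags": ["a"]}, "tags"),
    ({"name": "Smith", "tags": ["a", "b"]}, None),
]

prev, deltas = "", []
for args, in_flight in snapshots:
    current = safe_arg_prefix(args, in_flight)
    deltas.append(next_delta(prev, current))
    prev = current
deltas.append("}")  # flushed at tool end
streamed = "".join(deltas)
```

Because the in-flight value is never streamed, a value that changes shape between renders (a growing string or array) cannot produce duplicated characters on the client.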

🟡 MTP + tool-calling malformed args

Issues: #44843 (vanishes when MTP off), args part of #39757.
Verdict: LIKELY FIXED at the parser level (the engine re-derives from accumulated text+token-ids, so it's robust to MTP token-boundary differences) — but lower confidence, because if MTP corrupts the actual generated text (not just token boundaries) no parser can repair it. Needs a real GLM+MTP run to confirm.

✅ Responses API format mismatch

Issue: #45273 (parser part: AttributeError: 'FunctionTool' object has no attribute 'function'). Competing PRs: #45276, #41631.
Verdict: FIXED for the parser path. The engine reads tool metadata via utils.find_tool_name/find_tool_properties, which handle both shapes (tool.name for FunctionTool and tool.function.name). And the engine's adjust_request only sets skip_special_tokens=False — it never injects a Hermes JSON schema, so #41631 (schema injection for Responses named-function) is sidestepped entirely.
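
A minimal sketch of reading a tool name from either shape. The dataclasses below are simplified stand-ins for the two request formats, and `tool_name` is a hypothetical helper. It is not the signature of `utils.find_tool_name`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FunctionDef:
    name: str


@dataclass
class ChatTool:
    """Chat Completions shape: the name lives at tool.function.name."""
    function: FunctionDef


@dataclass
class ResponsesTool:
    """Responses API shape: the name lives directly at tool.name."""
    name: str


def tool_name(tool: Any) -> Optional[str]:
    """Return the tool's name for either shape, or None if neither applies."""
    function = getattr(tool, "function", None)
    if function is not None:
        return getattr(function, "name", None)
    return getattr(tool, "name", None)


chat_name = tool_name(ChatTool(FunctionDef("get_weather")))
responses_name = tool_name(ResponsesTool("get_weather"))
```

Using `getattr` with a default avoids the `AttributeError` quoted in the issue when a `FunctionTool`-style object has no `function` attribute.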
Note: #45273's other failures (request-validation 400/500, empty-content IndexError, historical-args JSONDecodeError) live in the Responses API layer and are out of scope of #45915.

🟡 Serving-layer streaming chunk shape

Issue: #44098 (continuation chunks re-emit id/type/name; last arg packed into the finish_reason chunk).
Verdict: PARTIALLY. The engine's own continuation deltas carry only {index, arguments} (no metadata re-emission), and _flush_engine_parsers replaces the buggy _create_remaining_args_delta path — so bug 1 is structurally avoided. Bug 2 (final arg fragment delivered in the same chunk as finish_reason) is a stream-generator concern that this PR doesn't clearly change. Lower confidence; worth a targeted check.

🔴 Reasoning token counting

Issue / competing PR: #41077.
Verdict: LIKELY NOT FIXED (gap). The engine's count_reasoning_tokens requires seeing THINK_START (depth>0) to count. GLM injects <think> in the prompt, so generated output often has no start token and the count returns 0, which is exactly the #41077 bug. #41077's specific remedy (count everything before the first end-id when no start is present) is not replicated. Low severity (usage accounting), but a real gap: either a regression or a missing fix.
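
A minimal sketch of the remedy attributed to #41077, assuming token ids 50 and 51 for the start and end markers (the values used in the test vocabulary earlier in this thread). The function is a hypothetical illustration, not the engine's `count_reasoning_tokens`. Counting every token when neither marker is present is an assumption made here to match the engine starting in REASONING.

```python
def count_reasoning_tokens(token_ids: list[int], start_id: int, end_id: int) -> int:
    """Count reasoning tokens, tolerating a start marker that was in the prompt."""
    if start_id not in token_ids:
        # Start marker was injected in the prompt: count up to the first end id.
        if end_id in token_ids:
            return token_ids.index(end_id)
        # Assumption: with no markers at all, the output is still reasoning.
        return len(token_ids)

    count, depth = 0, 0
    for tok in token_ids:
        if tok == start_id:
            depth += 1
        elif tok == end_id:
            depth = max(0, depth - 1)
        elif depth > 0:
            count += 1
    return count


THINK_START, THINK_END = 50, 51
prompt_injected = count_reasoning_tokens([1, 2, 3, THINK_END, 4, 5], THINK_START, THINK_END)
explicit_start = count_reasoning_tokens([THINK_START, 1, 2, THINK_END, 3], THINK_START, THINK_END)
```

The first call is the GLM case: no start token in the generated ids, yet three reasoning tokens precede the end marker.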

🔴 Missing classification / GLM-4.5 & SeedOSS regressions

Issue / competing PR: #37044.
Verdict: BEHAVIOR CHANGED, partially out of scope. The new engine starts in REASONING, so tagless output is classified as reasoning (asserted by t[…]). For a reasoning model that always closes its reasoning with </think> this is fine; if a GLM model emits content with no </think> at all, it would be misclassified, which is the opposite of #37044's intent. SeedOSS (the other half of #37044) is untouched and out of scope.

🔴 Tool-result rendering / content-format (input side)

Issue / competing PR: #39630 / #39614.
Verdict: NOT FIXED — out of scope. This is prompt rendering in vllm/renderers/hf.py, which #45915 doesn't touch. Unrelated to output parsing.

Bottom line

PR #45915 is not a point-fix. It is a rewrite that collapses the reasoning+tool split into one grammar, which is exactly the seam that caused […] bugs.

Solidly fixed: A (tool-in-reasoning, the big one), B (name truncation), C (zero-arg streaming), D (arg-JSON corruption), F (Responses Funct[…]). That covers the bulk of open issues #42400, #46040, #44326, #39757, #40195, and the parser slice of #45273, and supersedes competing PRs #40659, #40071, #41654, #40197.

Likely / partial: E (MTP args — parser-level only), G ([Bug]: GLM tool-call streaming final chunks repeat metadata and combine arguments with finish_reason #44098 — metadata fixed, finish-chunk shape uncertain).
Gaps / out of scope: H ([Bugfix] Fix GLM45 reasoning token counting #41077 token counting — probable regression), I ([Bugfix] Fix GLM4 MoE and SeedOSS reasoning parser regressions #37044 SeedOSS + tagless semantics), J (Fix GLM tool results with auto chat template content format #39630 input rendering), K stale).

sfeng33

Thank you for the work!

jinbagi · 2026-06-18T17:24:23Z

Nice work!

frankwang28 · 2026-06-19T12:39:30Z

Hi! I'm wondering if the PR / fix is in the vllm/vllm-openai:glm52 Docker image? Thanks!

gaby · 2026-06-25T00:54:47Z

@frankwang28 It is not, you have to use nightly.

@sfeng33 Even with this fix, we see weird behavior when mtp is enabled for tool calling with output validation.

It did improve this a lot though, compared to the initial release of GLM-5.x support.

sfeng33 · 2026-06-25T01:00:30Z

@gaby thanks for taking a look! Yeah MTP is a known issue across several models.

bbrowning · 2026-06-25T10:07:28Z

> Even with this fix, we see weird behavior when mtp is enabled for tool calling with output validation.

What is the weird behavior you see? Tool call parse or reasoning failures such as content cut off, tags escaping, etc? Errors in requests and vLLM logs about failed FSM advances in the grammar? Or more subtle issues such as the model not calling the right tools or forgetting what tools it has?

I just want to confirm which of these (or something else) mtp is triggering so we can reproduce and fix it.

gaby · 2026-06-25T12:29:44Z

> Even with this fix, we see weird behavior when mtp is enabled for tool calling with output validation.
>
> What is the weird behavior you see? Tool call parse or reasoning failures such as content cut off, tags escaping, etc? Errors in requests and vLLM logs about failed FSM advances in the grammar? Or more subtle issues such as the model not calling the right tools or forgetting what tools it has?
>
> I just want to confirm which of these (or something else) mtp is triggering so we can reproduce and fix it.

@bbrowning I don't have example code to share, but I can describe the behavior. For example, when using Claude Code in plan mode, the model is not able to call the "ExitPlanMode" tool to exit plan mode. It says it called the tool, but it doesn't.

On the vLLM side everything returns HTTP 200. I was able to mitigate the issue by passing --chat-template-content-format string to the engine. Before adding that, around 40% of tool calls were failing.

When using structured format, the model would sometimes return the data in a different format, e.g.:

Expected: {"temperature": 70}
Model sometimes returns (the same JSON wrapped in a Markdown code fence): ```json {"temperature": 70}```

bbrowning · 2026-06-25T14:11:02Z

@gaby If you feel up for testing something, try setting --kv-cache-dtype fp8 or --no-enable-prefix-caching with MTP enabled and try again. I've observed a few models with interaction issues between KV caching and MTP but haven't been able to narrow it down further yet. The way it typically manifests for me in agentic scenarios is that the model stops calling tools, hallucinates tools, outputs things in the wrong format, etc. I think it's due to some kind of missing, corrupt, or stale data being read from the prefix cache, at least in the cases I've observed with Nemotron 3 Super and Qwen 3.6 models.

gaby · 2026-06-25T14:14:04Z

@bbrowning I will give it a try. I'm already using --kv-cache-dtype fp8_e4m3, but I do have prefix caching enabled, which was not in the official recipe for GLM-5.2.

HyunCello · 2026-06-25T17:40:03Z

> Even with this fix, we see weird behavior when mtp is enabled for tool calling with output validation.
>
> What is the weird behavior you see? Tool call parse or reasoning failures such as content cut off, tags escaping, etc? Errors in requests and vLLM logs about failed FSM advances in the grammar? Or more subtle issues such as the model not calling the right tools or forgetting what tools it has?
> I just want to confirm which of these (or something else) mtp is triggering so we can reproduce and fix it.
>
> @bbrowning I don't have an example code to share, but behavior I do. Ex when using claude code in plan mode, the model is not able to call tools to exit plan mode "ExitPlanMode". It would say it called the tool, but doesn't.
>
> On vLLM everything returns HTTP 200, I was able to mitigate the issue by passing --chat-template-content-format string to the engine. Before adding that around 40% of tool calls were failing.
>
> When using structured format, the model would sometimes return the data with a different format, ex:
>
> Expected: {"temperature": 70} Model sometimes returns: ```json {"temperature": 70}```

I'm having the same issue with GLM 5.2 in Claude Code when entering and exiting Plan Mode.

chaunceyjiang · 2026-06-26T09:28:56Z

@gaby

> we see weird behavior when mtp is enabled for tool calling with output validation.

You can use this parameter to resolve the incompatibility between MTP and function calling.
--structured-outputs-config.enable_in_reasoning=True

gaby · 2026-06-26T13:35:19Z

> @gaby we see weird behavior when mtp is enabled for tool calling with output validation.
>
> You can use this parameter to resolve the incompatibility between MTP and function calling. --structured-outputs-config.enable_in_reasoning=True

Thanks, will give it a try. Probably worth updating the official recipe in https://recipes.vllm.ai/

[Frontend] Add Streaming Parser Engine and new GLM4.7/GLM5.1/GLM5.2 P…

d371488

…arser Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang requested review from aarnphm, bbrowning and sfeng33 as code owners June 17, 2026 08:46

mergify Bot added the tool-calling label Jun 17, 2026

github-project-automation Bot added this to Tool Calling Jun 17, 2026

[Frontend] Add Streaming Parser Engine and new GLM4.7/GLM5.1/GLM5.2 P…

da10b7b

…arser Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 18, 2026

chaunceyjiang mentioned this pull request Jun 18, 2026

[Bugfix] GLM tool parser: fix regex and robustness issues #42979

Closed

bbrowning approved these changes Jun 18, 2026

Merge branch 'main' into glm_tool_parser

7503d45

sfeng33 approved these changes Jun 18, 2026

chaunceyjiang merged commit 6c379b9 into vllm-project:main Jun 18, 2026
51 checks passed

github-project-automation Bot moved this to Done in Tool Calling Jun 18, 2026

sfeng33 mentioned this pull request Jun 18, 2026

[Bugfix] Fix GLM47 streaming inline zero-arg tool calls #44327

Closed

he-yufeng mentioned this pull request Jun 18, 2026

[Bugfix] Fix GLM4 MoE and SeedOSS reasoning parser regressions #37044

Closed

2 tasks

Achyuthan-S mentioned this pull request Jun 21, 2026

feat: model-agnostic reasoning content detection to prevent per-model parser proliferation #46049

Open

chaunceyjiang mentioned this pull request Jun 22, 2026

[Bugfix] Split terminal tool-call finish chunks #44099

Closed

chaunceyjiang deleted the glm_tool_parser branch June 26, 2026 09:29

MrZ20 mentioned this pull request Jul 2, 2026

[CI] upgrade vllm to 0619 vllm-project/vllm-ascend#10935

Open

clovisNyu mentioned this pull request Jul 3, 2026

[Bug]: GLM tool_call required + streaming causes JSON repetition #47504

Open

1 task

[Frontend] Add Streaming Parser Engine and new GLM4.7/GLM5.1/GLM5.2 Parser #45915
chaunceyjiang merged 3 commits into vllm-project:main from chaunceyjiang:glm_tool_parser

7 participants