[Frontend] Add Streaming Parser Engine and new GLM4.7/GLM5.1/GLM5.2 Parser#45915

Merged
chaunceyjiang merged 3 commits into
vllm-project:mainfrom
chaunceyjiang:glm_tool_parser
Jun 18, 2026

Conversation

@chaunceyjiang

@chaunceyjiang chaunceyjiang commented Jun 17, 2026

Collaborator

Purpose

[Frontend] Add Streaming Parser Engine and new GLM4.7/GLM5.1/GLM5.2 Parser

Test Plan

I tested it locally with both enable_thinking=True and enable_thinking=False, as well as with stream=True and stream=False. In all cases, the output was parsed correctly.

Test Result


…arser

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@bbrowning

Collaborator

This looks directionally correct, although I haven't tested it against a live version of these models yet.

One note - this isn't wired into tests/parser/engine/trace_builder.py so pytest -v tests/parser/engine/test_*replay.py currently fails with:

RuntimeError: Engine adapters in registered_adapters have no test builder in trace_builder._BUILDERS: Glm47MoeParser (config.name='glm47_moe'). Add a builder to _BUILDERS for each new parser.

We can relax this if desired, but at least for all these initial ones I added that to ensure that we wire new parsers into the replay harnesses which have a lot of shared parsing tests at various token sizes per delta to fuzz out any bad behavior across token boundaries.

Expand the details section immediately below to see a diff of what I think that would be. I didn't double-check that it exactly follows the GLM 4.7 format, though it does look like the format described in the docstring in the parser file.

trace_builder.py changes
diff --git a/tests/parser/engine/trace_builder.py b/tests/parser/engine/trace_builder.py
index 128e511e6..c7a722bc4 100644
--- a/tests/parser/engine/trace_builder.py
+++ b/tests/parser/engine/trace_builder.py
@@ -30,6 +30,7 @@ from vllm.entrypoints.openai.chat_completion.protocol import (
 )
 from vllm.parser.engine.registered_adapters import (
     Gemma4Parser,
+    Glm47MoeParser,
     MinimaxM2Parser,
     NemotronV3Parser,
     Qwen3Parser,
@@ -571,6 +572,75 @@ def _build_nemotron_v3(scenario: Scenario, validate: bool = True) -> Sample:
     )
 
 
+# ── GLM-4.7 (XML arg_key/arg_value format, starts in REASONING) ────
+
+_GLM47_MOE_VOCAB: dict[str, int] = {
+    "<think>": 50,
+    "</think>": 51,
+    "<tool_call>": 60,
+    "</tool_call>": 61,
+}
+
+
+def _glm47_moe_arg_value(value: Any) -> str:
+    if isinstance(value, str):
+        return value
+    if isinstance(value, bool):
+        return "true" if value else "false"
+    if isinstance(value, (int, float)):
+        return str(value)
+    return json.dumps(value, ensure_ascii=False)
+
+
+def _glm47_moe_tool_segments(tc: ToolCallSpec) -> list[tuple[str, bool]]:
+    segs: list[tuple[str, bool]] = [("<tool_call>", True)]
+    parts = [tc.name]
+    for key, value in tc.arguments.items():
+        val_str = _glm47_moe_arg_value(value)
+        parts.append(
+            f"<arg_key>{key}</arg_key><arg_value>{val_str}</arg_value>"
+        )
+    segs.append(("".join(parts), False))
+    segs.append(("</tool_call>", True))
+    return segs
+
+
+def _glm47_moe_segments(scenario: Scenario) -> list[tuple[str, bool]]:
+    segs: list[tuple[str, bool]] = []
+    if scenario.reasoning is not None:
+        segs.append((scenario.reasoning, False))
+    if scenario.content is not None or scenario.tool_calls:
+        segs.append(("</think>", True))
+    if scenario.content is not None:
+        segs.append((scenario.content, False))
+    if scenario.tool_calls:
+        for tc in scenario.tool_calls:
+            segs.extend(_glm47_moe_tool_segments(tc))
+    return segs
+
+
+def _build_glm47_moe(scenario: Scenario, validate: bool = True) -> Sample:
+    expected_reasoning: str | None
+    if scenario.reasoning is not None:
+        expected_reasoning = scenario.reasoning.rstrip()
+    else:
+        expected_reasoning = ""
+
+    sample = _make_sample(
+        sample_id=f"glm47_moe-{scenario.id}",
+        description=scenario.description,
+        vocab=_GLM47_MOE_VOCAB,
+        segments=_glm47_moe_segments(scenario),
+        expected_reasoning=expected_reasoning,
+        expected_content=_qwen3_expected_content(scenario),
+        expected_tool_calls=_expected_tc(scenario),
+        tools=_expected_tools(scenario),
+    )
+    if validate:
+        _validate_sample(sample, Glm47MoeParser)
+    return sample
+
+
 # ── Registry and public API ──────────────────────────────────────────
 
 _BUILDERS: dict[str, Any] = {
@@ -578,6 +648,7 @@ _BUILDERS: dict[str, Any] = {
     "gemma4": _build_gemma4,
     "minimax_m2": _build_minimax_m2,
     "nemotron_v3": _build_nemotron_v3,
+    "glm47_moe": _build_glm47_moe,
 }

After adding this parser to trace_builder.py, you can confirm it works with pytest -v tests/parser/engine/ -k glm47 and you'll see it run a few hundred quick unit tests to test different boundary conditions at various chunk sizes.
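The chunk-size fuzzing described above can be illustrated with a minimal sketch. `ToyTagSplitter`, `chunked`, and `replay` are hypothetical names made up for this illustration; they are not part of vLLM or of `trace_builder.py`. The sketch assumes only that a streaming parser must produce the same result no matter where delta boundaries fall, including in the middle of a tag.

```python
from __future__ import annotations

from typing import Iterator


def chunked(text: str, size: int) -> Iterator[str]:
    """Yield consecutive slices of `text`, each at most `size` characters."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


class ToyTagSplitter:
    """Hypothetical streaming parser: splits text at the first '</think>'."""

    END = "</think>"

    def __init__(self) -> None:
        self._buf = ""
        self._in_reasoning = True
        self.reasoning = ""
        self.content = ""

    def feed(self, delta: str) -> None:
        if not self._in_reasoning:
            self.content += delta
            return
        self._buf += delta
        idx = self._buf.find(self.END)
        if idx != -1:
            # Full end tag seen: everything before it is reasoning.
            self.reasoning += self._buf[:idx]
            self.content += self._buf[idx + len(self.END):]
            self._buf = ""
            self._in_reasoning = False
            return
        # Hold back the longest tail that could still grow into the end tag.
        keep = 0
        for k in range(min(len(self.END) - 1, len(self._buf)), 0, -1):
            if self.END.startswith(self._buf[-k:]):
                keep = k
                break
        cut = len(self._buf) - keep
        self.reasoning += self._buf[:cut]
        self._buf = self._buf[cut:]

    def finish(self) -> None:
        # A held-back tail that never completed the tag is plain reasoning.
        if self._in_reasoning:
            self.reasoning += self._buf
            self._buf = ""


def replay(text: str, size: int) -> tuple[str, str]:
    """Feed `text` to a fresh parser in chunks of `size` characters."""
    parser = ToyTagSplitter()
    for delta in chunked(text, size):
        parser.feed(delta)
    parser.finish()
    return parser.reasoning, parser.content


SAMPLE = "check the weather</think>It is sunny."
# Replay the same sample at every chunk size; all runs should agree.
results = {replay(SAMPLE, size) for size in range(1, len(SAMPLE) + 1)}
```

If the parser is boundary-safe, `results` collapses to a single `(reasoning, content)` pair regardless of chunk size.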

@sfeng33

sfeng33 commented Jun 18, 2026

Collaborator

Is it feasible to migrate both GLM 4.5 and GLM 4.7 in this PR? The two tool parsers share 99% of the code.

…arser

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang chaunceyjiang added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 18, 2026
class Glm47MoeModelToolParser(Glm4MoeModelToolParser):
class Glm47MoeModelToolParser(Glm47MoeParserToolAdapter): # type: ignore[valid-type, misc]
supports_required_and_named = False
structural_tag_model = "glm_4_7"

Collaborator Author

I can confirm that the glm_4_7 structural tag is also compatible with the glm_4_5 format.

@bbrowning bbrowning left a comment

Collaborator

This looks good to me. Your update to trace_builder.py did a better job of representing the streaming tokens properly for those tests than my quickly put together version. And from what I found digging into the differences between the old glm45 and glm47 tool parsers, this looks safe to consolidate under the new version.

Approving this without running myself on a live model, as the tests give good confidence in the parsing behavior here for many cases. With the recent release of GLM 5.2 and its improved MTP, this will substantially clean up that path for streaming usage.

Thanks!

@sfeng33

sfeng33 commented Jun 18, 2026

Collaborator

✅ Reasoning parser swallows tool tokens (the central bug)

Issues: #42400 (GLM-5.1 Claude Code: stop_reason=tool_use but no tool block), #46040 (GLM-5.2 emits <tool_call> inside <think>, XML leaks into reasoning), framing of #46049. Competing PR: #40659.
Root cause: the old glm45 reasoning parser consumed everything up to </think> as reasoning, so a <tool_call> emitted before </think> was never handed to the tool parser.
Verdict: FIXED, structurally. With the unified grammar, reasoning ends at <tool_call> regardless of </think>. This supersedes the band-aid in #40659 (which special-cased the […]-already-seen token). High confidence.
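The unified-grammar idea can be sketched in a few lines. This is a non-streaming illustration under stated assumptions, not the engine's code: `split_output` is a hypothetical helper, output is assumed to start in reasoning, and anything after the first tool call other than tool-call bodies is ignored for simplicity.

```python
from __future__ import annotations

import re

THINK_END = "</think>"
TOOL_START = "<tool_call>"
TOOL_END = "</tool_call>"
_TOOL_RE = re.compile(
    re.escape(TOOL_START) + r"(.*?)" + re.escape(TOOL_END), re.DOTALL
)


def split_output(text: str) -> tuple[str, str, list[str]]:
    """Return (reasoning, content, raw tool-call bodies).

    Reasoning ends at whichever comes first: '</think>' or '<tool_call>'.
    """
    tool_bodies = _TOOL_RE.findall(text)
    first_tool = text.find(TOOL_START)
    # Only the text before the first tool call can be reasoning or content.
    head = text if first_tool == -1 else text[:first_tool]
    # partition() leaves content empty when '</think>' never appears in head.
    reasoning, _, content = head.partition(THINK_END)
    return reasoning, content, tool_bodies


# Tool call emitted before </think>: it is still handed to tool parsing.
early = split_output("need weather<tool_call>get_weather</tool_call></think>")
# Conventional order: reasoning, </think>, content, then the tool call.
late = split_output("plan</think>Calling now.<tool_call>get_time</tool_call>")
```

In the `early` case the tool-call body is extracted rather than swallowed into reasoning, which is the behavior the verdict above describes.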

✅ Streaming tool name truncated / unstable

Issues: #39757 (run_in_terminal→run_in, get_weather→get). Competing PRs: #40071 (delay prefix names), #41654 (zero-arg names).
Root cause: name streamed as a prefix before the full name region closed; spec-decode/MTP amplified it.
Verdict: FIXED. The engine accumulates the entire TOOL_NAME region (until <arg_key> or </tool_call>) and emits the name once, whole, gated by validate_tool_names (find_tool_name against declared tools). A prefix that isn't a real tool name is never emitted. Supersedes #40071.
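A minimal sketch of the buffer-then-emit approach described above. `ToolNameBuffer` is a hypothetical class for illustration only; the real engine's region handling and its `validate_tool_names` / `find_tool_name` helpers are not reproduced here.

```python
from __future__ import annotations

from typing import Optional


class ToolNameBuffer:
    """Buffer streamed name fragments; emit the name once, whole, if valid."""

    # The name region ends at the first argument key or at the call's end.
    TERMINATORS = ("<arg_key>", "</tool_call>")

    def __init__(self, declared_tools: set[str]) -> None:
        self._declared = declared_tools
        self._buf = ""
        self._done = False

    def feed(self, delta: str) -> Optional[str]:
        """Return the full tool name when its region closes, else None."""
        if self._done:
            return None
        self._buf += delta
        positions = [self._buf.find(t) for t in self.TERMINATORS]
        cut = min((p for p in positions if p != -1), default=-1)
        if cut == -1:
            return None  # region still open: never emit a prefix
        self._done = True
        name = self._buf[:cut].strip()
        # Only emit names that match a declared tool.
        return name if name in self._declared else None


buf = ToolNameBuffer({"run_in_terminal"})
emitted = [buf.feed(d) for d in ["run_in", "_terminal", "<arg_", "key>"]]
```

Because nothing is emitted until a terminator appears in the accumulated buffer, a prefix such as `run_in` can never reach the client, even when the terminator itself is split across deltas.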

✅ Zero-argument inline tool calls dropped in streaming

Issue: #44326. Competing PR: #41654.
Root cause: old streaming name-extraction returned None for <tool_call>get_current_time</tool_call> (no \n/<arg_key>), so the call vanished in streaming while non-streaming worked.
Verdict: FIXED. (TOOL_NAME, TOOL_END)→TOOL_CALL_END; _handle_tool_end emits the buffered name with arguments="". The PR adds explicit tests test_zero_arg_inline / test_no_args (streaming). High confidence.
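To make the zero-argument case concrete, here is a small non-streaming sketch. `parse_tool_body` is a hypothetical helper, not the engine's `_handle_tool_end`; it returns an empty dict for a call with no arguments, whereas the comment above says the engine emits `arguments=""`. Argument values are kept as raw strings with no schema-aware type fixing.

```python
from __future__ import annotations

import re

_ARG_PAIR = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)


def parse_tool_body(body: str) -> tuple[str, dict[str, str]]:
    """Split the text between <tool_call> and </tool_call> into (name, args)."""
    first_key = body.find("<arg_key>")
    # With no <arg_key> at all, the whole body is the tool name.
    name = (body if first_key == -1 else body[:first_key]).strip()
    args = {key.strip(): value for key, value in _ARG_PAIR.findall(body)}
    return name, args


zero_arg = parse_tool_body("get_current_time")
with_args = parse_tool_body(
    "get_weather\n<arg_key>city</arg_key><arg_value>Paris</arg_value>"
)
```

The key point is that the absence of `\n` or `<arg_key>` is treated as "name only" instead of "no name found".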

✅ Streaming argument-JSON corruption

Issue: #40195 (Optional[str] → Smithh", arrays → [...]]). Competing PR: #40197.
Root cause: partial renders weren't byte-prefixes of the final render; length-slicing duplicated chars.
Verdict: FIXED. _safe_arg_prefix() only streams up to the last complete top-level value (never the in-flight one), with a hard startswith(prev) invariant; the unstable trailing value is flushed only at tool end. Plus schema-aware _fix_arg_types. This is the same fix #40197 proposed, now built into the engine.
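The prefix invariant can be sketched as follows. This is an illustration of the idea only, not the engine's `_safe_arg_prefix()`; `render_completed` and `ArgStreamer` are hypothetical names, and the caller is assumed to pass only key/value pairs whose values are already complete.

```python
from __future__ import annotations

import json


def render_completed(pairs: list[tuple[str, object]], final: bool = False) -> str:
    """Render completed pairs as a JSON object, closed only when final."""
    body = ", ".join(
        f"{json.dumps(key)}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in pairs
    )
    return "{" + body + ("}" if final else "")


class ArgStreamer:
    """Emit argument deltas such that each render extends the previous one."""

    def __init__(self) -> None:
        self._sent = ""

    def delta(self, completed: list[tuple[str, object]], final: bool = False) -> str:
        rendered = render_completed(completed, final)
        # Hard invariant: never emit text that rewrites what was already sent.
        if not rendered.startswith(self._sent):
            raise ValueError("render is not an extension of the streamed prefix")
        out = rendered[len(self._sent):]
        self._sent = rendered
        return out


streamer = ArgStreamer()
pairs: list[tuple[str, object]] = [("name", "Smith")]
d1 = streamer.delta(pairs)
pairs = pairs + [("tags", ["a", "b"])]
d2 = streamer.delta(pairs)
d3 = streamer.delta(pairs, final=True)
full = d1 + d2 + d3
```

Since the in-flight value is withheld and the closing brace is sent only at the end, concatenating the deltas yields valid JSON with no duplicated characters.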

🟡 MTP + tool-calling malformed args

Issues: #44843 (vanishes when MTP off), args part of #39757.
Verdict: LIKELY FIXED at the parser level (the engine re-derives from accumulated text+token-ids, so it's robust to MTP token-boundary differences) — but lower confidence, because if MTP corrupts the actual generated text (not just token boundaries) no parser can repair it. Needs a real GLM+MTP run to confirm.

✅ Responses API format mismatch

Issue: #45273 (parser part: AttributeError: 'FunctionTool' object has no attribute 'function'). Competing PRs: #45276, #41631.
Verdict: FIXED for the parser path. The engine reads tool metadata via utils.find_tool_name/find_tool_properties, which handle both shapes (tool.name for FunctionTool and tool.function.name). And the engine's adjust_request only sets skip_special_tokens=False — it never injects a Hermes JSON schema, so #41631 (schema injection for Responses named-function) is sidestepped entirely.
Note: #45273's other failures (request-validation 400/500, empty-content IndexError, historical-args JSONDecodeError) live in the Responses API layer and are out of scope of #45915.
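The two tool shapes mentioned above can be handled with a small accessor. The sketch below is not vLLM's `utils.find_tool_name`; `FlatTool`, `FunctionSpec`, `NestedTool`, and `tool_name` are hypothetical stand-ins that only mirror the attribute layout described in the comment (`tool.name` versus `tool.function.name`).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FlatTool:
    """Stand-in for a Responses-style tool: the name sits on the tool."""
    name: str


@dataclass
class FunctionSpec:
    name: str


@dataclass
class NestedTool:
    """Stand-in for a Chat Completions-style tool: name under .function."""
    function: FunctionSpec


def tool_name(tool: Any) -> Optional[str]:
    """Read the tool name from either shape without raising AttributeError."""
    function = getattr(tool, "function", None)
    if function is not None:
        return getattr(function, "name", None)
    return getattr(tool, "name", None)


flat = tool_name(FlatTool("get_weather"))
nested = tool_name(NestedTool(FunctionSpec("get_weather")))
```

Using `getattr` with a default avoids the `'FunctionTool' object has no attribute 'function'` failure quoted in the issue.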

🟡 Serving-layer streaming chunk shape

Issue: #44098 (continuation chunks re-emit id/type/name; last arg packed into the finish_reason chunk).
Verdict: PARTIALLY. The engine's own continuation deltas carry only {index, arguments} (no metadata re-emission), and _flush_engine_parsers replaces the buggy _create_remaining_args_delta path — so bug 1 is structurally avoided. Bug 2 (final arg fragment delivered in the same chunk as finish_reason) is a stream-generator concern that this PR doesn't clearly change. Lower confidence; worth a targeted check.

🔴 Reasoning token counting

Issue / competing PR: #41077.
Verdict: LIKELY NOT FIXED (gap). The engine's count_reasoning_tokens requires seeing THINK_START (depth>0) to count. GLM injects <think> in the prompt, so generated output often has no start token and the count returns 0, which is exactly the #41077 bug. #41077's specific remedy (count everything before the first end-id when no start is present) is not replicated. Low severity (usage accounting), but a real regression-or-no-fix.
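The remedy attributed to #41077 can be sketched like this. It is an illustration of the counting rule as described in the comment, not the engine's `count_reasoning_tokens` and not the code from #41077. The token IDs 50 and 51 are borrowed from the test vocabulary in the diff earlier in this thread, and treating output with no end token as all reasoning is an assumption.

```python
from __future__ import annotations

THINK_START_ID = 50  # "<think>" in the test vocab shown in the diff
THINK_END_ID = 51    # "</think>" in the same vocab


def count_reasoning_tokens(
    token_ids: list[int],
    start_id: int = THINK_START_ID,
    end_id: int = THINK_END_ID,
) -> int:
    """Count reasoning tokens, tolerating a start token that lives in the prompt."""
    if start_id in token_ids:
        # Explicit start: count from just after it.
        begin = token_ids.index(start_id) + 1
    else:
        # No start in the output: assume the prompt already opened reasoning.
        begin = 0
    try:
        end = token_ids.index(end_id, begin)
    except ValueError:
        # Assumption: with no end token, all remaining tokens are reasoning.
        end = len(token_ids)
    return end - begin


prompt_injected = count_reasoning_tokens([7, 8, 9, THINK_END_ID, 10, 11])
explicit_start = count_reasoning_tokens([THINK_START_ID, 7, 8, THINK_END_ID, 10])
```

A counter that insists on seeing the start token would return 0 for the `prompt_injected` case.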

🔴 Missing classification / GLM-4.5 & SeedOSS regressions

Issue / competing PR: #37044.
Verdict: BEHAVIOR CHANGED, partially out of scope. New engine starts in REASONING, so tagless output is classified as reasoning (asserted by t…). For a reasoning model that always closes </think> this is fine; if a GLM model emits content with no </think> at all, it would be misclassified, which is the opposite of #37044's intent. SeedOSS (the other half of #37044) is untouched and out of scope.

🔴 Tool-result rendering / content-format (input side)

Issue / competing PR: #39630 / #39614.
Verdict: NOT FIXED — out of scope. This is prompt rendering in vllm/renderers/hf.py, which #45915 doesn't touch. Unrelated to output parsing.


Bottom line

PR #45915 is not a point-fix; it's a rewrite that collapses the reasoning+tool split into one grammar, which is exactly the seam that caused […] bugs.
- Solidly fixed: A (tool-in-reasoning, the big one), B (name truncation), C (zero-arg streaming), D (arg-JSON corruption), F (Responses Funct[…]

That covers the bulk of open issues #42400, #46040, #44326, #39757, #40195, and the parser slice of #45273, and supersedes competing PRs #40659, #40071, #41654, #40197.

@sfeng33 sfeng33 left a comment

Collaborator

Thank you for the work!

@chaunceyjiang chaunceyjiang merged commit 6c379b9 into vllm-project:main Jun 18, 2026
51 checks passed
@jinbagi

jinbagi commented Jun 18, 2026


Nice work!

@frankwang28

Contributor

Hi! I'm wondering if the PR / fix is in the vllm/vllm-openai:glm52 Docker image? Thanks!

divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…arser (vllm-project#45915)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…arser (vllm-project#45915)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…arser (vllm-project#45915)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…arser (vllm-project#45915)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
@gaby

gaby commented Jun 25, 2026


@frankwang28 It is not, you have to use nightly.

@sfeng33 Even with this fix, we see weird behavior when mtp is enabled for tool calling with output validation.

It did improve this a lot though, compared to the initial release of GLM-5.x support.

@sfeng33

sfeng33 commented Jun 25, 2026

Collaborator

@gaby thanks for taking a look! Yeah MTP is a known issue across several models.

@bbrowning

Collaborator

Even with this fix, we see weird behavior when mtp is enabled for tool calling with output validation.

What is the weird behavior you see? Tool call parse or reasoning failures such as content cut off, tags escaping, etc? Errors in requests and vLLM logs about failed FSM advances in the grammar? Or more subtle issues such as the model not calling the right tools or forgetting what tools it has?

I just want to confirm which of these (or something else) mtp is triggering so we can reproduce and fix it.

@gaby

gaby commented Jun 25, 2026


Even with this fix, we see weird behavior when mtp is enabled for tool calling with output validation.

What is the weird behavior you see? Tool call parse or reasoning failures such as content cut off, tags escaping, etc? Errors in requests and vLLM logs about failed FSM advances in the grammar? Or more subtle issues such as the model not calling the right tools or forgetting what tools it has?

I just want to confirm which of these (or something else) mtp is triggering so we can reproduce and fix it.

@bbrowning I don't have example code to share, but I can describe the behavior. For example, when using Claude Code in plan mode, the model is not able to call the tool that exits plan mode, "ExitPlanMode". It says it called the tool, but it didn't.

On vLLM everything returns HTTP 200. I was able to mitigate the issue by passing --chat-template-content-format string to the engine. Before adding that, around 40% of tool calls were failing.

When using structured format, the model would sometimes return the data with a different format, ex:

Expected: {"temperature": 70}
Model sometimes returns the same JSON wrapped in a Markdown code fence tagged json, instead of the bare object.

@bbrowning

Collaborator

@gaby If you feel up for testing something, try setting --kv-cache-dtype fp8 or --no-enable-prefix-caching with MTP enabled and try again. I've observed a few models with interaction issues between KV caching and MTP but haven't been able to narrow it down further yet. The way it typically manifests for me in agentic scenarios is that the model stops calling tools, hallucinates tools, outputs things in the wrong format, etc. I think it's due to some kind of missing, corrupt, or stale data being read from the prefix cache, at least in the cases I've observed with Nemotron 3 Super and Qwen 3.6 models.

@gaby

gaby commented Jun 25, 2026


@bbrowning I will give it a try. I'm already using --kv-cache-dtype fp8_e4m3, but I do have prefix caching enabled, which was not in the official recipe for GLM-5.2.

@HyunCello

HyunCello commented Jun 25, 2026


Even with this fix, we see weird behavior when mtp is enabled for tool calling with output validation.

What is the weird behavior you see? Tool call parse or reasoning failures such as content cut off, tags escaping, etc? Errors in requests and vLLM logs about failed FSM advances in the grammar? Or more subtle issues such as the model not calling the right tools or forgetting what tools it has?
I just want to confirm which of these (or something else) mtp is triggering so we can reproduce and fix it.

@bbrowning I don't have an example code to share, but behavior I do. Ex when using claude code in plan mode, the model is not able to call tools to exit plan mode "ExitPlanMode". It would say it called the tool, but doesn't.

On vLLM everything returns HTTP 200, I was able to mitigate the issue by passing --chat-template-content-format string to the engine. Before adding that around 40% of tool calls were failing.

When using structured format, the model would sometimes return the data with a different format, ex:

Expected: {"temperature": 70} Model sometimes returns: `json ```{"temperature": 70}````

I'm having the same issue with GLM 5.2 in Claude Code when entering and exiting Plan Mode.

@chaunceyjiang

Collaborator Author

@gaby we see weird behavior when mtp is enabled for tool calling with output validation.

You can use this parameter to resolve the incompatibility between MTP and function calling.
--structured-outputs-config.enable_in_reasoning=True

@chaunceyjiang chaunceyjiang deleted the glm_tool_parser branch June 26, 2026 09:29
@gaby

gaby commented Jun 26, 2026


@gaby we see weird behavior when mtp is enabled for tool calling with output validation.

You can use this parameter to resolve the incompatibility between MTP and function calling. --structured-outputs-config.enable_in_reasoning=True

Thanks, will give it a try. Probably worth updating the official recipe in https://recipes.vllm.ai/

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), tool-calling

7 participants