Skip to content

fix(anthropic): preserve inline system message position for prefix caching#44602

Merged
chaunceyjiang merged 2 commits into
vllm-project:mainfrom
felix0080:fix/anthropic-inline-system-preserve-position
Jun 18, 2026
Merged

fix(anthropic): preserve inline system message position for prefix caching#44602
chaunceyjiang merged 2 commits into
vllm-project:mainfrom
felix0080:fix/anthropic-inline-system-preserve-position

Conversation

@felix0080

@felix0080 felix0080 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Problem

PR #44283 merged all inline role: system messages from the messages array into a single leading system message. This changes the conversation prefix, breaking KV-cache hits in multi-turn dialogues.

#44048 (currently open) moves the same merge logic to the protocol layer but retains the same prefix-breaking behavior.

Example of the problem

Input:  [user:A, assistant:B, system:new_rule, user:C]
                ↑ prefix cache can hit here

#44283: [system:(all merged), user:A, assistant:B, user:C]
         ↑ prefix completely different → cache miss

This PR: [system:top-level, user:A, assistant:B, system:new_rule, user:C]
              ↑ prefix unchanged → cache hits preserved

Fix

  • Remove inline system message extraction from _convert_system_message — only handle top-level system field there
  • In _convert_messages, handle system messages with a dedicated _extract_system_text helper that:
    • Strips x-anthropic-billing-header from inline system messages (previously only done for top-level)
    • Only emits a system message if there is real content (avoids empty {"role": "system"} messages that _convert_block could produce)
  • Add 2 new tests for billing header stripping on inline system messages

Why this approach

  • Minimal and localized: all system handling is explicit, not spread across _convert_block / _convert_message_content
  • Prefix structure stays intact for all conversation turns
  • Billing header stripping is consistent between top-level and inline system messages

Related

Test Plan

(AI assistance was used; I reviewed every changed line.)

python -m pytest tests/entrypoints/anthropic/test_anthropic_messages_conversion.py -v

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the frontend label Jun 5, 2026
@felix0080 felix0080 force-pushed the fix/anthropic-inline-system-preserve-position branch from 71ef5be to 835f37d Compare June 5, 2026 02:45
@felix0080

Copy link
Copy Markdown
Contributor Author

Ready for review — could a maintainer add the ready label to trigger CI? Thanks.

Comment thread vllm/entrypoints/anthropic/serving.py Outdated
) -> None:
"""Convert Anthropic messages to OpenAI format"""

def _extract_system_text(msg) -> str | None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please convert this method into a class method by adding @classmethod.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaunceyjiang @chaunceyjiang Done. I've converted it to a class method

openai_messages.append({"role": "system", "content": "".join(system_parts)})

@classmethod
def _extract_system_text(cls, msg) -> str | None:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaunceyjiang Done. I've converted it to a class method

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to DCO.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaunceyjiang Thanks for the reminder. DCO fixed

@felix0080 felix0080 force-pushed the fix/anthropic-inline-system-preserve-position branch from e81f76a to 4439ea4 Compare June 5, 2026 08:00
@aleksandaryanakiev

Copy link
Copy Markdown
Contributor

LGTM

@chaunceyjiang chaunceyjiang added the verified Run pre-commit for new contributors without triggering other tests label Jun 5, 2026
if msg.role == "system":
text = cls._extract_system_text(msg)
if text:
openai_messages.append({"role": "system", "content": text})

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, after this change, the Qwen3.5/Qwen3.6 series models will no longer be supported.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaunceyjiang This change is meant to preserve prefix caching for Anthropic clients like Claude Code that send system messages mid-conversation. The conflict with Qwen's chat template is a template-level limitation — Qwen expects system to appear only at the beginning — and that should be addressed by updating the Qwen template to handle non-leading system messages, not by compromising the conversion layer for all users.

@bbrowning bbrowning Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will impact not only Qwen models - even though many models may allow system messages at any position in the message list it doesn't mean those models were trained on system messages that come after user messages in a conversation. Most are not trained on this kind of data, and expect the system messages (even if more than 1) to come before the user messages.

Are we aware of any open weight model specifically trained on system messages that appear later in a conversation? This feels like we're trading KV cache efficiency for worse overall trajectories in these agentic workflows.

@felix0080 felix0080 Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbrowning
Anecdotally, we're running GLM-5.1 and Deepseek-v4 in PD-disaggregated setups with inline system messages preserved, and we haven't observed degraded output quality compared to merging them to the front. The trajectories appear consistent.
This suggests the "training gap" may be narrower than assumed — at least for these models, the capability seems to exist even if it wasn't explicitly optimized for.

@felix0080 felix0080 Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbrowning
But this matters beyond just quality — it directly impacts deployment architecture.
With PD disaggregation and pooling becoming the standard deployment pattern, the prefill node's value is almost entirely in prefix cache hits. In agentic workflows, 99% of the conversation prefix is identical across requests — the system prompt and prior turns rarely change. If merging inline system messages to the front invalidates the prefix, the prefill node essentially loses its purpose, and the whole PD architecture becomes unusable for these workloads.
So the trade-off is not just "cache vs quality" — it's whether PD-pooled deployments can serve agentic Anthropic clients at all.
That said, I don't want to dismiss the training-distribution concern. The question is whether a universal merge is the right default, or if deployers should be able to choose based on their model and infrastructure.
I'm happy to add an opt-in mechanism to this PR, or follow up with a separate one — whichever the maintainers prefer. For example, enable_inline_system_merge defaulting to true for backward compatibility: deployers running Qwen keep it enabled, those serving Claude Code with PD separation can disable it and preserve prefix cache viability.
Does that approach work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbrowning @dr75 Thanks for the review and accepting the approach. Happy to follow up with the opt-in merge flag separately if needed. Let me know if anything else is needed to get this merged.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backward compatibility is very necessary, so I have always believed that merging by default into the top level and retaining mid-system-prompt when the flag is enabled seems to be the most reasonable solution. This way, at least all models can support claude, and for large-scale deployment, users can be required to explicitly enable the flag to ensure the highest kvcache efficiency

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbrowning I’m inclined to merge this PR, since this is a Qwen-specific issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaunceyjiang Friendly ping — just wanted to check if there's anything else needed to move this forward. Thanks!

@felix0080 felix0080 Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chaunceyjiang All checks are green now — ready to go. Thanks!

@chaunceyjiang chaunceyjiang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 5, 2026
@felix0080 felix0080 closed this Jun 7, 2026
@felix0080 felix0080 reopened this Jun 7, 2026
@mergify

mergify Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @felix0080.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 11, 2026
@felix0080 felix0080 force-pushed the fix/anthropic-inline-system-preserve-position branch from 99eaa7e to 88f0825 Compare June 12, 2026 02:46
@mergify

mergify Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Hi @felix0080, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@mergify mergify Bot removed the needs-rebase label Jun 12, 2026
@felix0080 felix0080 force-pushed the fix/anthropic-inline-system-preserve-position branch 2 times, most recently from b3a8af5 to 231acc1 Compare June 12, 2026 05:52
@mergify

mergify Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Hi @felix0080, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@felix0080 felix0080 force-pushed the fix/anthropic-inline-system-preserve-position branch from 231acc1 to ff93cd5 Compare June 12, 2026 06:55
@mergify

mergify Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Hi @felix0080, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

…ching

PR vllm-project#44283 merged all inline system:role messages into a single leading
system message, which changes the conversation prefix and breaks
KV-cache hits in multi-turn dialogues.

This fix keeps inline system messages at their original position:

- Remove inline system extraction from _convert_system_message (only
  top-level system is handled there)
- In _convert_messages, handle system messages with a dedicated
  _extract_system_text helper that strips billing headers and only
  emits the message if real content exists — avoiding the
  _convert_block / _convert_message_content path which does not strip
  billing headers and may omit the "content" key
- Add tests for billing header stripping on inline system messages

Unlike vllm-project#44048 which moves the same merge logic to the protocol layer,
this approach fundamentally avoids the prefix-breaking merge entirely.

Co-authored-by: Hermes Agent
Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Per maintainer review feedback.

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
@felix0080 felix0080 force-pushed the fix/anthropic-inline-system-preserve-position branch from ff93cd5 to 7adca21 Compare June 12, 2026 09:10
@cxg987

cxg987 commented Jun 13, 2026

Copy link
Copy Markdown

LGTM

@bbrowning

Copy link
Copy Markdown
Collaborator

I'm ok to merge this, but would like a fast follow-up ready to fix the issues with Qwen (or similar) models that cannot handle multiple system messages. I'm not sure of the best approach there - maybe we construct some messages with system -> user -> system turns, feed it into the chat template, and if it blows up set a flag that tells us to collapse for this model? We have a lot of users that use the Messages API with Qwen models of various sorts, so I'd like us to minimize the time we break them.

@felix0080

Copy link
Copy Markdown
Contributor Author

I'm ok to merge this, but would like a fast follow-up ready to fix the issues with Qwen (or similar) models that cannot handle multiple system messages. I'm not sure of the best approach there - maybe we construct some messages with system -> user -> system turns, feed it into the chat template, and if it blows up set a flag that tells us to collapse for this model? We have a lot of users that use the Messages API with Qwen models of various sorts, so I'd like us to minimize the time we break them.

@bbrowning @chaunceyjiang Thanks for approving. Sounds good — I'll submit a follow-up PR with auto-detection once this is merged.

@chaunceyjiang chaunceyjiang merged commit 1e9f04d into vllm-project:main Jun 18, 2026
48 of 49 checks passed
felix0080 added a commit to felix0080/vllm that referenced this pull request Jun 18, 2026
…tem messages

When the chat template requires system-first ordering (e.g., Qwen3.5/3.6
with its loop.first guard), inline system messages preserved at their
original position by vllm-project#44602 would be rejected at template render time.

This adds auto-detection: a [system, user, system, user] conversation
is rendered against the template at init time. The detection covers
three scenarios:

- Template raises → merge inline system into top-level block
  (compatible with Qwen and similar models)
- Template succeeds → preserve inline system in-place
  (optimal prefix caching for models that support it)
- No template → conservative default: merge

No flag or configuration needed.

Co-authored-by: Hermes Agent
Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
@felix0080

Copy link
Copy Markdown
Contributor Author

Follow-up PR is now ready: #46025 — auto-detects template support for mid-conversation system messages so Qwen and similar models are automatically compatible.

djramic pushed a commit to djramic/vllm that referenced this pull request Jun 18, 2026
…ching (vllm-project#44602)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
felix0080 added a commit to felix0080/vllm that referenced this pull request Jun 18, 2026
…tem messages

When the chat template requires system-first ordering (e.g., Qwen3.5/3.6
with its loop.first guard), inline system messages preserved at their
original position by vllm-project#44602 would be rejected at template render time.

This adds auto-detection: a [system, user, system, user] conversation
is rendered against the template at init time. The detection covers
three scenarios:

- Template raises → merge inline system into top-level block
  (compatible with Qwen and similar models)
- Template succeeds → preserve inline system in-place
  (optimal prefix caching for models that support it)
- No template → conservative default: merge

No flag or configuration needed.

Co-authored-by: Hermes Agent
Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
felix0080 added a commit to felix0080/vllm that referenced this pull request Jun 18, 2026
…tem messages

When the chat template requires system-first ordering (e.g., Qwen3.5/3.6
with its loop.first guard), inline system messages preserved at their
original position by vllm-project#44602 would be rejected at template render time.

This adds auto-detection: a [system, user, system, user] conversation
is rendered against the template at init time. The detection covers
three scenarios:

- Template raises → merge inline system into top-level block
  (compatible with Qwen and similar models)
- Template succeeds → preserve inline system in-place
  (optimal prefix caching for models that support it)
- No template → conservative default: merge

No flag or configuration needed.

Co-authored-by: Hermes Agent
Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…ching (vllm-project#44602)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
Stardust-minus added a commit to Stardust-minus/vllm that referenced this pull request Jun 19, 2026
…stem merge

When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.

Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.

Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.

Changes:
- __init__: when chat_template is None, resolve the model's actual
  template via resolve_chat_template(tokenizer, None, None,
  model_config=...) (tokenizer from self.renderer.tokenizer), then feed
  that to _detect_merge_inline_system. Falls back to the existing
  conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
  explicitly (the package __init__ does not pull them in, so
  jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
  else had imported the submodule). Broadened the catch to
  except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.

Co-Authored-By: Claude <noreply@anthropic.com>
Stardust-minus added a commit to Stardust-minus/vllm that referenced this pull request Jun 19, 2026
…stem merge

When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.

Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.

Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.

Changes:
- __init__: when chat_template is None, resolve the model's actual
  template via resolve_chat_template(tokenizer, None, None,
  model_config=...) (tokenizer from self.renderer.tokenizer), then feed
  that to _detect_merge_inline_system. Falls back to the existing
  conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
  explicitly (the package __init__ does not pull them in, so
  jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
  else had imported the submodule). Broadened the catch to
  except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
Stardust-minus added a commit to Stardust-minus/vllm that referenced this pull request Jun 20, 2026
…stem merge

When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.

Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.

Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.

Changes:
- __init__: when chat_template is None, resolve the model's actual
  template via resolve_chat_template(tokenizer, None, None,
  model_config=...) (tokenizer from self.renderer.tokenizer), then feed
  that to _detect_merge_inline_system. Falls back to the existing
  conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
  explicitly (the package __init__ does not pull them in, so
  jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
  else had imported the submodule). Broadened the catch to
  except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…ching (vllm-project#44602)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…ching (vllm-project#44602)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
Stardust-minus added a commit to Stardust-minus/vllm that referenced this pull request Jun 23, 2026
…stem merge

When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.

Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.

Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.

Changes:
- __init__: when chat_template is None, resolve the model's actual
  template via resolve_chat_template(tokenizer, None, None,
  model_config=...) (tokenizer from self.renderer.tokenizer), then feed
  that to _detect_merge_inline_system. Falls back to the existing
  conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
  explicitly (the package __init__ does not pull them in, so
  jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
  else had imported the submodule). Broadened the catch to
  except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…ching (vllm-project#44602)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend ready ONLY add when PR is ready to merge/full CI is needed verified Run pre-commit for new contributors without triggering other tests

7 participants