Skip to content

fix(anthropic): auto-detect template support for mid-conversation system messages#46025

Merged
bbrowning merged 2 commits into
vllm-project:mainfrom
felix0080:fix/anthropic-auto-detect-inline-system
Jun 18, 2026
Merged

fix(anthropic): auto-detect template support for mid-conversation system messages#46025
bbrowning merged 2 commits into
vllm-project:mainfrom
felix0080:fix/anthropic-auto-detect-inline-system

Conversation

@felix0080

@felix0080 felix0080 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Problem

#44602 preserves inline role: system messages at their original position for prefix caching. However, some models have chat templates that reject system messages appearing after the first position (e.g., Qwen3.5/3.6, #41114). Without mitigation, these models would return 400 errors for valid Anthropic requests.

Fix

Auto-detect whether the chat template supports mid-conversation system messages by rendering a [system, user, system, user] test conversation at init time. The detection covers three scenarios:

  • Template raises (e.g. Qwen loop.first guard)
    → inline system messages are merged into the top-level block
  • Template succeeds (most models)
    → inline system messages are preserved in-place for optimal prefix caching
  • No template (chat_template=None)
    → conservative default: merge

No flag or configuration needed — the server adapts automatically to
the model's template capabilities.

Changes

2 files, +96 lines:

  • _detect_merge_inline_system() classmethod — renders a Jinja test conversation, caches result as class attribute
  • _convert_system_message() — when merging, extracts inline system messages into the top-level block
  • _convert_messages() — when merging, skips system messages (already handled by _convert_system_message)
  • Tests: 3 cases covering Qwen guard, unrestricted template, and None template

Behavior matrix

Template Detection Effect
Qwen (system-first) merge=True merge → compatible
Llama (no restriction) merge=False preserve → cache optimal
None (no template) merge=True merge → safe default

Related

Test Plan

(AI assistance was used; I reviewed every changed line.)

python -m pytest tests/entrypoints/anthropic/test_anthropic_messages_conversion.py::TestDetectMergeInlineSystem -v
@felix0080

Copy link
Copy Markdown
Contributor Author

@chaunceyjiang @bbrowning follow-up to #44602 as discussed. Auto-detection added — Qwen models get inline system messages merged automatically, models without template restrictions keep the cache-friendly behavior.

@bbrowning bbrowning left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start, but a few changes requested around mutable state on classes and jinja sandboxing for security.

Comment thread vllm/entrypoints/anthropic/serving.py Outdated
Comment thread vllm/entrypoints/anthropic/serving.py Outdated
Comment thread vllm/entrypoints/anthropic/serving.py Outdated
Comment thread vllm/entrypoints/anthropic/serving.py Outdated
@felix0080 felix0080 force-pushed the fix/anthropic-auto-detect-inline-system branch from a13a316 to 77de025 Compare June 18, 2026 13:31
…tem messages

When the chat template requires system-first ordering (e.g., Qwen3.5/3.6
with its loop.first guard), inline system messages preserved at their
original position by vllm-project#44602 would be rejected at template render time.

This adds auto-detection: a [system, user, system, user] conversation
is rendered against the template at init time. The detection covers
three scenarios:

- Template raises → merge inline system into top-level block
  (compatible with Qwen and similar models)
- Template succeeds → preserve inline system in-place
  (optimal prefix caching for models that support it)
- No template → conservative default: merge

No flag or configuration needed.

Co-authored-by: Hermes Agent
Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
@felix0080 felix0080 force-pushed the fix/anthropic-auto-detect-inline-system branch from 77de025 to 0ec120a Compare June 18, 2026 13:35
@felix0080 felix0080 requested a review from bbrowning June 18, 2026 13:39
@bbrowning

bbrowning commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

I'm seeing ruff format failures in the pre-commit hook locally when I pulled this, which will fail in CI as well. You can see this locally to fix with:

pre-commit run --files tests/entrypoints/anthropic/test_anthropic_messages_conversion.py vllm/entrypoints/anthropic/serving.py

Otherwise, looks good! I tested on a live server and after #44602 merged get a 500 error from Qwen 3.6-27B and Claude Code without this PR and everything works fine with this PR.

[Edit]: I just pushed a commit with the ruff fixes on top of this to get this merged.

Signed-off-by: Ben Browning <bbrownin@redhat.com>

@bbrowning bbrowning left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this with me and fixing this for models that don't support duplicate system messages!

@bbrowning bbrowning added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 18, 2026
@vllm-project vllm-project deleted a comment from mergify Bot Jun 18, 2026
@bbrowning bbrowning merged commit 4ce2d01 into vllm-project:main Jun 18, 2026
51 of 53 checks passed
@felix0080

Copy link
Copy Markdown
Contributor Author

@bbrowning Thanks for the review, testing, and pushing the ruff fixes — appreciate you iterating with me on this.

divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…tem messages (vllm-project#46025)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
Stardust-minus added a commit to Stardust-minus/vllm that referenced this pull request Jun 19, 2026
…stem merge

When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.

Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.

Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.

Changes:
- __init__: when chat_template is None, resolve the model's actual
  template via resolve_chat_template(tokenizer, None, None,
  model_config=...) (tokenizer from self.renderer.tokenizer), then feed
  that to _detect_merge_inline_system. Falls back to the existing
  conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
  explicitly (the package __init__ does not pull them in, so
  jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
  else had imported the submodule). Broadened the catch to
  except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.

Co-Authored-By: Claude <noreply@anthropic.com>
Stardust-minus added a commit to Stardust-minus/vllm that referenced this pull request Jun 19, 2026
…stem merge

When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.

Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.

Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.

Changes:
- __init__: when chat_template is None, resolve the model's actual
  template via resolve_chat_template(tokenizer, None, None,
  model_config=...) (tokenizer from self.renderer.tokenizer), then feed
  that to _detect_merge_inline_system. Falls back to the existing
  conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
  explicitly (the package __init__ does not pull them in, so
  jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
  else had imported the submodule). Broadened the catch to
  except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
Stardust-minus added a commit to Stardust-minus/vllm that referenced this pull request Jun 20, 2026
…stem merge

When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.

Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.

Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.

Changes:
- __init__: when chat_template is None, resolve the model's actual
  template via resolve_chat_template(tokenizer, None, None,
  model_config=...) (tokenizer from self.renderer.tokenizer), then feed
  that to _detect_merge_inline_system. Falls back to the existing
  conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
  explicitly (the package __init__ does not pull them in, so
  jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
  else had imported the submodule). Broadened the catch to
  except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…tem messages (vllm-project#46025)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…tem messages (vllm-project#46025)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
Stardust-minus added a commit to Stardust-minus/vllm that referenced this pull request Jun 23, 2026
…stem merge

When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.

Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.

Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.

Changes:
- __init__: when chat_template is None, resolve the model's actual
  template via resolve_chat_template(tokenizer, None, None,
  model_config=...) (tokenizer from self.renderer.tokenizer), then feed
  that to _detect_merge_inline_system. Falls back to the existing
  conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
  explicitly (the package __init__ does not pull them in, so
  jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
  else had imported the submodule). Broadened the catch to
  except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…tem messages (vllm-project#46025)

Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: felix0080 <felix0080@users.noreply.github.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend ready ONLY add when PR is ready to merge/full CI is needed

2 participants