fix(anthropic): auto-detect template support for mid-conversation system messages by felix0080 · Pull Request #46025 · vllm-project/vllm

felix0080 · 2026-06-18T09:28:01Z

Problem

#44602 preserves inline role: system messages at their original position for prefix caching. However, some models have chat templates that reject system messages appearing after the first position (e.g., Qwen3.5/3.6, #41114). Without mitigation, these models would return 400 errors for valid Anthropic requests.

Fix

Auto-detect whether the chat template supports mid-conversation system messages by rendering a [system, user, system, user] test conversation at init time. The detection covers three scenarios:

Template raises (e.g. Qwen loop.first guard)
→ inline system messages are merged into the top-level block
Template succeeds (most models)
→ inline system messages are preserved in-place for optimal prefix caching
No template (chat_template=None)
→ conservative default: merge

No flag or configuration needed — the server adapts automatically to
the model's template capabilities.

Changes

2 files, +96 lines:

_detect_merge_inline_system() classmethod — renders a Jinja test conversation, caches result as class attribute
_convert_system_message() — when merging, extracts inline system messages into the top-level block
_convert_messages() — when merging, skips system messages (already handled by _convert_system_message)
Tests: 3 cases covering Qwen guard, unrestricted template, and None template

Behavior matrix

Template	Detection	Effect
Qwen (system-first)	merge=True	merge → compatible
Llama (no restriction)	merge=False	preserve → cache optimal
None (no template)	merge=True	merge → safe default

Test Plan

(AI assistance was used; I reviewed every changed line.)

python -m pytest tests/entrypoints/anthropic/test_anthropic_messages_conversion.py::TestDetectMergeInlineSystem -v

felix0080 · 2026-06-18T09:32:49Z

@chaunceyjiang @bbrowning follow-up to #44602 as discussed. Auto-detection added — Qwen models get inline system messages merged automatically, models without template restrictions keep the cache-friendly behavior.

bbrowning

This is a good start, but a few changes requested around mutable state on classes and jinja sandboxing for security.

…tem messages When the chat template requires system-first ordering (e.g., Qwen3.5/3.6 with its loop.first guard), inline system messages preserved at their original position by vllm-project#44602 would be rejected at template render time. This adds auto-detection: a [system, user, system, user] conversation is rendered against the template at init time. The detection covers three scenarios: - Template raises → merge inline system into top-level block (compatible with Qwen and similar models) - Template succeeds → preserve inline system in-place (optimal prefix caching for models that support it) - No template → conservative default: merge No flag or configuration needed. Co-authored-by: Hermes Agent Signed-off-by: felix0080 <felix0080@users.noreply.github.com>

bbrowning · 2026-06-18T14:59:23Z

I'm seeing ruff format failures in the pre-commit hook locally when I pulled this, which will fail in CI as well. You can see this locally to fix with:

pre-commit run --files tests/entrypoints/anthropic/test_anthropic_messages_conversion.py vllm/entrypoints/anthropic/serving.py

Otherwise, looks good! I tested on a live server and after #44602 merged get a 500 error from Qwen 3.6-27B and Claude Code without this PR and everything works fine with this PR.

[Edit]: I just pushed a commit with the ruff fixes on top of this to get this merged.

Signed-off-by: Ben Browning <bbrownin@redhat.com>

bbrowning

Thanks for iterating on this with me and fixing this for models that don't support duplicate system messages!

felix0080 · 2026-06-18T20:38:31Z

@bbrowning Thanks for the review, testing, and pushing the ruff fixes — appreciate you iterating with me on this.

…tem messages (vllm-project#46025) Signed-off-by: felix0080 <felix0080@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: felix0080 <felix0080@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

…stem merge When a client (e.g. Claude Code) appends a system message to the messages array of a /v1/messages request, the Anthropic compatibility layer could trigger a full prompt recompute (observed as a ~75s TTFT spike) because the entire prefix-cache block chain is invalidated from the system block onward. Root cause: AnthropicServingMessages.__init__ calls _detect_merge_inline_system(chat_template) with the raw --chat-template arg, which is None by default. The detector returns True ("merge inline system into the leading block") whenever the template is falsy — even for templates like GLM that already accept mid-conversation system messages. That merge hoists the trailing system message into the leading system block, changing tokens from offset 0 onward and breaking the chained block hash, so every subsequent block's hash changes and a full recompute is required. Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the merge-vs-inplace switch but defaulted to the conservative (merge) path. This closes the gap: resolve the model's actual chat template (via resolve_chat_template, the same call the renderer uses) before deciding, so GLM-like templates are treated as "no merge" and a trailing system message no longer destroys the prefix. Changes: - __init__: when chat_template is None, resolve the model's actual template via resolve_chat_template(tokenizer, None, None, model_config=...) (tokenizer from self.renderer.tokenizer), then feed that to _detect_merge_inline_system. Falls back to the existing conservative behavior only when no template can be resolved. - _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext explicitly (the package __init__ does not pull them in, so jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something else had imported the submodule). Broadened the catch to except Exception so malformed/non-template input falls back cleanly. - tests: add test_glm_template_does_not_merge. Co-Authored-By: Claude <noreply@anthropic.com>

…stem merge When a client (e.g. Claude Code) appends a system message to the messages array of a /v1/messages request, the Anthropic compatibility layer could trigger a full prompt recompute (observed as a ~75s TTFT spike) because the entire prefix-cache block chain is invalidated from the system block onward. Root cause: AnthropicServingMessages.__init__ calls _detect_merge_inline_system(chat_template) with the raw --chat-template arg, which is None by default. The detector returns True ("merge inline system into the leading block") whenever the template is falsy — even for templates like GLM that already accept mid-conversation system messages. That merge hoists the trailing system message into the leading system block, changing tokens from offset 0 onward and breaking the chained block hash, so every subsequent block's hash changes and a full recompute is required. Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the merge-vs-inplace switch but defaulted to the conservative (merge) path. This closes the gap: resolve the model's actual chat template (via resolve_chat_template, the same call the renderer uses) before deciding, so GLM-like templates are treated as "no merge" and a trailing system message no longer destroys the prefix. Changes: - __init__: when chat_template is None, resolve the model's actual template via resolve_chat_template(tokenizer, None, None, model_config=...) (tokenizer from self.renderer.tokenizer), then feed that to _detect_merge_inline_system. Falls back to the existing conservative behavior only when no template can be resolved. - _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext explicitly (the package __init__ does not pull them in, so jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something else had imported the submodule). Broadened the catch to except Exception so malformed/non-template input falls back cleanly. - tests: add test_glm_template_does_not_merge. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Stardust-minus <stardust@fish.audio>

…tem messages (vllm-project#46025) Signed-off-by: felix0080 <felix0080@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: felix0080 <felix0080@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com>

…stem merge When a client (e.g. Claude Code) appends a system message to the messages array of a /v1/messages request, the Anthropic compatibility layer could trigger a full prompt recompute (observed as a ~75s TTFT spike) because the entire prefix-cache block chain is invalidated from the system block onward. Root cause: AnthropicServingMessages.__init__ calls _detect_merge_inline_system(chat_template) with the raw --chat-template arg, which is None by default. The detector returns True ("merge inline system into the leading block") whenever the template is falsy — even for templates like GLM that already accept mid-conversation system messages. That merge hoists the trailing system message into the leading system block, changing tokens from offset 0 onward and breaking the chained block hash, so every subsequent block's hash changes and a full recompute is required. Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the merge-vs-inplace switch but defaulted to the conservative (merge) path. This closes the gap: resolve the model's actual chat template (via resolve_chat_template, the same call the renderer uses) before deciding, so GLM-like templates are treated as "no merge" and a trailing system message no longer destroys the prefix. Changes: - __init__: when chat_template is None, resolve the model's actual template via resolve_chat_template(tokenizer, None, None, model_config=...) (tokenizer from self.renderer.tokenizer), then feed that to _detect_merge_inline_system. Falls back to the existing conservative behavior only when no template can be resolved. - _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext explicitly (the package __init__ does not pull them in, so jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something else had imported the submodule). Broadened the catch to except Exception so malformed/non-template input falls back cleanly. - tests: add test_glm_template_does_not_merge. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Stardust-minus <stardust@fish.audio>

…tem messages (vllm-project#46025) Signed-off-by: felix0080 <felix0080@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: felix0080 <felix0080@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com>

felix0080 requested review from AndreasKaratzas, DarkLight1337, NickLucche, aarnphm, mgoin and robertgshaw2-redhat as code owners June 18, 2026 09:28

mergify Bot added the frontend label Jun 18, 2026

felix0080 mentioned this pull request Jun 18, 2026

fix(anthropic): preserve inline system message position for prefix caching #44602

Merged

bbrowning requested changes Jun 18, 2026

View reviewed changes

Comment thread vllm/entrypoints/anthropic/serving.py Outdated

Comment thread vllm/entrypoints/anthropic/serving.py Outdated

Comment thread vllm/entrypoints/anthropic/serving.py Outdated

Comment thread vllm/entrypoints/anthropic/serving.py Outdated

felix0080 force-pushed the fix/anthropic-auto-detect-inline-system branch from a13a316 to 77de025 Compare June 18, 2026 13:31

felix0080 force-pushed the fix/anthropic-auto-detect-inline-system branch from 77de025 to 0ec120a Compare June 18, 2026 13:35

felix0080 requested a review from bbrowning June 18, 2026 13:39

ruff cleanup

8cb4933

Signed-off-by: Ben Browning <bbrownin@redhat.com>

bbrowning approved these changes Jun 18, 2026

View reviewed changes

bbrowning added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 18, 2026

vllm-project deleted a comment from mergify Bot Jun 18, 2026

bbrowning merged commit 4ce2d01 into vllm-project:main Jun 18, 2026
51 of 53 checks passed

Stardust-minus mentioned this pull request Jun 19, 2026

fix(anthropic): resolve model chat template before deciding inline-system merge #46196

Open

This was referenced Jun 22, 2026

fix(anthropic): handle mid-conversation system messages sgl-project/sglang#26773

Merged

[Anthropic] Mid-conversation system messages are hoisted to top-level, forking the prefix cache on inline-capable templates sgl-project/sglang#28883

Closed

JustinTong0323 mentioned this pull request Jun 22, 2026

fix(anthropic): detect-and-passthrough mid-conversation system messages sgl-project/sglang#28906

Merged

trilamsr mentioned this pull request Jun 22, 2026

[Anthropic] Skip mid-conv system hoist on inline-capable templates sgl-project/sglang#28955

Closed

wqh17101 mentioned this pull request Jun 25, 2026

[Bug] DeepSeekV4-Flash produces incorrect output with inline system messages after PR #46025 when preserved in-place #46710

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(anthropic): auto-detect template support for mid-conversation system messages#46025

fix(anthropic): auto-detect template support for mid-conversation system messages#46025
bbrowning merged 2 commits into
vllm-project:mainfrom
felix0080:fix/anthropic-auto-detect-inline-system

felix0080 commented Jun 18, 2026 •

edited

Loading

felix0080 commented Jun 18, 2026

bbrowning left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bbrowning commented Jun 18, 2026 •

edited

Loading

bbrowning left a comment

Uh oh!

felix0080 commented Jun 18, 2026

Labels

2 participants

Uh oh!

Uh oh!

Conversation

felix0080 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Changes

Behavior matrix

Related

Test Plan

felix0080 commented Jun 18, 2026

bbrowning left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bbrowning commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

bbrowning left a comment

Choose a reason for hiding this comment

Uh oh!

felix0080 commented Jun 18, 2026

Labels

2 participants

felix0080 commented Jun 18, 2026 •

edited

Loading

bbrowning commented Jun 18, 2026 •

edited

Loading