fix(anthropic): auto-detect template support for mid-conversation system messages#46025
Conversation
|
@chaunceyjiang @bbrowning follow-up to #44602 as discussed. Auto-detection added — Qwen models get inline system messages merged automatically, models without template restrictions keep the cache-friendly behavior. |
bbrowning
left a comment
There was a problem hiding this comment.
This is a good start, but a few changes requested around mutable state on classes and jinja sandboxing for security.
a13a316 to
77de025
Compare
…tem messages When the chat template requires system-first ordering (e.g., Qwen3.5/3.6 with its loop.first guard), inline system messages preserved at their original position by vllm-project#44602 would be rejected at template render time. This adds auto-detection: a [system, user, system, user] conversation is rendered against the template at init time. The detection covers three scenarios: - Template raises → merge inline system into top-level block (compatible with Qwen and similar models) - Template succeeds → preserve inline system in-place (optimal prefix caching for models that support it) - No template → conservative default: merge No flag or configuration needed. Co-authored-by: Hermes Agent Signed-off-by: felix0080 <felix0080@users.noreply.github.com>
77de025 to
0ec120a
Compare
|
I'm seeing ruff format failures in the pre-commit hook locally when I pulled this, which will fail in CI as well. You can see this locally to fix with:
Otherwise, looks good! I tested on a live server and after #44602 merged get a 500 error from Qwen 3.6-27B and Claude Code without this PR and everything works fine with this PR. [Edit]: I just pushed a commit with the ruff fixes on top of this to get this merged. |
Signed-off-by: Ben Browning <bbrownin@redhat.com>
bbrowning
left a comment
There was a problem hiding this comment.
Thanks for iterating on this with me and fixing this for models that don't support duplicate system messages!
|
@bbrowning Thanks for the review, testing, and pushing the ruff fixes — appreciate you iterating with me on this. |
…tem messages (vllm-project#46025) Signed-off-by: felix0080 <felix0080@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: felix0080 <felix0080@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
…stem merge
When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.
Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.
Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.
Changes:
- __init__: when chat_template is None, resolve the model's actual
template via resolve_chat_template(tokenizer, None, None,
model_config=...) (tokenizer from self.renderer.tokenizer), then feed
that to _detect_merge_inline_system. Falls back to the existing
conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
explicitly (the package __init__ does not pull them in, so
jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
else had imported the submodule). Broadened the catch to
except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.
Co-Authored-By: Claude <noreply@anthropic.com>
…stem merge
When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.
Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.
Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.
Changes:
- __init__: when chat_template is None, resolve the model's actual
template via resolve_chat_template(tokenizer, None, None,
model_config=...) (tokenizer from self.renderer.tokenizer), then feed
that to _detect_merge_inline_system. Falls back to the existing
conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
explicitly (the package __init__ does not pull them in, so
jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
else had imported the submodule). Broadened the catch to
except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
…stem merge
When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.
Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.
Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.
Changes:
- __init__: when chat_template is None, resolve the model's actual
template via resolve_chat_template(tokenizer, None, None,
model_config=...) (tokenizer from self.renderer.tokenizer), then feed
that to _detect_merge_inline_system. Falls back to the existing
conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
explicitly (the package __init__ does not pull them in, so
jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
else had imported the submodule). Broadened the catch to
except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
…tem messages (vllm-project#46025) Signed-off-by: felix0080 <felix0080@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: felix0080 <felix0080@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com>
…tem messages (vllm-project#46025) Signed-off-by: felix0080 <felix0080@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: felix0080 <felix0080@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com>
…stem merge
When a client (e.g. Claude Code) appends a system message to the
messages array of a /v1/messages request, the Anthropic compatibility
layer could trigger a full prompt recompute (observed as a ~75s TTFT
spike) because the entire prefix-cache block chain is invalidated
from the system block onward.
Root cause: AnthropicServingMessages.__init__ calls
_detect_merge_inline_system(chat_template) with the raw --chat-template
arg, which is None by default. The detector returns True ("merge inline
system into the leading block") whenever the template is falsy — even for
templates like GLM that already accept mid-conversation system messages.
That merge hoists the trailing system message into the leading system
block, changing tokens from offset 0 onward and breaking the chained
block hash, so every subsequent block's hash changes and a full recompute
is required.
Builds on vllm-project#44283 / vllm-project#44602 / vllm-project#46025, which introduced the
merge-vs-inplace switch but defaulted to the conservative (merge) path.
This closes the gap: resolve the model's actual chat template (via
resolve_chat_template, the same call the renderer uses) before deciding,
so GLM-like templates are treated as "no merge" and a trailing system
message no longer destroys the prefix.
Changes:
- __init__: when chat_template is None, resolve the model's actual
template via resolve_chat_template(tokenizer, None, None,
model_config=...) (tokenizer from self.renderer.tokenizer), then feed
that to _detect_merge_inline_system. Falls back to the existing
conservative behavior only when no template can be resolved.
- _detect_merge_inline_system: import jinja2.sandbox / jinja2.ext
explicitly (the package __init__ does not pull them in, so
jinja2.sandbox.ImmutableSandboxedEnvironment only worked when something
else had imported the submodule). Broadened the catch to
except Exception so malformed/non-template input falls back cleanly.
- tests: add test_glm_template_does_not_merge.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Stardust-minus <stardust@fish.audio>
…tem messages (vllm-project#46025) Signed-off-by: felix0080 <felix0080@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: felix0080 <felix0080@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com>
Problem
#44602 preserves inline
role: systemmessages at their original position for prefix caching. However, some models have chat templates that reject system messages appearing after the first position (e.g., Qwen3.5/3.6, #41114). Without mitigation, these models would return 400 errors for valid Anthropic requests.Fix
Auto-detect whether the chat template supports mid-conversation system messages by rendering a
[system, user, system, user]test conversation at init time. The detection covers three scenarios:loop.firstguard)→ inline system messages are merged into the top-level block
→ inline system messages are preserved in-place for optimal prefix caching
chat_template=None)→ conservative default: merge
No flag or configuration needed — the server adapts automatically to
the model's template capabilities.
Changes
2 files, +96 lines:
_detect_merge_inline_system()classmethod — renders a Jinja test conversation, caches result as class attribute_convert_system_message()— when merging, extracts inline system messages into the top-level block_convert_messages()— when merging, skips system messages (already handled by_convert_system_message)Behavior matrix
Related
Test Plan
(AI assistance was used; I reviewed every changed line.)