[Bugfix] Fix Gemma4 MTP block_table batch_size mismatch under concurrent load #43982
… batch occupancy
The per-group block tables stored via set_per_group_block_table() retain the padded batch dimension (num_reqs_padded) from the target forward pass. When the drafter's common_attn_metadata is unpadded to num_reqs, the secondary group's block_table is swapped in without slicing, causing flash attention to fail with 'batch_size must be equal to batch_size_k' when num_reqs < num_reqs_padded.
Fix: slice per-group block tables to [:batch_size] in build_per_group_and_layer_attn_metadata.
Signed-off-by: Dmytro Kuntso <dkuntso@amazon.co.uk>
benchislett
left a comment
LGTM, seems reasonable.
@LucasWilkinson @TheEpicDolphin have we seen this before? Any idea why this took so long to pop up?
…ent load (vllm-project#43982) Signed-off-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Co-authored-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…ent load (vllm-project#43982) Signed-off-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Co-authored-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Signed-off-by: JisoLya <523420504@qq.com>
…ent load (vllm-project#43982) Signed-off-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Co-authored-by: Dmytro Kuntso <dkuntso@amazon.co.uk>
…ent load (vllm-project#43982) Signed-off-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Co-authored-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
…ent load (vllm-project#43982) Signed-off-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Co-authored-by: Dmytro Kuntso <dkuntso@amazon.co.uk> Signed-off-by: divineearthly <divineearthly@gmail.com>
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes.
Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252
Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 vllm-project#40727 vllm-project#40737 vllm-project#40749 vllm-project#40961 vllm-project#41119 vllm-project#41133 vllm-project#41233 vllm-project#41237 vllm-project#41411 vllm-project#41496 vllm-project#41549 vllm-project#41674 vllm-project#41873 vllm-project#41895 vllm-project#42040 vllm-project#42112 vllm-project#42289 vllm-project#42479 vllm-project#42585 vllm-project#42692 vllm-project#42706 vllm-project#42709 vllm-project#42739 vllm-project#42967 vllm-project#43001 vllm-project#43079 vllm-project#43125 vllm-project#43160 vllm-project#43616 vllm-project#43669 vllm-project#43719 vllm-project#43768 vllm-project#43808 vllm-project#43961 vllm-project#43982 vllm-project#43988 vllm-project#43998 vllm-project#44057 vllm-project#44560 vllm-project#44574 vllm-project#44568 vllm-project#44603 vllm-project#44744 vllm-project#45195 vllm-project#45345 vllm-project#45383 vllm-project#45487 vllm-project#45564 vllm-project#45673
Runner fix (2): vllm-project#44568 vllm-project#44603
Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU)
Conflict resolutions:
- Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560
- Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195
- Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982
Co-authored-by: GitHub Copilot
Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes. Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252 Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 Runner fix (2): vllm-project#44568 vllm-project#44603 Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU) Conflict resolutions: - Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560 - Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195 - Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982 Co-authored-by: GitHub Copilot Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
Purpose
Fix the 'RuntimeError: batch_size must be equal to batch_size_k' failure that occurs with Gemma4 + MTP + FlashAttention under concurrent load when the batch is partially occupied.
Gemma4Proposer.set_per_group_block_table() captures block tables with shape (num_reqs_padded, max_blocks) during _prepare_inputs. Later, spec_decode_common_attn_metadata is unpadded to num_reqs via .unpadded(), but the per-group block tables stored on the proposer instance are never sliced. When build_per_group_and_layer_attn_metadata swaps in the stored block table for secondary KV cache groups, FlashAttention sees page_table.size(0) == num_reqs_padded while cu_seqlens_q implies batch_size == num_reqs, triggering the assertion.
Fix: slice _per_group_block_tables[gid] to [:batch_size] in build_per_group_and_layer_attn_metadata.
Stack Trace
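The shape mismatch described under Purpose can be illustrated without a GPU. The following is a minimal sketch, not the vLLM code: nested lists stand in for tensors, and make_block_table and check_batch_sizes are hypothetical helpers that only mimic the relationship between cu_seqlens_q and the first dimension of the page table.

```python
# Plain-Python model of the shapes involved; nested lists stand in for tensors.
# All helper names here are hypothetical and exist only for illustration.

def make_block_table(num_rows, max_blocks):
    """Return a num_rows x max_blocks table of placeholder block ids."""
    return [[row * max_blocks + col for col in range(max_blocks)]
            for row in range(num_rows)]


def check_batch_sizes(cu_seqlens_q, block_table):
    """Mimic the kernel-side check: the number of page-table rows must match
    the batch size implied by the cumulative query lengths."""
    batch_size = len(cu_seqlens_q) - 1
    batch_size_k = len(block_table)
    if batch_size != batch_size_k:
        raise RuntimeError("batch_size must be equal to batch_size_k")
    return batch_size


num_reqs, num_reqs_padded, max_blocks = 5, 8, 4

# Captured during the target forward pass: padded batch dimension.
stored_block_table = make_block_table(num_reqs_padded, max_blocks)

# Unpadded drafter metadata: one query token per request, num_reqs + 1 offsets.
cu_seqlens_q = list(range(num_reqs + 1))

# Before the fix: the stored table is swapped in as-is and the check fails.
try:
    check_batch_sizes(cu_seqlens_q, stored_block_table)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# After the fix: slice to the unpadded batch size before swapping in.
batch_size = len(cu_seqlens_q) - 1
sliced_block_table = stored_block_table[:batch_size]
checked = check_batch_sizes(cu_seqlens_q, sliced_block_table)
```

Slicing keeps the first batch_size rows, which are the rows belonging to live requests; the trailing rows exist only because of padding.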
Test Plan
Server start command
Test script
test_gemma4_mtp_batch.py
Run test
Test Result
Before fix (FlashAttention backend)
Crashes with 'RuntimeError: batch_size must be equal to batch_size_k' at >4 RPS when num_reqs < num_reqs_padded (partial batch with CUDA graph padding).
Result
=== Phase 1: Warmup ===
[req 0] OK - 128 tokens in 0.67s
=== Phase 2: Concurrent load (concurrency=8) ===
--- Wave 1 ---
[req 1] OK - 80 tokens in 1.89s
[req 4] OK - 128 tokens in 1.81s
[req 2] OK - 128 tokens in 1.91s
[req 3] ERROR 500: {"error":{"message":"EngineCore encountered an issue
. See stack trace (above) for the root cause.","type":"InternalServerEr
ror","param":null,"code":500}}
[req 7] ERROR 500: {"error":{"message":"EngineCore encountered an issue
. See stack trace (above) for the root cause.","type":"InternalServerEr
ror","param":null,"code":500}}
[req 5] ERROR 500: {"error":{"message":"EngineCore encountered an issue
. See stack trace (above) for the root cause.","type":"InternalServerEr
ror","param":null,"code":500}}
[req 6] ERROR 500: {"error":{"message":"EngineCore encountered an issue
. See stack trace (above) for the root cause.","type":"InternalServerEr
ror","param":null,"code":500}}
[req 8] ERROR 500: {"error":{"message":"EngineCore encountered an issue
. See stack trace (above) for the root cause.","type":"InternalServerEr
ror","param":null,"code":500}}
FAILURES in wave 1: 5/8
--- Wave 2 ---
[req 9] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [M
ultiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0)
, [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 10] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 11] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 12] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 13] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 14] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 15] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 16] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
FAILURES in wave 2: 8/8
--- Wave 3 ---
[req 17] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 18] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 19] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 20] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 21] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 22] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 23] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 24] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
FAILURES in wave 3: 8/8
--- Wave 4 ---
[req 25] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 26] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 27] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 28] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 29] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 30] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
[req 31] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('::1', 8000, 0, 0
), [Errno 111] Connect call failed ('127.0.0.1', 8000)]
[req 32] EXCEPTION: Cannot connect to host localhost:8000 ssl:default [
Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8000
), [Errno 111] Connect call failed ('::1', 8000, 0, 0)]
FAILURES in wave 4: 8/8
=== Results: 3 OK, 29 FAILED ===
BUG REPRODUCED: batch_size mismatch likely triggered
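The failing condition stated above, num_reqs < num_reqs_padded, only arises when the number of live requests falls between padded batch sizes. The sketch below assumes that padding rounds the request count up to the next entry in a list of capture sizes; the CAPTURE_SIZES values and both helper names are hypothetical and are not taken from vLLM.

```python
import bisect

# Hypothetical capture sizes, chosen only for illustration.
CAPTURE_SIZES = [1, 2, 4, 8, 16]


def pad_num_reqs(num_reqs, capture_sizes=CAPTURE_SIZES):
    """Round num_reqs up to the next capture size.

    Counts larger than the biggest capture size are returned unpadded.
    """
    idx = bisect.bisect_left(capture_sizes, num_reqs)
    if idx == len(capture_sizes):
        return num_reqs
    return capture_sizes[idx]


def is_exposed(num_reqs):
    """True when padding adds rows, i.e. num_reqs < num_reqs_padded."""
    return num_reqs < pad_num_reqs(num_reqs)


# Request counts from 1 to 8 that would carry extra padded rows.
exposed = [n for n in range(1, 9) if is_exposed(n)]
```

Under these assumed sizes, a batch that exactly fills a capture size carries no padded rows, so the stored and unpadded block tables agree and the mismatch stays hidden.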
After fix (FlashAttention backend)
All 32 requests succeed across 4 waves of 8 concurrent requests with staggered timing:
Result
=== Phase 1: Warmup ===
[req 0] OK - 128 tokens in 0.35s
=== Phase 2: Concurrent load (concurrency=8) ===
--- Wave 1 ---
[req 1] OK - 72 tokens in 0.39s
[req 2] OK - 128 tokens in 0.36s
[req 3] OK - 128 tokens in 0.34s
[req 4] OK - 128 tokens in 0.36s
[req 5] OK - 128 tokens in 0.38s
[req 6] OK - 128 tokens in 0.37s
[req 7] OK - 128 tokens in 0.35s
[req 8] OK - 128 tokens in 0.34s
--- Wave 2 ---
[req 9] OK - 128 tokens in 0.36s
[req 10] OK - 128 tokens in 0.39s
[req 11] OK - 128 tokens in 0.41s
[req 12] OK - 128 tokens in 0.37s
[req 13] OK - 128 tokens in 0.37s
[req 14] OK - 128 tokens in 0.35s
[req 15] OK - 128 tokens in 0.33s
[req 16] OK - 128 tokens in 0.37s
--- Wave 3 ---
[req 18] OK - 128 tokens in 0.36s
[req 17] OK - 74 tokens in 0.41s
[req 19] OK - 128 tokens in 0.36s
[req 20] OK - 128 tokens in 0.33s
[req 21] OK - 128 tokens in 0.37s
[req 22] OK - 128 tokens in 0.35s
[req 23] OK - 128 tokens in 0.34s
[req 24] OK - 128 tokens in 0.34s
--- Wave 4 ---
[req 25] OK - 128 tokens in 0.36s
[req 26] OK - 128 tokens in 0.38s
[req 27] OK - 128 tokens in 0.41s
[req 28] OK - 128 tokens in 0.37s
[req 29] OK - 128 tokens in 0.37s
[req 30] OK - 128 tokens in 0.35s
[req 31] OK - 128 tokens in 0.33s
[req 32] OK - 128 tokens in 0.37s
=== Results: 32 OK, 0 FAILED ===
All requests succeeded - fix appears to work
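For readers who want to drive a similar load pattern, the following is a minimal sketch of a wave-based harness. It is not the PR's test_gemma4_mtp_batch.py, whose contents are not shown here: run_waves, fake_send, and the stagger_s parameter are hypothetical, and the sender is injected so the sketch runs without a server. A real sender would issue an HTTP request and return True on success.

```python
import asyncio


async def run_waves(send, num_waves=4, concurrency=8, stagger_s=0.0):
    """Run num_waves waves of `concurrency` requests each.

    `send` is an async callable taking a request id and returning True on
    success. Exceptions and non-True results count as failures.
    Returns a tuple (ok, failed).
    """
    async def delayed(req_id, delay_s):
        # Stagger request starts within a wave.
        await asyncio.sleep(delay_s)
        return await send(req_id)

    ok = failed = 0
    next_id = 1
    for _ in range(num_waves):
        tasks = [delayed(next_id + slot, slot * stagger_s)
                 for slot in range(concurrency)]
        next_id += concurrency
        # return_exceptions=True keeps one failure from cancelling the wave.
        results = await asyncio.gather(*tasks, return_exceptions=True)
        wave_ok = sum(1 for r in results if r is True)
        ok += wave_ok
        failed += len(results) - wave_ok
    return ok, failed


async def fake_send(req_id):
    """Stand-in sender that always succeeds (assumption for the sketch)."""
    await asyncio.sleep(0)
    return True


ok, failed = asyncio.run(run_waves(fake_send))
print(f"=== Results: {ok} OK, {failed} FAILED ===")
```

Collecting exceptions instead of raising lets later waves still run after an early failure, which matches the shape of the logs above, where connection errors are counted per wave.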
TritonAttention backend
Also tested with --attention-backend TRITON_ATTN. No impact: TritonAttention worked correctly both before and after the fix, since it does not perform the same batch_size == batch_size_k assertion on the page table dimension.