
[Bugfix][Model Runner V2] Fix min_tokens off-by-one in the V2 GPU sampler#46243

Merged
njhill merged 2 commits into vllm-project:main from Sunt-ing:samp-2
Jun 21, 2026


@Sunt-ing

Contributor

Purpose

min_tokens=N should let EOS through at output index N (the N+1-th token), as the V1 MinTokensLogitsProcessor does. The V2 GPU sampler releases it one step late, so min_tokens=N silently forces N+1 non-EOS tokens. This is the default path for mainstream archs (Llama, Qwen3, Mistral, ...).

The kernel in vllm/v1/worker/gpu/sample/logit_bias.py suppresses stop tokens while pos < min_len, but pos is the last token's position (current length minus one), so it stops one step late. Compare the current length instead:

if num_stop_token_ids > 0 and pos + 1 < min_len:

min_tokens=0 is untouched (already guarded by num_stop_token_ids > 0).
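The off-by-one can be reproduced without a GPU by modelling only the guard condition. The sketch below is a pure-Python illustration, not the actual kernel: `generated_length`, `prompt_len`, and the relation `min_len = prompt_len + min_tokens` are assumptions made for the example, as is the idea that the stop-token list is empty when min_tokens is 0. It assumes EOS is chosen the moment it is no longer suppressed.

```python
def generated_length(min_tokens: int, fixed: bool,
                     prompt_len: int = 3, max_tokens: int = 32) -> int:
    """Return the output length when EOS is picked as soon as it is allowed."""
    # Assumption: min_len counts the prompt plus min_tokens.
    min_len = prompt_len + min_tokens
    # Assumption: no stop tokens are registered when min_tokens == 0.
    num_stop_token_ids = 1 if min_tokens > 0 else 0

    for num_generated in range(max_tokens):
        cur_len = prompt_len + num_generated
        pos = cur_len - 1  # position of the last existing token
        # Old guard compares pos; fixed guard compares the current length.
        lhs = pos + 1 if fixed else pos
        suppressed = num_stop_token_ids > 0 and lhs < min_len
        if not suppressed:
            # EOS is sampled at this step and counts as an output token.
            return num_generated + 1
    return max_tokens


for n in (0, 1, 2, 4):
    print(n, generated_length(n, fixed=True), generated_length(n, fixed=False))
```

Under these assumptions the fixed guard yields min_tokens + 1 output tokens and the old guard yields min_tokens + 2 for every min_tokens >= 1, regardless of the prompt length chosen.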

Test Plan

Force EOS via logit_bias so it is selected the instant it is unblocked, then compare the generated length against V1.

from vllm import LLM, SamplingParams

llm = LLM("Qwen/Qwen3-0.6B", enforce_eager=True)
eos = llm.get_tokenizer().eos_token_id
out = llm.generate(
    "Hello",
    SamplingParams(temperature=0, min_tokens=4, max_tokens=32, logit_bias={eos: 100.0}),
)
print(len(out[0].outputs[0].token_ids))  # main: 6 (min_tokens + 2); fixed: 5 (min_tokens + 1)
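Why a large EOS bias makes the repro deterministic can be shown with a toy greedy step. This is a minimal sketch under stated assumptions, not vLLM code: `greedy_pick` is a hypothetical helper, and it assumes suppression overwrites the EOS logit with negative infinity after the bias has been added.

```python
import math


def greedy_pick(logits, eos_id, eos_bias, eos_suppressed):
    """Pick the argmax token after applying an EOS bias and optional suppression."""
    biased = list(logits)
    biased[eos_id] += eos_bias          # logit_bias={eos: 100.0} in the repro
    if eos_suppressed:
        biased[eos_id] = -math.inf      # assumed min_tokens masking behaviour
    return max(range(len(biased)), key=biased.__getitem__)


logits = [2.0, 5.0, -3.0]  # toy vocabulary of three tokens; id 2 plays the role of EOS
print(greedy_pick(logits, eos_id=2, eos_bias=100.0, eos_suppressed=True))
print(greedy_pick(logits, eos_id=2, eos_bias=100.0, eos_suppressed=False))
```

With temperature=0 the biased EOS wins on the first step where it is not masked, so the generated length directly exposes the step at which suppression ends.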

Test Result

RTX 4090, Qwen/Qwen3-0.6B, forced EOS, generated length per min_tokens:

| min_tokens | V1 reference | V2 + fix | V2 (main) |
| --- | --- | --- | --- |
| 0 | 1 | 1 | 1 |
| 1 | 2 | 2 | 3 |
| 2 | 3 | 3 | 4 |
| 4 | 5 | 5 | 6 |

V2 with the fix matches V1 exactly; main generates one token too many for every min_tokens >= 1.

AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.

…pler

The V2 GPU sampler suppressed stop tokens while pos < min_len, where pos is
the position of the last existing token (current length minus one), so EOS was
released at output index min_tokens + 1 instead of min_tokens. Compare the
current length (pos + 1) against min_len so EOS becomes selectable at exactly
min_tokens, matching the V1 MinTokensLogitsProcessor.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
@mergify mergify Bot added the v1 and bug labels Jun 20, 2026
@Sunt-ing

Contributor Author

Hi @yewentao256, PTAL. No UT added :-)

@njhill njhill left a comment

Member


Thanks @Sunt-ing, good catch

@njhill njhill added the ready label Jun 20, 2026
@njhill njhill enabled auto-merge (squash) June 20, 2026 19:12
@njhill njhill merged commit 183a430 into vllm-project:main Jun 21, 2026
79 checks passed
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…pler (vllm-project#46243)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>

Labels

bug, ready, v1

2 participants