[Bugfix][Model Runner V2] Fix min_tokens off-by-one in the V2 GPU sampler #46243
Merged
…pler

The V2 GPU sampler suppressed stop tokens while pos < min_len, where pos is the position of the last existing token (current length minus one), so EOS was released at output index min_tokens + 1 instead of min_tokens. Compare the current length (pos + 1) against min_len so EOS becomes selectable at exactly min_tokens, matching the V1 MinTokensLogitsProcessor.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Hi @yewentao256, PTAL. No UT added :-)
njhill approved these changes Jun 20, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…pler (vllm-project#46243) Signed-off-by: Ting Sun <suntcrick@gmail.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…pler (vllm-project#46243) Signed-off-by: Ting Sun <suntcrick@gmail.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…pler (vllm-project#46243) Signed-off-by: Ting Sun <suntcrick@gmail.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
Purpose
min_tokens=N should let EOS through at output index N (the (N+1)-th token), as the V1 MinTokensLogitsProcessor does. The V2 GPU sampler releases it one step late, so min_tokens=N silently forces N+1 non-EOS tokens. This is the default path for mainstream archs (Llama, Qwen3, Mistral, ...).

The kernel in vllm/v1/worker/gpu/sample/logit_bias.py suppresses stop tokens while pos < min_len, but pos is the last token's position (current length minus one), so it stops suppressing one step late. The fix compares the current length (pos + 1) against min_len instead.

min_tokens=0 is untouched (already guarded by num_stop_token_ids > 0).

Test Plan
Force EOS via logit_bias so it is selected the instant it is unblocked, then compare the generated length against V1.

Test Result
RTX 4090, Qwen/Qwen3-0.6B, forced EOS, generated length per min_tokens: V2 with the fix matches V1 exactly; main is one token too long for every min_tokens >= 1.

AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.
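
For reviewers who want to check the index arithmetic without a GPU, here is a minimal pure-Python sketch of the forced-EOS experiment. It is not the Triton kernel or the actual test script. The helper name generated_length, the placeholder tokens, and the definition min_len = prompt_len + min_tokens are assumptions made for illustration. The min_tokens=0 guard (num_stop_token_ids > 0) is not modeled, so the sketch only covers min_tokens >= 1.

```python
EOS = "<eos>"      # placeholder stop token (hypothetical)
FILLER = "x"       # placeholder for any non-stop token (hypothetical)


def generated_length(min_tokens: int, prompt_len: int, use_fix: bool,
                     max_tokens: int = 64) -> int:
    """Count generated tokens (EOS included) when EOS is forced.

    "Forced" means EOS is chosen the moment the stop-token suppression
    lifts, mirroring the logit_bias trick in the test plan.
    """
    # Assumption: min_len is the total sequence length (prompt + output)
    # that must be reached before stop tokens are allowed.
    min_len = prompt_len + min_tokens
    total_len = prompt_len
    produced = 0
    while produced < max_tokens:
        pos = total_len - 1  # position of the last existing token
        if use_fix:
            suppress = (pos + 1) < min_len  # compare the current length
        else:
            suppress = pos < min_len        # comparison before the fix
        token = FILLER if suppress else EOS
        produced += 1
        total_len += 1
        if token == EOS:
            break
    return produced


if __name__ == "__main__":
    for n in range(1, 5):
        before = generated_length(n, prompt_len=5, use_fix=False)
        after = generated_length(n, prompt_len=5, use_fix=True)
        print(f"min_tokens={n}: before fix={before}, with fix={after}")
```

Under these assumptions, the fixed comparison emits EOS at output index min_tokens, while the original comparison emits it one step later, independent of prompt_len.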