[Benchmark] Auto-detect and correct client/server tokenizer mismatch for random dataset#44708

Merged
DarkLight1337 merged 2 commits into
vllm-project:mainfrom
akii96:bench-tokenizer-mismatch-guard
Jun 8, 2026

Conversation

@akii96

@akii96 akii96 commented Jun 6, 2026

Contributor

Alternative to #42532. Addresses the same problem (input token inflation when bench-side and server-side tokenizers disagree) but takes a different approach based on reviewer feedback there:

  • Zero changes to dataset code (@DarkLight1337's main concern)
  • No new CLI flags
  • Catches any model/tokenizer version mismatch, not just the current DeepSeek-V3.2 case

After get_samples(), probes the server's /tokenize endpoint with the first prompt. If counts match, returns immediately. If not, re-aligns all prompts via /tokenize + /detokenize so server-side token counts are exact.
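
The probe-then-realign flow described above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: `realign_prompts`, `http_tokenizer`, and the truncate-or-cycle strategy are hypothetical, and the `/tokenize` and `/detokenize` request and response field names (`model`, `prompt`, `tokens`) are assumptions about the server's tokenizer API. The tokenizer is passed in as a pair of callables so the logic can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable

Tokenize = Callable[[str], list[int]]
Detokenize = Callable[[list[int]], str]


def _post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def http_tokenizer(base_url: str, model: str) -> tuple[Tokenize, Detokenize]:
    """Build server-backed callables (field names are assumptions)."""

    def tokenize(prompt: str) -> list[int]:
        body = {"model": model, "prompt": prompt}
        return _post_json(f"{base_url}/tokenize", body)["tokens"]

    def detokenize(tokens: list[int]) -> str:
        body = {"model": model, "tokens": tokens}
        return _post_json(f"{base_url}/detokenize", body)["prompt"]

    return tokenize, detokenize


def realign_prompts(
    prompts: list[str],
    expected_lens: list[int],
    tokenize: Tokenize,
    detokenize: Detokenize,
) -> list[str]:
    """Return prompts whose server-side token count matches expected_lens."""
    if not prompts:
        return prompts

    # Probe with the first prompt only; if the counts agree, do nothing.
    server_len = len(tokenize(prompts[0]))
    if server_len == expected_lens[0]:
        return prompts

    print(
        f"WARNING: tokenizer mismatch (server={server_len}, "
        f"expected={expected_lens[0]}), re-aligning prompts."
    )
    aligned = []
    for prompt, want in zip(prompts, expected_lens):
        ids = tokenize(prompt)
        if not ids:
            aligned.append(prompt)  # nothing to re-align
            continue
        # Truncate if too long, cycle through the ids if too short.
        ids = [ids[i % len(ids)] for i in range(want)]
        aligned.append(detokenize(ids))
    return aligned
```

Against a live server, the callables would come from something like `http_tokenizer("http://localhost:8000", "deepseek-ai/DeepSeek-V3.2")` (the URL is an assumption). Note that detokenizing and re-tokenizing is not guaranteed to round-trip for every tokenizer, so a real implementation may need to verify the final counts.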

Verified on MI355X

  • Model: deepseek-ai/DeepSeek-V3.2
  • transformers: 5.9.0
  • Image: vllm/vllm-openai-rocm:nightly-3f0a91bb96f8d72e0498b95c166e817deae14d62
  • Serve: VLLM_ROCM_USE_AITER=1 VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 vllm serve deepseek-ai/DeepSeek-V3.2 --tensor-parallel-size 8 --gpu-memory-utilization 0.85 --kv-cache-dtype fp8_e4m3 --block-size 64 --enable-expert-parallel --max_model_len 131072
  • Command: vllm bench serve --model deepseek-ai/DeepSeek-V3.2 --dataset-name random --num-prompts 10 --max-concurrency 4 --input-len 1000 --output-len 100 --random-range-ratio 0

Before this version of the fix

============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  7.22      
Total input tokens:                      46359     
Total generated tokens:                  1000      
Request throughput (req/s):              1.39      
Output token throughput (tok/s):         138.59    
Total token throughput (tok/s):          6563.52   
Mean TTFT (ms):                          918.93    
Mean TPOT (ms):                          15.99     
==================================================

After

WARNING: tokenizer mismatch (server=6082, expected=1000), re-aligning prompts.
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  5.20      
Total input tokens:                      10000     
Total generated tokens:                  1000      
Request throughput (req/s):              1.92      
Output token throughput (tok/s):         192.47    
Total token throughput (tok/s):          2117.16   
Mean TTFT (ms):                          406.75    
Mean TPOT (ms):                          13.88     
==================================================
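
To make the size of the distortion concrete, the figures from the two summaries above can be compared directly. This is only arithmetic on the reported numbers; the helper names are hypothetical, and the throughput recomputed from the rounded duration will differ slightly from the value the benchmark printed.

```python
# Figures copied from the "Before" and "After" benchmark summaries.
before = {"input_tokens": 46359, "output_tokens": 1000, "duration_s": 7.22}
after = {"input_tokens": 10000, "output_tokens": 1000, "duration_s": 5.20}

# --num-prompts 10 with --input-len 1000 and --random-range-ratio 0
requested_input = 10 * 1000


def inflation(run: dict) -> float:
    """Ratio of input tokens actually counted to input tokens requested."""
    return run["input_tokens"] / requested_input


def total_throughput(run: dict) -> float:
    """Total (input + output) tokens per second, from the rounded duration."""
    return (run["input_tokens"] + run["output_tokens"]) / run["duration_s"]


for name, run in (("before", before), ("after", after)):
    print(
        f"{name}: inflation={inflation(run):.2f}x, "
        f"total throughput ~{total_throughput(run):.0f} tok/s"
    )
```

In the "Before" run the inflated prompts also raise total token throughput, which is why comparing that metric across images with different tokenizer behaviour is misleading.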

Edit: Forgot to thank @frida-andersson for the initial digging into the issue and the pioneering work on this. Also, your review would be appreciated!

@mergify mergify Bot added the performance Performance-related issues label Jun 6, 2026
@akii96 akii96 marked this pull request as ready for review June 6, 2026 03:11

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@akii96

akii96 commented Jun 6, 2026

Contributor Author

@AndreasKaratzas @DarkLight1337 @tjtanaa would this be a more acceptable fix for the inflated token issue on DS32 ? (Not sure why but no reviewers on the PR ?)

If not I can close this immediately 😅 . I was just trying to solve this quickly now as other folks are now getting impacted by this when comparing perf across images. But I see this still as a bit of future proofing

@DarkLight1337

Member

cc @frida-andersson since you are the author of the original PR

@AndreasKaratzas

Member

> @AndreasKaratzas @DarkLight1337 @tjtanaa would this be a more acceptable fix for the inflated token issue on DS32 ? (Not sure why but no reviewers on the PR ?)
>
> If not I can close this immediately 😅 . I was just trying to solve this quickly now as other folks are now getting impacted by this when comparing perf across images. But I see this still as a bit of future proofing

Looks good. I'm not the best guy to review this. At the same time I don't completely understand why tokenizers could disagree (hence seems a bit like masking the issue), but again I'm not sure I'm the best guy to review this so it might actually be the way to go here. I checked the other PR very briefly too and there seems not to be a good explanation of why this can happen. I also saw @DarkLight1337 actually commenting the same thing there.

@akii96

akii96 commented Jun 6, 2026

Contributor Author

@AndreasKaratzas The main answer to this:

> I don't completely understand why tokenizers could disagree

For DeepSeek-V3.2, transformers >= 5.0 doesn't have native support yet (huggingface/transformers#41251), so it silently falls back to the wrong tokenizer. The server is loading the right one.

@frida-andersson

frida-andersson commented Jun 8, 2026

Contributor

Thanks for picking this up @akii96 ! This approach is clean and the "zero changes to dataset code" property is a real win. I only have one minor comment - if /tokenize returns a 503/404 or the endpoint isn't available (non-vLLM backends), except Exception: return input_requests proceeds with wrong counts and no warning. Suggestion: add a print("WARNING: /tokenize unavailable, skipping alignment.") in the except block. Closing my PR
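
The suggested fallback can be sketched as a small probe helper. This is an illustration under stated assumptions, not the merged code: `probe_token_count` is a hypothetical name, the `tokenize` callable stands in for whatever function calls the server's `/tokenize` endpoint, and appending the exception to the warning is an addition to the message suggested above.

```python
from typing import Callable, Optional


def probe_token_count(
    tokenize: Callable[[str], list[int]], prompt: str
) -> Optional[int]:
    """Return the server-side token count, or None if the probe fails."""
    try:
        return len(tokenize(prompt))
    except Exception as exc:  # e.g. 404/503, or a backend without /tokenize
        # Warn instead of failing silently, so skewed counts are noticed.
        print(f"WARNING: /tokenize unavailable ({exc!r}), skipping alignment.")
        return None
```

A caller would return `input_requests` unchanged when the helper returns `None`, so benchmarks against backends without the endpoint still run, but with a visible warning.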

…for random dataset

Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: Frida Andersson <fanderss@amd.com>
@akii96 akii96 force-pushed the bench-tokenizer-mismatch-guard branch from 816171d to f9ac854 Compare June 8, 2026 09:16
@akii96

akii96 commented Jun 8, 2026

Contributor Author

Thanks @frida-andersson 🙏

@DarkLight1337 addressed Frida's nit (warning on /tokenize unavailable)
This should be ready for review when you get a chance! (maybe a ready label could be added too)

@DarkLight1337 DarkLight1337 left a comment

Member

This looks much simpler, thanks!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 8, 2026 10:22
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 8, 2026
@DarkLight1337 DarkLight1337 merged commit ac3409d into vllm-project:main Jun 8, 2026
34 checks passed
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026
…for random dataset (vllm-project#44708)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…for random dataset (vllm-project#44708)

Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…for random dataset (vllm-project#44708)

Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026

Labels

performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed

4 participants