[Bench] Add BFCL dataset for vllm bench serve tool-calling workloads#42457
Code Review
This pull request introduces support for the Berkeley Function Calling Leaderboard (BFCL) dataset to the vLLM benchmarking suite. Key changes include the implementation of the BFCLDataset class, which manages data loading from Hugging Face, recursive translation of function schemas to OpenAI tool format, and balanced round-robin sampling across dataset categories. The benchmarking infrastructure was also updated to support pre-built chat messages and per-request overrides in SampleRequest and RequestFuncInput, enabling more accurate simulation of tool-calling scenarios. Feedback was provided regarding the silent suppression of exceptions during chat template application, recommending that errors be logged to facilitate debugging and prevent the masking of underlying tokenizer issues.
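The balanced round-robin sampling mentioned in the summary can be pictured with a minimal sketch. This is not the PR's code: `round_robin_sample` and its arguments are hypothetical names, and the real `BFCLDataset` may handle oversampling or shuffling differently.

```python
from collections import deque
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def round_robin_sample(
    rows_by_category: Dict[str, Sequence[T]], num_samples: int
) -> List[T]:
    """Draw one row per category in turn (hypothetical helper).

    Stops when num_samples rows are drawn or every category is exhausted.
    """
    # Keep only non-empty categories, each with its own queue of rows.
    queues = deque(
        (name, deque(rows)) for name, rows in rows_by_category.items() if rows
    )
    picked: List[T] = []
    while queues and len(picked) < num_samples:
        name, rows = queues.popleft()
        picked.append(rows.popleft())
        if rows:
            # Category still has rows: send it to the back of the line.
            queues.append((name, rows))
    return picked
```

With categories of unequal size, the smaller ones simply drop out once exhausted, so no single category dominates the first part of the sample.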
except Exception:
    rendered = None
Catching a generic Exception and silently setting rendered = None can hide important errors from tokenizer.apply_chat_template. If an unexpected error occurs, it will be suppressed, and the prompt length will be calculated using a fallback method. This can lead to inaccurate prompt length metrics and mask underlying issues in the tokenizer or chat template. It's better to log the exception to make debugging easier while maintaining robustness.
Suggested change:
- except Exception:
-     rendered = None
+ except Exception as e:
+     logger.warning("Failed to apply chat template with tools, falling back. Error: %s", e)
+     rendered = None
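To show where the suggested logging sits in a prompt-length calculation, here is a rough sketch. The function name `estimate_prompt_len` and the text-only fallback are assumptions for illustration, not the PR's implementation; `apply_chat_template(..., tools=...)` and `encode` are the usual Hugging Face tokenizer calls.

```python
import logging
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)


def estimate_prompt_len(
    tokenizer: Any,
    messages: List[Dict[str, Any]],
    tools: List[Dict[str, Any]],
) -> int:
    """Count prompt tokens, including tools when the template accepts them."""
    rendered: Optional[str]
    try:
        rendered = tokenizer.apply_chat_template(
            messages, tools=tools, add_generation_prompt=True, tokenize=False
        )
    except Exception as e:
        # Log instead of swallowing, so template problems stay visible.
        logger.warning(
            "Failed to apply chat template with tools, falling back. Error: %s", e
        )
        rendered = None
    if rendered is None:
        # Assumed fallback: count message text only, so tools are not included.
        rendered = "\n".join(str(m.get("content", "")) for m in messages)
    return len(tokenizer.encode(rendered))
```

The fallback undercounts whenever tools are present, which is why a visible warning matters: without it, a skewed input-token metric would be hard to trace back to the template failure.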
898136e to 8f3dd43
|
This pull request has merge conflicts that must be resolved before it can be |
chaunceyjiang left a comment
I ran into an issue. Could you help me take a look?
Error 8: Not Found
vllm serve /mnt/data4/models/Qwen/Qwen3.5-27B-FP8 --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3
vllm bench serve --model /mnt/data4/models/Qwen/Qwen3.5-27B-FP8 \
--backend openai-chat --endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path gorilla-llm/Berkeley-Function-Calling-Leaderboard \
--bfcl-categories simple,live_simple,multiple \
--num-warmups 5 --temperature 0 --percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency 8 --num-prompts 500
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7fb9711a1760>, trust_remote_code=False, seed=0, num_prompts=500, dataset_name='hf', no_stream=False, dataset_path='gorilla-llm/Berkeley-Function-Calling-Leaderboard', no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, timed_trace_chunk_hash_size=16, timed_trace_sec_multiplier=1, timed_trace_label_timestamp='timestamp', timed_trace_label_input_length='input_length', timed_trace_label_output_length='output_length', timed_trace_label_hash_ids='hash_ids', blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio='0.0', random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, bfcl_categories=['simple', 'live_simple', 'multiple'], prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, speed_bench_dataset_subset='qualitative', speed_bench_output_len=4096, speed_bench_category=None, label=None, backend='openai-chat', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/chat/completions', header=None, max_concurrency=8, model='/mnt/data4/models/Qwen/Qwen3.5-27B-FP8', input_len=None, output_len=None, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=5, profile=False, 
save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, self_timed=None, percentile_metrics='ttft,tpot,itl,e2el', metric_percentiles='99', goodput=None, request_id_prefix='bench-36142bd8-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds='25,50', plot_dataset_stats=False)
Starting initial single prompt test run...
Skipping endpoint ready check.
Warming up with 5 requests...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 2.37it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 158.35it/s]
Failed requests during benchmark run detected (capping to 10):
Error 0: Not Found
Error 1: Not Found
Error 2: Not Found
Error 3: Not Found
Error 4: Not Found
Error 5: Not Found
Error 6: Not Found
Error 7: Not Found
Error 8: Not Found
Error 9: Not Found
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 5
Failed requests: 495
Maximum request concurrency: 8
Benchmark duration (s): 3.16
Total input tokens: 2841
Total generated tokens: 835
Request throughput (req/s): 1.58
Output token throughput (tok/s): 264.43
Peak output token throughput (tok/s): 353.00
Peak concurrent requests: 5.00
Total token throughput (tok/s): 1164.14
---------------Time to First Token----------------
Mean TTFT (ms): 580.51
Median TTFT (ms): 635.43
P99 TTFT (ms): 635.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 10.81
Median TPOT (ms): 10.66
P99 TPOT (ms): 11.43
---------------Inter-token Latency----------------
Mean ITL (ms): 14.75
Median ITL (ms): 10.69
P99 ITL (ms): 133.66
----------------End-to-end Latency----------------
Mean E2EL (ms): 2386.45
Median E2EL (ms): 2376.93
P99 E2EL (ms): 3128.01
==================================================
Thanks for trying it. I can't reproduce this locally with the same bench flags. "Not Found" here is an HTTP 404 from the streaming chat backend, and the possible place vLLM's chat path returns 404 on a per-request basis is … Could you grab three things so I can confirm?
If this returns 404, the problem is between server registration and bench --model (likely a --served-model-name mismatch on the server, a stale tag, or something similar). If it returns 200, the problem is BFCL-specific and we should dig further. One quick hypothesis to rule out: was the server started with any --served-model-name, or is that exact path printed by the /v1/models endpoint? If the server ended up registering a different name (a canonical Hugging Face id, for example), vllm bench serve --model would 404 every request, except for the 5 warmup ones, which the warmup loop completes silently regardless of failure (asyncio.gather(*warmup_tasks) doesn't check output.success). Worth checking: did warmup actually produce useful output, or are the 5 successes you see unrelated to warmup?
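For illustration, a warmup loop that surfaces failures instead of completing silently might look like the sketch below. `RequestOutput` and `run_warmup` are hypothetical stand-ins, not the benchmark's actual types; only `asyncio.gather` and the `success` field come from the discussion above.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, List, Sequence


@dataclass
class RequestOutput:
    """Hypothetical stand-in for the benchmark's per-request result."""

    success: bool
    error: str = ""


async def run_warmup(tasks: Sequence[Awaitable[RequestOutput]]) -> List[RequestOutput]:
    """Await all warmup requests and fail loudly if any did not succeed."""
    outputs = await asyncio.gather(*tasks)
    failed = [o for o in outputs if not o.success]
    if failed:
        raise RuntimeError(
            f"{len(failed)}/{len(outputs)} warmup requests failed; "
            f"first error: {failed[0].error}"
        )
    return list(outputs)
```

Raising is one option; logging a warning and continuing would also make a model-name mismatch visible before the main run starts.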
OK, I'll run some more tests today.
chaunceyjiang left a comment
LGTM.
Nice work!!
Head branch was pushed to by a user without write access
Adds a BFCLDataset that lets `vllm bench serve --backend openai-chat` replay the Berkeley Function Calling Leaderboard, so users can measure serving latency/throughput on realistic tool-calling traffic. Complements the merged correctness harness in vllm-project#36560; no code overlap. See the PR description for design details. AI-assisted: drafted with Claude (Opus 4.7); author reviewed every line. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Li Zhang <lzhanga@amazon.com>
Documentation preview: https://vllm--42457.org.readthedocs.build/en/42457/
…llm-project#42457) Signed-off-by: Li Zhang <lzhanga@amazon.com> Co-authored-by: Li Zhang <lzhanga@amazon.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
…llm-project#42457) Signed-off-by: Li Zhang <lzhanga@amazon.com> Co-authored-by: Li Zhang <lzhanga@amazon.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
…llm-project#42457) Signed-off-by: Li Zhang <lzhanga@amazon.com> Co-authored-by: Li Zhang <lzhanga@amazon.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
Adds a BFCLDataset that lets `vllm bench serve --backend openai-chat` replay the Berkeley Function Calling Leaderboard, so users can measure serving latency/throughput on realistic tool-calling traffic. Complements the merged correctness harness in #36560; no code overlap. See the PR description for design details.
AI-assisted: drafted with Claude (Opus 4.7); author reviewed every line.
Purpose
Today there is no standardized way to measure serving latency/throughput on tool-calling workloads. Existing bench datasets (ShareGPT, sonnet, random, HF chat datasets) all produce plain-text turns; they never exercise the `tools`/`tool_choice` path, the server-side tool parser, or structured decoding grammars. This PR adds a first-class BFCL dataset for `vllm bench serve`:
Test Plan
Unit tests + End-to-End smoke tests
Test Result
Unit tests — 7/7 passing:
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_dataset_translates_schema_and_attaches_tools PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_dataset_requires_openai_chat_backend PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_dataset_missing_category_raises_clear_error PASSED
tests/benchmarks/test_bfcl_dataset.py::test_chat_backend_uses_messages_field_when_set PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_prompt_len_includes_tools PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_prompt_len_falls_back_when_tokenizer_rejects_tools PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_schema_translation_is_recursive PASSED
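As a rough picture of what the schema-translation tests exercise, the sketch below recursively rewrites type names and wraps a function in the OpenAI tool envelope. The `TYPE_MAP` entries and both function names are assumptions for illustration; the PR's actual mapping and helpers may differ.

```python
from typing import Any, Dict

# Assumed mapping from BFCL-style type names to JSON Schema type names.
TYPE_MAP = {
    "dict": "object",
    "float": "number",
    "tuple": "array",
    "list": "array",
    "int": "integer",
    "bool": "boolean",
    "str": "string",
}


def translate_schema(node: Any) -> Any:
    """Return a copy of node with every string-valued "type" key remapped."""
    if isinstance(node, list):
        return [translate_schema(item) for item in node]
    if not isinstance(node, dict):
        return node
    out: Dict[str, Any] = {}
    for key, value in node.items():
        if key == "type" and isinstance(value, str):
            out[key] = TYPE_MAP.get(value, value)
        else:
            # Recurse into nested properties, items, and so on.
            out[key] = translate_schema(value)
    return out


def to_openai_tool(func: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a function description in the OpenAI `tools` entry format."""
    default_params = {"type": "object", "properties": {}}
    return {
        "type": "function",
        "function": {
            "name": func["name"],
            "description": func.get("description", ""),
            "parameters": translate_schema(func.get("parameters", default_params)),
        },
    }
```

The recursion matters because nested objects and array items carry their own `type` fields; translating only the top level would leave non-standard type names deeper in the schema.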
End-to-end smoke against openai/gpt-oss-20b: