[Bench] Add BFCL dataset for vllm bench serve tool-calling workloads by laviier · Pull Request #42457 · vllm-project/vllm

laviier · 2026-05-12T20:48:21Z

Adds a BFCLDataset that lets vllm bench serve --backend openai-chat replay the Berkeley Function Calling Leaderboard, so users can measure serving latency/throughput on realistic tool-calling traffic. Complements the merged correctness harness in #36560; no code overlap. See the PR description for design details.

AI-assisted: drafted with Claude (Opus 4.7); author reviewed every line.

Purpose

Today there is no standardized way to measure serving latency/throughput on tool-calling workloads. Existing bench datasets (ShareGPT, sonnet, random, HF chat datasets) all produce plain-text turns, so they never exercise the tools/tool_choice path, the server-side tool parser, or structured decoding grammars. This PR adds a first-class BFCL dataset for vllm bench serve.
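To make the gap concrete, here is a minimal sketch of how a tool-calling chat request differs from a plain-text one. The field names tools and tool_choice follow the OpenAI chat completions format; the get_weather tool and the helper names are hypothetical and are not taken from this PR.

```python
import json


def plain_chat_payload(model: str, prompt: str) -> dict:
    """Plain-text chat request: no tools are attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def tool_chat_payload(model: str, prompt: str, tools: list[dict]) -> dict:
    """Same request, plus the fields that make it a tool-calling request."""
    payload = plain_chat_payload(model, prompt)
    payload["tools"] = tools
    payload["tool_choice"] = "auto"
    return payload


# Hypothetical tool definition in the OpenAI function-tool shape.
GET_WEATHER = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = tool_chat_payload("openai/gpt-oss-20b", "Weather in Paris?", [GET_WEATHER])
print(json.dumps(payload, indent=2))
```

Only the second payload carries tools and tool_choice, which is the part of the request that plain-text datasets never populate.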

Test Plan

Unit tests + End-to-End smoke tests

# Server
vllm serve openai/gpt-oss-20b --port 8000 \
  --enable-auto-tool-choice --tool-call-parser openai --reasoning-parser openai_gptoss

# Bench
vllm bench serve --model openai/gpt-oss-20b \
  --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path gorilla-llm/Berkeley-Function-Calling-Leaderboard \
  --bfcl-categories simple,live_simple,multiple \
  --num-warmups 5   --temperature 0   --percentile-metrics ttft,tpot,itl,e2el   \
  --max-concurrency 8 --num-prompts 500
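The bench command above draws 500 prompts from three categories. The code review summary in this thread describes the sampling as balanced round-robin across categories; the sketch below shows one way such sampling could work. It is an illustration under that assumption, not the PR's implementation: the function name is hypothetical, and it simply stops when every category is exhausted rather than oversampling.

```python
def round_robin_sample(by_category: dict[str, list], num_prompts: int) -> list[tuple]:
    """Take one row per category in turn until num_prompts rows are picked
    or every category is exhausted."""
    iterators = {name: iter(rows) for name, rows in by_category.items()}
    active = list(iterators)
    picked: list[tuple] = []
    while active and len(picked) < num_prompts:
        for name in list(active):
            if len(picked) >= num_prompts:
                break
            try:
                picked.append((name, next(iterators[name])))
            except StopIteration:
                # This category has no rows left; skip it from now on.
                active.remove(name)
    return picked


# Toy rows standing in for BFCL entries.
rows = {"simple": [1, 2, 3], "live_simple": [10], "multiple": [20, 21]}
print(round_robin_sample(rows, 5))
```

With this scheme a small category contributes all of its rows early, and larger categories fill the remainder.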

Test Result

Unit tests — 7/7 passing:
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_dataset_translates_schema_and_attaches_tools PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_dataset_requires_openai_chat_backend PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_dataset_missing_category_raises_clear_error PASSED
tests/benchmarks/test_bfcl_dataset.py::test_chat_backend_uses_messages_field_when_set PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_prompt_len_includes_tools PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_prompt_len_falls_back_when_tokenizer_rejects_tools PASSED
tests/benchmarks/test_bfcl_dataset.py::test_bfcl_schema_translation_is_recursive PASSED

End-to-end smoke against openai/gpt-oss-20b:

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  36.31     
Total input tokens:                      123779    
Total generated tokens:                  52072     
Request throughput (req/s):              13.77     
Output token throughput (tok/s):         1433.99   
Peak output token throughput (tok/s):    498.00    
Peak concurrent requests:                27.00     
Total token throughput (tok/s):          4842.68   
---------------Time to First Token----------------
Mean TTFT (ms):                          78.23     
Median TTFT (ms):                        71.23     
P99 TTFT (ms):                           171.09    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.93      
Median TPOT (ms):                        4.75      
P99 TPOT (ms):                           10.18     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.49     
Median ITL (ms):                         9.21      
P99 ITL (ms):                            146.46    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          571.74    
Median E2EL (ms):                        425.29    
P99 E2EL (ms):                           2252.42   
---------------Speculative Decoding---------------
Acceptance rate (%):                     29.83     
Acceptance length:                       3.09      
Drafts:                                  16716     
Draft tokens:                            117012    
Accepted tokens:                         34905     
Per-position acceptance (%):
  Position 0:                            71.90     
  Position 1:                            47.72     
  Position 2:                            33.15     
  Position 3:                            22.52     
  Position 4:                            16.49     
  Position 5:                            9.95      
  Position 6:                            7.08      
==================================================
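As a sanity check, the derived figures in the table can be recomputed from its raw counts. The sketch below assumes that acceptance length is defined as 1 + accepted tokens / drafts; that definition is an assumption, although it agrees with the printed values. Throughputs differ slightly in the last digits because the printed duration is rounded to two decimals.

```python
# Raw counts copied from the result table above.
successful_requests = 500
duration_s = 36.31
input_tokens = 123779
generated_tokens = 52072
drafts = 16716
draft_tokens = 117012
accepted_tokens = 34905
per_position_pct = [71.90, 47.72, 33.15, 22.52, 16.49, 9.95, 7.08]

request_throughput = successful_requests / duration_s
output_throughput = generated_tokens / duration_s
total_throughput = (input_tokens + generated_tokens) / duration_s

# Speculative decoding figures.
acceptance_rate_pct = 100 * accepted_tokens / draft_tokens
tokens_per_draft = draft_tokens / drafts
accepted_per_draft = accepted_tokens / drafts
acceptance_length = 1 + accepted_per_draft  # assumed definition
accepted_per_draft_from_positions = sum(per_position_pct) / 100

print(f"req/s             {request_throughput:.2f}")
print(f"output tok/s      {output_throughput:.2f}")
print(f"total tok/s       {total_throughput:.2f}")
print(f"acceptance rate   {acceptance_rate_pct:.2f}%")
print(f"tokens per draft  {tokens_per_draft:.1f}")
print(f"acceptance length {acceptance_length:.2f}")
```

The seven per-position rates sum to about 2.09 accepted tokens per draft, which matches accepted tokens divided by drafts, and 117012 / 16716 gives exactly 7 draft tokens per draft, matching the seven positions listed.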

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request introduces support for the Berkeley Function Calling Leaderboard (BFCL) dataset to the vLLM benchmarking suite. Key changes include the implementation of the BFCLDataset class, which manages data loading from Hugging Face, recursive translation of function schemas to OpenAI tool format, and balanced round-robin sampling across dataset categories. The benchmarking infrastructure was also updated to support pre-built chat messages and per-request overrides in SampleRequest and RequestFuncInput, enabling more accurate simulation of tool-calling scenarios. Feedback was provided regarding the silent suppression of exceptions during chat template application, recommending that errors be logged to facilitate debugging and prevent the masking of underlying tokenizer issues.
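The recursive schema translation mentioned in the summary can be illustrated with a short sketch. It assumes that BFCL function definitions use Python-flavored type names such as dict, float, and tuple, and that these map onto JSON Schema names as in TYPE_MAP below. The mapping table and both function names are hypothetical; the PR's actual code may differ.

```python
from typing import Any

# Assumed mapping from BFCL-style type names to JSON Schema type names.
TYPE_MAP = {
    "dict": "object",
    "float": "number",
    "tuple": "array",
    "list": "array",
    "bool": "boolean",
}


def translate_schema(node: Any) -> Any:
    """Recursively rewrite every string-valued "type" field via TYPE_MAP."""
    if isinstance(node, list):
        return [translate_schema(item) for item in node]
    if not isinstance(node, dict):
        return node
    out = {}
    for key, value in node.items():
        if key == "type" and isinstance(value, str):
            out[key] = TYPE_MAP.get(value, value)
        else:
            out[key] = translate_schema(value)
    return out


def to_openai_tool(fn: dict) -> dict:
    """Wrap a translated function definition in the OpenAI tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": fn["name"],
            "description": fn.get("description", ""),
            "parameters": translate_schema(fn.get("parameters", {})),
        },
    }


# Toy function definition in the assumed BFCL style.
bfcl_fn = {
    "name": "calc_area",
    "description": "Compute an area.",
    "parameters": {
        "type": "dict",
        "properties": {
            "base": {"type": "float"},
            "dims": {"type": "tuple", "items": {"type": "dict", "properties": {"w": {"type": "float"}}}},
        },
        "required": ["base"],
    },
}
print(to_openai_tool(bfcl_fn))
```

Recursion matters here because nested properties and array items carry their own type fields, which a top-level rewrite alone would miss.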

gemini-code-assist · 2026-05-12T20:51:07Z

+            except Exception:
+                rendered = None


Catching a generic Exception and silently setting rendered = None can hide important errors from tokenizer.apply_chat_template. If an unexpected error occurs, it will be suppressed, and the prompt length will be calculated using a fallback method. This can lead to inaccurate prompt length metrics and mask underlying issues in the tokenizer or chat template. It's better to log the exception to make debugging easier while maintaining robustness.

Suggested change

-            except Exception:
-                rendered = None
+            except Exception as e:
+                logger.warning("Failed to apply chat template with tools, falling back. Error: %s", e)
+                rendered = None
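For context, a self-contained sketch of the pattern the reviewer suggests: try to render the chat template with tools, log a warning on failure, and fall back to counting tokens over the message text alone. The function name and the fallback strategy are assumptions for illustration. The tokenizer is expected to expose apply_chat_template and encode in the style of Hugging Face tokenizers.

```python
import logging

logger = logging.getLogger(__name__)


def count_prompt_tokens(tokenizer, messages: list[dict], tools: list[dict]) -> int:
    """Count prompt tokens including tools, falling back to plain message text."""
    rendered = None
    try:
        rendered = tokenizer.apply_chat_template(
            messages, tools=tools, tokenize=False, add_generation_prompt=True
        )
    except Exception as e:
        # Log instead of swallowing, so template problems stay visible.
        logger.warning(
            "Failed to apply chat template with tools, falling back. Error: %s", e
        )
    if rendered is None:
        # Fallback: tools are not counted, so the length is an underestimate.
        rendered = "\n".join(str(m.get("content", "")) for m in messages)
    return len(tokenizer.encode(rendered))


class RejectingTokenizer:
    """Stub tokenizer whose chat template rejects the tools argument."""

    def apply_chat_template(self, messages, **kwargs):
        raise ValueError("template does not support tools")

    def encode(self, text):
        return text.split()


print(count_prompt_tokens(RejectingTokenizer(), [{"role": "user", "content": "a b c"}], []))
```

The warning keeps the benchmark running while leaving a trace that the reported prompt lengths exclude the tool schemas.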

mergify · 2026-05-26T11:10:00Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @laviier.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-29T01:19:54Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @laviier.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-29T23:58:07Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @laviier.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

chaunceyjiang

I ran into an issue. Could you help me take a look?
Error 8: Not Found

vllm serve /mnt/data4/models/Qwen/Qwen3.5-27B-FP8 --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3

vllm bench serve --model /mnt/data4/models/Qwen/Qwen3.5-27B-FP8 \
  --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path gorilla-llm/Berkeley-Function-Calling-Leaderboard \
  --bfcl-categories simple,live_simple,multiple \
  --num-warmups 5   --temperature 0   --percentile-metrics ttft,tpot,itl,e2el   \
  --max-concurrency 8 --num-prompts 500
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7fb9711a1760>, trust_remote_code=False, seed=0, num_prompts=500, dataset_name='hf', no_stream=False, dataset_path='gorilla-llm/Berkeley-Function-Calling-Leaderboard', no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, timed_trace_chunk_hash_size=16, timed_trace_sec_multiplier=1, timed_trace_label_timestamp='timestamp', timed_trace_label_input_length='input_length', timed_trace_label_output_length='output_length', timed_trace_label_hash_ids='hash_ids', blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=1024, random_output_len=128, random_range_ratio='0.0', random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, bfcl_categories=['simple', 'live_simple', 'multiple'], prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, speed_bench_dataset_subset='qualitative', speed_bench_output_len=4096, speed_bench_category=None, label=None, backend='openai-chat', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/chat/completions', header=None, max_concurrency=8, model='/mnt/data4/models/Qwen/Qwen3.5-27B-FP8', input_len=None, output_len=None, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=5, profile=False, 
save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, self_timed=None, percentile_metrics='ttft,tpot,itl,e2el', metric_percentiles='99', goodput=None, request_id_prefix='bench-36142bd8-', top_p=None, top_k=None, min_p=None, temperature=0.0, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds='25,50', plot_dataset_stats=False)
Starting initial single prompt test run...
Skipping endpoint ready check.
Warming up with 5 requests...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.37it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 158.35it/s]
Failed requests during benchmark run detected (capping to 10):
Error 0: Not Found
Error 1: Not Found
Error 2: Not Found
Error 3: Not Found
Error 4: Not Found
Error 5: Not Found
Error 6: Not Found
Error 7: Not Found
Error 8: Not Found
Error 9: Not Found
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     5         
Failed requests:                         495       
Maximum request concurrency:             8         
Benchmark duration (s):                  3.16      
Total input tokens:                      2841      
Total generated tokens:                  835       
Request throughput (req/s):              1.58      
Output token throughput (tok/s):         264.43    
Peak output token throughput (tok/s):    353.00    
Peak concurrent requests:                5.00      
Total token throughput (tok/s):          1164.14   
---------------Time to First Token----------------
Mean TTFT (ms):                          580.51    
Median TTFT (ms):                        635.43    
P99 TTFT (ms):                           635.93    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.81     
Median TPOT (ms):                        10.66     
P99 TPOT (ms):                           11.43     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.75     
Median ITL (ms):                         10.69     
P99 ITL (ms):                            133.66    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2386.45   
Median E2EL (ms):                        2376.93   
P99 E2EL (ms):                           3128.01   
==================================================

laviier · 2026-06-01T15:05:39Z

Thanks for trying it. I can't reproduce this locally: running the same bench flags (--bfcl-categories simple,live_simple,multiple --num-warmups 5 --temperature 0 --max-concurrency 8 --num-prompts 500) against a tool-parser-enabled Qwen/Qwen3-0.6B server gives 500/500 successful requests.

"Not Found" here is HTTP 404 from the streaming chat backend, and the likely place where vLLM's chat path returns 404 on a per-request basis is OpenAIServingEngine._check_model, which returns "The model X does not exist." when the request's model field doesn't match what the engine has registered. (Source: vllm/entrypoints/openai/engine/serving.py:241.) Only 5 of 500 requests succeed, and Successful requests: 5 matches --num-warmups 5, which suggests the warmup requests went through fine but the main run is hitting a model-id mismatch.

Could you grab three things so I can confirm?

1. Server side, while it's running: curl -s http://127.0.0.1:8000/v1/models | jq '.data[].id'
2. One bare-bones probe with the exact same model string the bench used:

  curl -i http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/mnt/data4/models/Qwen/Qwen3.5-27B-FP8",
        "messages":[{"role":"user","content":"hi"}],
        "max_completion_tokens": 8}'

If this returns 404, the problem is between server registration and bench --model (likely --served-model-name mismatch on the server, a stale tag, or something similar). If it returns 200, the problem is BFCL-specific and we should dig further.
3. First ~30 lines of the server log at the moment a "Not Found" request arrived (the engine logs every request route + error code).
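The first check above can also be scripted. The sketch below compares the ids returned by /v1/models against the model string the bench sends, assuming the OpenAI-style response shape with a top-level data list of objects that each carry an id. The helper names and the sample response are hypothetical.

```python
import json
import urllib.request


def fetch_models(base_url: str = "http://127.0.0.1:8000") -> dict:
    """GET /v1/models from a running server (requires network access)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return json.load(resp)


def served_model_ids(models_response: dict) -> list[str]:
    """Extract the registered model ids from a /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def check_bench_model(models_response: dict, bench_model: str) -> str:
    """Report whether the bench --model string is registered on the server."""
    ids = served_model_ids(models_response)
    if bench_model in ids:
        return f"ok: {bench_model!r} is registered"
    return f"mismatch: server registered {ids}, bench sent {bench_model!r}"


# Hypothetical response in which the server registered a different name
# than the local path passed to the bench.
sample = {"object": "list", "data": [{"id": "some-served-name"}]}
print(check_bench_model(sample, "/mnt/data4/models/Qwen/Qwen3.5-27B-FP8"))

# Against a live server one would call:
#   check_bench_model(fetch_models(), "/mnt/data4/models/Qwen/Qwen3.5-27B-FP8")
```

A mismatch report here would point at server registration rather than at anything BFCL-specific.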

One quick hypothesis to rule out: was the server started with any --served-model-name, or is that exact path printed by the /v1/models endpoint? If the server ended up registering a different name (the canonical Hugging Face id, for example), vllm bench serve --model would 404 on every request, except for the 5 warmup ones, which the warmup loop completes silently regardless of failure (asyncio.gather(*warmup_tasks) doesn't check output.success). Worth checking: did warmup actually produce useful output, or are the 5 successes you see unrelated to warmup?
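The point about asyncio.gather can be shown in isolation. When a failure is reported through a field on the returned object rather than through an exception, gather returns normally and the caller has to inspect each result. The Output class and fake_request coroutine below are hypothetical stand-ins, not vLLM's actual types.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Output:
    """Hypothetical stand-in for a per-request benchmark result."""
    success: bool
    error: str = ""


async def fake_request(i: int) -> Output:
    """Simulate a request that fails with a 404 but does not raise."""
    await asyncio.sleep(0)
    return Output(success=False, error="Not Found")


async def run_warmup(n: int) -> list[Output]:
    # gather only propagates exceptions; a failed Output is just a return value.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def summarize(outputs: list[Output]) -> tuple[int, int]:
    """Return (succeeded, failed) by checking each result explicitly."""
    failed = sum(1 for o in outputs if not o.success)
    return len(outputs) - failed, failed


outputs = asyncio.run(run_warmup(5))
print(summarize(outputs))
```

An explicit check like summarize is what would surface warmup failures that gather alone lets through.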

chaunceyjiang · 2026-06-03T02:58:55Z

OK, I'll run some more tests today.

chaunceyjiang

LGTM.

Nice work!!

Adds a BFCLDataset that lets `vllm bench serve --backend openai-chat` replay the Berkeley Function Calling Leaderboard, so users can measure serving latency/throughput on realistic tool-calling traffic. Complements the merged correctness harness in vllm-project#36560; no code overlap. See the PR description for design details. AI-assisted: drafted with Claude (Opus 4.7); author reviewed every line. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Li Zhang <lzhanga@amazon.com>

mergify · 2026-06-09T15:19:43Z

Documentation preview: https://vllm--42457.org.readthedocs.build/en/42457/

claude Bot reviewed May 12, 2026

mergify Bot added the performance Performance-related issues label May 12, 2026

gemini-code-assist Bot reviewed May 12, 2026

laviier force-pushed the bfcl_eval branch 2 times, most recently from 898136e to 8f3dd43 Compare May 12, 2026 20:53

jeejeelee requested review from chaunceyjiang and sfeng33 May 24, 2026 03:25

mergify Bot added the needs-rebase label May 26, 2026

laviier force-pushed the bfcl_eval branch from 8f3dd43 to a5edac2 Compare May 26, 2026 21:18

mergify Bot removed the needs-rebase label May 26, 2026

mergify Bot added the needs-rebase label May 29, 2026

chaunceyjiang self-assigned this May 29, 2026

laviier force-pushed the bfcl_eval branch from a5edac2 to 36df5a4 Compare May 29, 2026 11:26

mergify Bot removed the needs-rebase label May 29, 2026

mergify Bot added the needs-rebase label May 29, 2026

laviier force-pushed the bfcl_eval branch from 36df5a4 to 8f4420c Compare May 31, 2026 19:14

mergify Bot removed the needs-rebase label May 31, 2026

chaunceyjiang added the verified Run pre-commit for new contributors without triggering other tests label Jun 1, 2026

chaunceyjiang reviewed Jun 1, 2026

laviier force-pushed the bfcl_eval branch from 8f4420c to aabe1db Compare June 1, 2026 17:28

chaunceyjiang approved these changes Jun 9, 2026

chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 9, 2026

chaunceyjiang enabled auto-merge (squash) June 9, 2026 03:01

auto-merge was automatically disabled June 9, 2026 14:44
Head branch was pushed to by a user without write access

laviier force-pushed the bfcl_eval branch from d3758a2 to f2b76c1 Compare June 9, 2026 14:44

laviier force-pushed the bfcl_eval branch from f2b76c1 to c2c4e47 Compare June 9, 2026 15:16

mergify Bot added documentation Improvements or additions to documentation tool-calling labels Jun 9, 2026

github-project-automation Bot added this to Tool Calling Jun 9, 2026

chaunceyjiang added 2 commits June 10, 2026 10:12

Merge branch 'main' into bfcl_eval

896cf8a

Merge branch 'main' into bfcl_eval

5a31831

vllm-bot merged commit 89c6a41 into vllm-project:main Jun 10, 2026
45 of 47 checks passed

github-project-automation Bot moved this to Done in Tool Calling Jun 10, 2026

laviier deleted the bfcl_eval branch June 11, 2026 15:59

vllm-bot merged 3 commits into vllm-project:main from laviier:bfcl_eval (#42457)
